WO2007041238A9 - Methods of identification and use of gene signatures - Google Patents

Methods of identification and use of gene signatures

Info

Publication number
WO2007041238A9
Authority
WO
WIPO (PCT)
Prior art keywords
gene
genes
subset
analysis
expression
Prior art date
Application number
PCT/US2006/037916
Other languages
French (fr)
Other versions
WO2007041238A2 (en)
WO2007041238A3 (en)
Inventor
Guennadi Victor Glinskii
Original Assignee
Stratagene California
Guennadi Victor Glinskii
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stratagene California, Guennadi Victor Glinskii
Publication of WO2007041238A2
Publication of WO2007041238A9
Publication of WO2007041238A3

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/118Prognosis of disease development
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/158Expression markers
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention relates to methods of identifying a set of genes that can predict a phenotype, and the use of these gene sets for predicting a phenotype of interest.
  • a mouse/human comparative translational genomics approach was utilized to identify an 11-gene signature distinguishing stem cells with normal self-renewal function from stem cells with drastically diminished self-renewal ability due to the loss of the BMI-1 gene; this signature was then used to interrogate and interpret expression patterns of human cancers (Glinsky et al., 2005, supra).
  • the 11-gene signature consistently displays a normal stem cell-like expression profile in distant metastatic lesions, as revealed by the analysis of metastases and primary tumors from a transgenic mouse model of prostate cancer and from cancer patients.
  • the prognostic power of the 11-gene signature was examined in several independent therapy outcome sets of clinical samples obtained from 1153 cancer patients diagnosed with multiple types of cancer, including five epithelial (prostate; breast; lung; ovarian; and bladder cancers) and five non-epithelial (lymphoma; mesothelioma; medulloblastoma; glioma; and acute myeloid leukemia, AML) malignancies.
  • the invention provides for a method of generating a subset of genes for use in predicting a phenotype in a subject.
  • the method comprises the steps of obtaining a set of expression values for a set of genes in a first sample and a second sample by measuring the level of expression in the two samples.
  • a set of genes that are differentially expressed are identified by comparing the level of expression in the first sample with the level of expression in the second sample.
  • An expression value that is increased or decreased in the first sample, as compared to the second sample, is differentially expressed.
  • a subset of genes for use in predicting a phenotype in a subject, wherein the subset is equal to or smaller than the set, is then identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed.
  • the method further comprises the step of obtaining a relative weight coefficient for each member of the gene set.
  • the method further comprises the steps of obtaining a relative weight coefficient for each member of the gene set; and multiplying the expression value by the relative weight coefficient to obtain an individual survival score for each member of the gene set.
  • the sum of the individual survival scores is calculated to obtain a survival score.
  • the method includes the step of logarithmically transforming the expression value of each member of the gene set prior to performing the multivariate Cox analysis.
  • the method comprises the steps of logarithmically transforming the expression value of each member of the gene set; obtaining a relative weight coefficient for each member of the gene set; and multiplying the logarithmically transformed expression value by the relative weight coefficient to obtain an individual survival score for each member of the gene set. The sum of the individual survival scores is calculated to obtain a survival score.
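The scoring steps described above (log-transform each expression value, multiply by the gene's relative weight coefficient, and sum) can be sketched as follows. This is a minimal illustration only: the function name, the gene names, and all numeric values are hypothetical and do not come from the source, and the base-10 logarithm is assumed as the transformation.

```python
import math


def survival_score(expression, weights, log_transform=True):
    """Return (survival score, individual survival scores) for one sample.

    expression: mapping gene -> expression value (must be positive if log-transformed)
    weights:    mapping gene -> relative weight coefficient (e.g. from a Cox fit)
    """
    individual = {}
    for gene, weight in weights.items():
        value = expression[gene]
        if log_transform:
            value = math.log10(value)  # assumed log10 transformation
        individual[gene] = weight * value  # individual survival score
    # The survival score is the sum of the individual survival scores.
    return sum(individual.values()), individual


# Hypothetical values, for illustration only.
expression = {"GENE_A": 100.0, "GENE_B": 10.0, "GENE_C": 1000.0}
weights = {"GENE_A": 0.5, "GENE_B": -1.0, "GENE_C": 0.25}
score, parts = survival_score(expression, weights)
print(score, parts)
```

Passing `log_transform=False` gives the variant of the method in which raw expression values are multiplied by the weights.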
  • the invention also provides for a method of generating a subset of genes for use in predicting a phenotype in a subject comprising the following steps.
  • a set of expression values for a set of genes in a first sample and a second sample is obtained by measuring the level of expression in the first sample and the second sample.
  • a set of genes that are differentially expressed is identified by comparing the level of expression in the first sample with the level of expression in the second sample. An expression value that is increased or decreased in the first sample, as compared to the second sample is differentially expressed.
  • a subset of genes for use in predicting a phenotype in a subject is identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed.
  • a relative weight coefficient is obtained for each member of the gene set.
  • the expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes.
  • the sum of the individual survival scores is calculated to obtain a survival score. Survival analysis is performed.
  • the method may comprise the step of logarithmically transforming the expression value of each member of the gene set prior to performing the multivariate Cox analysis.
  • the method of identifying a subset of genes comprises the additional steps of: identifying genes with a p value as determined by multivariate Cox analysis that is less than or equal to 0.25; obtaining a relative weight coefficient for each member of the gene set, multiplying the expression value of each member of the gene set by the relative weight coefficient to obtain an individual survival score for each member of the set of genes; calculating the sum of the individual survival scores to obtain a survival score; and performing survival analysis.
  • the steps of identifying a set of genes that are differentially expressed, identifying a subset of genes for use in predicting a phenotype by performing multivariate Cox analysis, obtaining a relative weight coefficient for each member of the gene set, obtaining an individual survival score for each member of the gene set and obtaining a survival score are repeated.
  • the method further comprises the following steps. Genes with a p value as determined by the multivariate Cox analysis that is less than or equal to 0.25 are identified. Multivariate Cox analysis is performed on the set of genes identified, wherein a gene with a p-value that is less than or equal to 0.1 is included in the subset. A relative weight coefficient for each member of the gene set is obtained. The expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes. The sum of the individual survival scores is calculated to obtain a survival score. Survival analysis is performed.
  • the steps of identifying genes with a p value that is less than or equal to 0.1; obtaining a relative weight coefficient for each member of the gene set, obtaining an individual survival score for each member of the gene set; obtaining a survival score and performing survival analysis are repeated.
  • a gene with a p-value, as determined by multivariate Cox analysis that is less than or equal to 0.25, that is less than or equal to 0.1, that is less than or equal to 0.075 or that is less than or equal to 0.05 is included in the subset.
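The staged p-value filtering described above (keep genes with p ≤ 0.25, refit, then keep genes with p ≤ 0.1) can be sketched as a loop over thresholds. This is a sketch under stated assumptions: `fit_cox` is a hypothetical caller-supplied function that runs a multivariate Cox analysis on the given genes and returns a p-value per gene; the stand-in `fake_fit` below returns made-up p-values purely so the example runs.

```python
def select_subset(genes, fit_cox, thresholds=(0.25, 0.1)):
    """Refit before each pass and keep genes whose p-value is <= the threshold.

    fit_cox: hypothetical callable taking a list of genes and returning a
             mapping gene -> p-value from a multivariate Cox analysis.
    """
    selected = list(genes)
    for threshold in thresholds:
        if not selected:
            break
        p_values = fit_cox(selected)
        selected = [g for g in selected if p_values[g] <= threshold]
    return selected


def fake_fit(selected):
    """Stand-in for a real Cox fit; the p-values are invented for illustration."""
    first_pass = {"G1": 0.02, "G2": 0.20, "G3": 0.40, "G4": 0.09}
    second_pass = {"G1": 0.01, "G2": 0.15, "G4": 0.08}
    return first_pass if len(selected) == 4 else second_pass


subset = select_subset(["G1", "G2", "G3", "G4"], fake_fit)
print(subset)
```

Stricter cutoffs mentioned in the text, such as 0.075 or 0.05, can be supplied through the `thresholds` argument.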
  • the method may include the step of performing survival analysis, for example, Kaplan-Meier analysis.
  • a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, that is less than or equal to 0.075 or that is less than or equal to 0.05 is included in the subset.
  • the method can be performed with any of the sets of genes identified in Figures 3-7 and Table 3.
  • the subset of genes includes at least one gene of any of the subsets identified in Figures 3-7 and Table 3.
  • the invention provides for a method of using a subset of genes to predict a phenotype in a subject comprising the following steps. A sample is isolated from a subject; and analyzed for expression of the subset of genes.
  • the phenotype is selected from the group consisting of disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
  • the subset of genes is any one of the sets or subsets identified in Figures 3-7 and Table 3.
  • the invention also provides a method of determining the relevance of a set of genes comprising the following steps.
  • a set of expression values for a set of genes in a first sample and a second sample is obtained by measuring the level of expression in the first sample and the second sample.
  • a set of genes that are differentially expressed is identified by comparing the level of expression in the first sample with the level of expression in the second sample, wherein an expression value that is increased or decreased in the first sample, as compared to the second sample is differentially expressed.
  • a subset of genes for use in predicting a phenotype in a subject, wherein the subset is equal to or smaller than the set is identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed.
  • a relative weight coefficient is obtained for each member of the gene set.
  • the expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes.
  • the sum of the individual survival scores is calculated to obtain a survival score; and survival analysis is performed.
  • the invention also provides for a subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
  • the invention also provides for a subset of genes for use in predicting a phenotype of a subject comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
  • the invention also provides for a subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3, wherein the subset of genes is generated by the methods described herein.
  • the invention also provides for a composition comprising a set of probes that hybridize to at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
  • the invention also provides for a subset of genes generated by the methods described herein.
  • the invention also provides for a combination of gene subsets, including a combination comprising at least two of the subsets presented in Figures 3-7 and Table 3.
  • each subset of the combination comprises at least one gene of any of the subsets identified in Figures 3-7 and Table 3.
  • a kit comprises at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
  • a kit comprises a set of reagents for detecting the expression of at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
  • the kit comprises a set of probes that hybridize to at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
  • any one of the kits of the invention predicts the phenotype of a subject.
  • a "set of genes” refers to a group of genes.
  • a “set of genes” according to the invention can be identified by any method now known or later developed to assess gene expression, including but not limited to measurements relating to the biological processes of nucleic acid amplification, transcription, RNA splicing, and translation.
  • a "set of genes” refers to a group of genes that are differentially expressed in a first sample as compared to a second sample.
  • a "set of genes” refers to at least one gene, for example, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more genes.
  • a "set" refers to at least one.
  • differentially expressed refers to the existence of a difference in the expression level of a nucleic acid or protein as compared between two sample classes, for example a first sample and a second sample as defined herein. Differences in the expression levels of "differentially expressed” genes preferably are statistically significant. Preferably, there is a 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000-fold or more) increase or decrease in the expression levels of differentially expressed nucleic acid or protein.
  • there is at least a 5% (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) increase or decrease in the expression levels of differentially expressed nucleic acid or protein.
  • expression refers to any one of RNA, cDNA, DNA, or protein expression.
  • “Expression values” refer to the amount or level of expression of a nucleic acid or protein according to the invention. Expression values are measured by any method known in the art and described herein. As used herein, “increased” refers to 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000-fold or more) greater than. “Increased” also refers to at least 5% or more (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) greater than. As used herein, “decreased” refers to 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9,
  • Decreased also refers to at least 5% or more (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) less than.
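The fold-change and percent-change criteria for "increased", "decreased", and "differentially expressed" can be illustrated with a small helper. This is a minimal sketch: the function names are hypothetical, the 2-fold default reflects the preferred cutoff stated in the text, and statistical significance testing is not covered.

```python
def is_differentially_expressed(first, second, min_fold=2.0):
    """True if `first` is at least `min_fold` times higher or lower than `second`."""
    if first <= 0 or second <= 0:
        raise ValueError("expression values must be positive")
    ratio = first / second
    return ratio >= min_fold or ratio <= 1.0 / min_fold


def percent_change(first, second):
    """Signed percent change of `first` relative to `second`."""
    if second == 0:
        raise ValueError("second expression value must be non-zero")
    return 100.0 * (first - second) / second


print(is_differentially_expressed(8.0, 2.0))  # 4-fold increase
print(is_differentially_expressed(3.0, 2.0))  # 1.5-fold, below the 2-fold cutoff
print(percent_change(3.0, 2.0))               # percent increase
```

Under the alternative 5% criterion, the second pair would count as increased, since its percent change exceeds 5%.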
  • a "subset of genes” refers to at least one gene of a "set of genes” as defined herein.
  • a subset of genes is predictive of a particular phenotype, for example, disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
  • predictive means that a set of genes or a subset of genes according to the invention is indicative of a particular phenotype of interest (for example disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure).
  • a subset of genes, according to the invention that is "predictive" of a particular phenotype correlates with a particular phenotype at least 10% or more, for example 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, 51, 52, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99 or 100%.
  • a "phenotype" refers to any detectable characteristic of an organism.
  • a "phenotype" refers to disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
  • diagnosis refers to a process of determining if an individual is afflicted with a disease or ailment.
  • Prognosis refers to a prediction of the probable occurrence and/or progression of a disease or ailment, as well as the likelihood of recovery from a disease or ailment, or the likelihood of ameliorating symptoms of a disease or ailment or the likelihood of reversing the effects of a disease or ailment. "Prognosis” is determined by monitoring the response of a patient to therapy.
  • first sample refers to a sample from a normal subject or individual, or a normal cell line.
  • an “individual” or “subject” includes a mammal, for example, human, mouse, rat, dog, cow, pig, sheep, etc.
  • a “subject” includes both a patient and a normal individual.
  • patient refers to a mammal who is diagnosed with a disease or ailment.
  • normal refers to an individual who has not shown any disease or ailment symptoms or has not been diagnosed by a medical doctor.
  • a “second sample” refers to a sample from a patient or an unclassified individual, or an animal model for a disease of interest.
  • a “second sample” also refers to a sample from a cell line that is a model for a disease of interest, for example a tumor cell line.
  • Tumor is to be construed broadly to refer to any and all types of solid and diffuse malignant neoplasias including but not limited to sarcomas, carcinomas, leukemias, lymphomas, etc., and includes by way of example, but not limitation, tumors found within prostate, breast, colon, lung, and ovarian tissues.
  • a “tumor cell line” refers to a transformed cell line derived from a tumor sample. Usually, a “tumor cell line” is capable of generating a tumor upon explant into an appropriate host.
  • a “tumor cell line” usually retains, in vitro, properties in common with the tumor from which it is derived, including, e.g., loss of differentiation or loss of contact inhibition, and will undergo essentially unlimited cell divisions in vitro.
  • control cell line refers to a non-transformed, usually primary culture of a normally differentiated cell type.
  • it is preferable to use a “control cell line” and a “tumor cell line” that are related with respect to the tissue of origin, to improve the likelihood that observed gene expression differences, or differences in RNA or protein levels, are related to gene expression changes underlying the transformation from control cell to tumor.
  • An “unclassified sample” refers to a sample for which classification is obtained by applying the methods of the present invention.
  • An “unclassified sample” may be one that has been classified previously using the methods of the present invention, or through the use of other molecular biological or pathohistological analyses. Alternatively, an “unclassified sample” may be one on which no classification has been carried out prior to the use of the sample for classification by the methods of the present invention.
  • the fold expression change or differential expression data are logarithmically transformed.
  • logarithmically transformed means, for example, log10 transformed.
  • multivariate analysis refers to any method of determining the incremental, statistical power of the members of a set of genes to predict a phenotype of interest.
  • Methods of "multivariate analysis” useful according to the invention include but are not limited to multivariate Cox analysis.
  • multivariate Cox analysis refers to Cox proportional hazard survival regression analysis as performed by using the program presented at the world wide web at http://members.aol.com/johnp71/prophaz.html, and as described in Glinsky et al., 2005, J. Clin. Investig. 115:1503.
  • “survival analysis” refers to a method of verifying that a set of genes or a subset of genes according to the invention is “predictive”, as defined herein, of a particular phenotype of interest. “Survival analysis” takes the survival times of a group of subjects (usually with some kind of medical condition) and generates a survival curve, which shows how many of the members remain alive over time. Survival time is usually defined as the length of the interval between diagnosis and death, although other "start” events (such as surgery instead of diagnosis), and other "end” events (such as recurrence instead of death) are sometimes used.
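The survival curve described above can be estimated with the Kaplan-Meier product-limit method, which the text names elsewhere as an example of survival analysis. The sketch below is a plain-Python illustration, not the software used in the source; the function name and the sample data are hypothetical. At each time with at least one observed end event, the running survival probability is multiplied by (1 - events / subjects still at risk); censored subjects leave the risk set without lowering the curve.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    times:  follow-up time for each subject
    events: True if the end event (e.g. death or recurrence) was observed,
            False if the subject was censored at that time
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data: the second subject is censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Comparing curves between groups (for example, patients with high versus low survival scores) additionally requires a test such as the log-rank test, which this sketch does not implement.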
  • “covariates” which may be categorical (such as the kind of treatment a patient received) or continuous (such as the patient's age, weight, or the dosage of a drug). For simple situations involving a single factor with just two values (such as drug vs placebo), there are methods for comparing the survival curves for the two groups of subjects. For more complicated situations, a special kind of regression that allows for assessment of the effect of each predictor on the shape of the survival curve is required.
  • the baseline survival curve is then systematically "flexed” up or down by each of the predictor variables, while still keeping its general shape.
  • the proportional hazards method (for example Cox Multivariate analysis) computes a "coefficient", or "relative weight coefficient", for each predictor variable that indicates the direction and degree of flexing that the predictor has on the survival curve. Zero means that a variable has no effect on the curve; it is not a predictor at all. A positive coefficient indicates that larger values of the variable are associated with greater mortality. Knowing these coefficients, a "customized" survival curve for any particular combination of predictor values is constructed. More importantly, the method provides a measure of the sampling error associated with each predictor's coefficient. This allows for assessment of which variables' coefficients are significantly different from zero; that is, which variables are significantly related to survival.
  • Multivariate Cox analysis is used to generate a "relative weight coefficient".
  • a "relative weight coefficient” is a value that reflects the predictive value of each gene comprising a gene set of the invention.
  • Multivariate Cox analysis computes a "relative weight coefficient" for each predictor variable (for example, each gene of a gene set) that indicates the direction and degree of flexing that the predictor has on a survival curve. Zero means that a variable has no effect on the curve and is not a predictor at all. A positive coefficient indicates that larger values of the variable are associated with greater mortality. Knowing these "relative weight coefficients", a survival curve can be constructed for any combination of predictor values.
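The "flexing" of a baseline survival curve can be written out using the standard form of the Cox proportional hazards model. The symbols below are not in the source and are introduced here for illustration: h_0(t) and S_0(t) are the baseline hazard and baseline survival curve, x_i is the (optionally log-transformed) expression value of gene i, and β_i is assumed to correspond to that gene's relative weight coefficient.

```latex
h(t \mid x_1,\dots,x_n) = h_0(t)\,\exp\!\Big(\sum_{i=1}^{n} \beta_i x_i\Big),
\qquad
S(t \mid x_1,\dots,x_n) = S_0(t)^{\exp\left(\sum_{i=1}^{n} \beta_i x_i\right)}
```

With β_i = 0 the factor for gene i is 1 and the curve is unchanged; with β_i > 0, larger x_i raises the hazard and lowers the survival curve. Under this reading, the sum in the exponent has the same form as the survival score obtained by summing the individual survival scores.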
  • a “correlation coefficient” means a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is perfect linear relationship with positive slope between the two variables, there is a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, there is a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables.
  • correlation coefficients include the correlation coefficient, ρx,y, that ranges between -1 and +1, such as is generated by Microsoft Excel's CORREL function; the Pearson product moment correlation coefficient, r, that also ranges between -1 and +1 and reflects the extent of a linear relationship between two data sets, such as is generated by Microsoft Excel's PEARSON function; or the square of the Pearson product moment correlation coefficient, r², through data points in known y's and known x's, such as is generated by Microsoft Excel's RSQ function.
  • the r² value can be interpreted as the proportion of the variance in y attributable to the variance in x.
  • a correlation coefficient, ρx,y, is greater than or equal to 0.8, or is greater than or equal to 0.9, or is greater than or equal to 0.95, or is greater than or equal to 0.995.
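The Pearson correlation coefficient r, its square r², and a threshold check can be computed directly from the definition. This is a minimal sketch in plain Python rather than Excel; the function name and the sample profiles are hypothetical, and the 0.8 threshold is one of the cutoffs listed in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression profiles, for illustration only.
reference = [1.0, 2.0, 3.0, 4.0]
sample = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(reference, sample)
r_squared = r * r
passes_threshold = r >= 0.8
print(r, r_squared, passes_threshold)
```

Because `sample` is an exact positive multiple of `reference`, r is 1 up to floating-point rounding; reversing one of the sequences gives -1.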
  • transformations e.g. natural log transformations
  • correlation coefficients either mathematically, or empirically using samples of known classification.
  • the magnitude of the correlation coefficient can be used as a threshold for classification.
  • the appropriate threshold can be determined through the use of test data that seek to classify samples of known classification using the methods of the present invention. The threshold is adjusted so that a desired level of accuracy (e.g., greater than about 70%, or greater than about 80%, or greater than about 90%, or greater than about 95%, or greater than about 99% accuracy) is obtained. This accuracy refers to the likelihood that an assigned classification is correct.
  • the tradeoff for the higher confidence is an increase in the fraction of samples that are unable to be classified according to the method. That is, the increase in confidence comes at the cost of a loss in sensitivity.
  • the expression value, or logarithmically transformed expression value for each member of a set of genes is multiplied by a "relative weight coefficient", as defined herein and as determined by multivariate Cox analysis, to provide an "individual survival score" for each member of a set of genes.
  • a "survival score” refers to the sum of the individual survival scores for each member of a set of genes of the invention.
  • survival analysis includes but is not limited to Kaplan-Meier survival analysis.
  • Kaplan-Meier survival analysis is carried out using GraphPad
  • a p-value according to the invention is less than or equal to 0.25, preferably less than or equal to 0.1 and more preferably, less than or equal to 0.075, for example, 0.075, 0.070, 0.065, 0.060, 0.055, 0.050 etc... and most preferably less than or equal to 0.05, for example, 0.05, 0.045, 0.040, 0.035, 0.020, 0.010 etc...
  • p-value refers to a p-value generated for a set of genes by multivariate Cox analysis.
  • a "p-value” as used herein also refers to a p-value for each member of a set of genes.
  • a “p-value” also refers to a p-value derived from Kaplan-Meier analysis, as defined herein.
  • a "p-value" of the invention is useful for determining if a set of genes or a subset of genes of the invention is predictive of a phenotype.
  • a “combination of gene sets” refers to at least two gene sets according to the invention.
  • a “combination of gene subsets” refers to at least two gene subsets according to the invention.
  • the term “probe” refers to a labeled oligonucleotide which forms a duplex structure with a gene in a gene set or gene subset of the invention, due to complementarity of at least one sequence in the probe with a sequence in the gene.
  • Probes useful for the formation of a cleavage structure according to the invention are between about 17-40 nucleotides in length, preferably about 17-30 nucleotides in length and more preferably about 17-25 nucleotides in length.
  • a “primer” or an “oligonucleotide primer” refers to a single stranded DNA or RNA molecule that is hybridizable to a gene in a gene set or gene subset of the invention and primes enzymatic synthesis of a second nucleic acid strand.
  • Oligonucleotide primers useful according to the invention are between about 10 to 100 nucleotides in length, preferably about 17-50 nucleotides in length and more preferably about 17-45 nucleotides in length.
  • Figure 1 shows the Kaplan-Meier survival curves for 79 prostate cancer patients stratified into distinct subgroups using a weighted survival predictor score algorithm.
  • Figure 2 shows the classification of patients diagnosed with four different types of epithelial cancer into sub-groups with distinct therapy outcome based on the expression profile of the 11-gene MTTS/PNS signature.
  • Figures 2A-D show the Kaplan-Meier survival curves for breast cancer patients and ovarian cancer patients stratified into distinct sub-groups using weighted survival predictor score algorithm.
  • Figure 3A-3Q-1 Identification and analysis of cyclin D1 gene signatures.
  • Figure 4A-4C-1 Identification and analysis of Myc gene signatures.
  • Figure 5A-5V-2 Identification and analysis of 100 most variable loci gene signatures.
  • Figure 6A-6K-1 Identification and analysis of 14q32 regulon gene signatures.
  • Figure 7A-7R-5 Identification and analysis of Suz12 gene signatures.
  • Tumors can be extremely heterogeneous due to genomic instability that leads to continuously emerging phenotypic diversity, clonal evolution, and clonal selection that occurs during malignant progression.
  • phenotypic diversity of cancer cells are significant changes in gene expression due to mutations. However, not all mutations and differences in gene expression are crucial or even relevant to the malignant phenotype. It is important to identify expression changes that are highly relevant and characteristic of malignant phenotypes and progression pathways (Hanahan, D., Weinberg, R. A. The hallmarks of cancer. Cell. 2000. 100: 57-70, incorporated herein by reference.).
  • the invention provides methods for identifying expression changes that are highly correlated with, and predictive of certain clinically relevant features of malignant phenotypes and progression pathways.
  • Expression values for any member of a gene set or subset according to the invention can be obtained by any method now known or later developed to assess gene expression, including but not limited to measurements relating to the biological processes of nucleic acid amplification, transcription, RNA splicing, and translation.
  • Direct and indirect measures of gene copy number e.g., as by fluorescence in situ hybridization or other type of quantitative hybridization measurement, or by quantitative PCR
  • transcript concentration e.g., by Northern blotting, expression array measurements or quantitative RT-PCR
  • protein concentration e.g., by quantitative 2-D gel electrophoresis, mass spectrometry, Western blotting, ELISA, or other method for determining protein concentration
  • RNA or mRNA is extracted using the RNeasy (Qiagen, Chatsworth, Calif.) or FastTract kits (Invitrogen, Carlsbad, Calif.). Cell lines are not split more than 5 times prior to RNA extraction, except where noted.
  • Affymetrix http://www.affymetrix.com.
  • approximately one microgram of mRNA is reverse transcribed with an oligo(dT) primer that has a T7 RNA polymerase promoter at the 5' end.
  • Second strand synthesis is followed by cRNA production incorporating a biotinylated base.
  • Hybridization to Affymetrix U95Av2 arrays representing 12,625 transcripts overnight for 16 h is followed by washing and labeling using a fluorescently labeled antibody.
  • the arrays are read and data processed using Affymetrix equipment and software as reported previously (LaTulippe et al., 2002, Cancer Res. 62:4499; Glinsky et al., 2003 Molecular Carcinogenesis 37:209).
  • the real-time PCR method measures the accumulation of PCR products with a fluorescence detector system and allows for quantification of the amount of amplified PCR products in the log phase of the reaction.
  • Total RNA is extracted using RNeasy Mini Kit (QIAGEN) according to the manufacturer's instructions. A measure of 1 μg (tumor samples), or 2 μg and 4 μg (independent preparations of reference cDNA samples), of total RNA is then used as a template for cDNA synthesis with Superscript II (Invitrogen Corp.).
  • Q-RT-PCR primer sequences are selected for each cDNA with the aid of Primer Express software (Applied Biosystems). PCR amplification is performed with gene-specific primers.
  • Q-RT-PCR reactions and measurements are performed with SYBR Green and ROX (Applied Biosystems) as a passive reference, using the ABI 7900HT Sequence Detection System (Applied Biosystems).
  • Conditions for the PCR are, for example, as follows: 1 cycle of 10 minutes at 95°C; and 40 cycles of 0.20 minutes at 94°C, 0.20 minutes at 60°C, and 0.30 minutes at 72°C.
  • the results are normalized to the relative amount of expression of an endogenous control gene, for example, GAPDH.
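Normalization to an endogenous control can be sketched as follows. The specification does not state the formula used; this minimal Python example assumes the widely used comparative threshold-cycle (Ct) approach, in which the difference between the target-gene Ct and the control-gene Ct (e.g., GAPDH) is converted to a relative amount. The function name, its parameters, and the default amplification efficiency of 2.0 (perfect doubling per cycle) are assumptions for illustration only.

```python
def relative_expression(ct_target, ct_control, efficiency=2.0):
    """Expression of a target gene relative to an endogenous control gene.

    Assumption: comparative-Ct normalization. A lower Ct means more template,
    so the relative amount is efficiency ** -(ct_target - ct_control).
    """
    delta_ct = ct_target - ct_control
    return efficiency ** (-delta_ct)
```

For example, a target gene that crosses the threshold five cycles later than the control is estimated at 1/32 of the control level under these assumptions.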
  • the methods of the invention use gene expression data from a set of tumor cell lines and compare those data with gene expression data from a set of control cell lines to identify those genes that are differentially expressed in the tumor cell lines as compared to the control cell lines. In preferred embodiments, each of these sets includes more than a single member, although it is contemplated to be within the scope of the present invention to practice embodiments in which either or both of the set of tumor cell lines and the set of control cell lines includes only one member.
  • the identified genes are referred to as a set of expressed genes.
  • the control cell line and the tumor cell lines are related insofar as the control cell lines represent physiologically normal cells from the tissue or organ from which the tumor represented by the tumor cell lines arose.
  • the control cell lines preferably are primary cultures of normal prostate epithelial cells.
  • more than one tumor cell line and more than one control cell line is used to generate the set of genes so as to reduce the number of genes in the set by eliminating those genes that are not consistently differentially expressed between the tumor and control cell lines.
  • the method may be practiced using only one tumor cell line and one control cell line, and identifying the set of genes differentially expressed between the tumor cell line and the control cell line.
  • the set is more likely to contain only those genes that are consistently differentially expressed between the normal and tumor classes of cell lines (i.e., a gene is included within the set if its expression level is always higher or lower in each of the tumor cell lines examined as compared to each of the control cell lines examined).
  • the methods of the invention are practiced without the use of cell lines, using instead data derived only from clinical samples.
  • the methods of the invention may be practiced using only data derived from cell lines.
  • pairwise comparisons are carried out for each of the 3x6 or 18 pairwise combinations between control cell lines and tumor cell lines.
  • a candidate gene will be included in the set if each of the 18 pairwise comparisons reveals the gene to be consistently differentially expressed (i.e., gene expression always is higher in the control cell line or always higher in the tumor cell line for each of the 18 pairwise comparisons).
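The pairwise screening criterion above can be sketched in Python. This is a minimal illustration, assuming expression values are held in a nested dictionary keyed by gene and cell line; the function names and the data layout are hypothetical. With three control lines and six tumor lines, `itertools.product` enumerates the 3x6 = 18 pairwise comparisons.

```python
from itertools import product

def consistently_differential(control_values, tumor_values):
    """Return 'up', 'down', or None for one gene.

    'up' means the gene is higher in every tumor line than in every control
    line; 'down' means it is lower in every pairwise comparison.
    """
    pairs = list(product(control_values, tumor_values))
    if all(t > c for c, t in pairs):
        return "up"
    if all(t < c for c, t in pairs):
        return "down"
    return None  # at least one comparison disagrees in direction

def select_gene_set(expression, control_lines, tumor_lines):
    """expression: {gene: {cell_line: value}}. Returns {gene: 'up' | 'down'}."""
    selected = {}
    for gene, values in expression.items():
        call = consistently_differential(
            [values[c] for c in control_lines],
            [values[t] for t in tumor_lines],
        )
        if call is not None:
            selected[gene] = call
    return selected
```

A gene that is higher in five tumor lines but lower in the sixth fails the criterion and is left out of the set.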
  • Such scaling may be routinely implemented in the analysis software provided by commercial suppliers of expression arrays or array readers (such as, e.g., Affymetrix, Santa Clara, Calif.).
  • Affymetrix Microarray Suite 4.0 User Guide, Affymetrix, Santa Clara, Calif., incorporated herein by reference.
  • a set of genes according to the invention is therefore a set of genes that have met a screening criterion requiring that the genes be differentially expressed between at least tumor and control cell lines.
  • This criterion reflects the hypothesis that differences in the tumor and control cell phenotypes are driven, at least in part, by differences in gene expression patterns in the tumor and control cells.
  • the methods of the invention may use additional steps to establish a set of expressed genes that are differentially expressed in cells of biological samples that differ with respect to a classification.
  • the classification may be an outcome predictor or cellular phenotype or any type of classification that may be used for classifying biological samples.
  • the classification may be binary (i.e., for two mutually exclusive classes such as, invasive/non-invasive, metastatic/non-metastatic, etc.), or may be continuously or discretely variable (i.e., a classification that can assume more than two values such as, e.g., Gleason scores, survival odds, etc.)
  • the only requirement is that the classified trait must be something that can be observed and characterized by the assignment of a variable or other type of identifier so that samples belonging to the same class may be grouped together during the analysis.
  • a set of expressed genes may also be obtained following essentially the same techniques described above, except sets of samples obtained from in vivo sources are used instead of sets of cell lines.
  • the sample sets preferably consist of tumor samples obtained from patients that are analyzed without any intervening tissue culturing steps so that the gene expression patterns reflect as closely as possible the pattern within cells growing in their undisturbed, in vivo environment.
  • the goal is to obtain a reference set that includes genes differentially expressed between samples belonging to different classifications.
  • the classification of interest is invasiveness (e.g., turning on whether tumor-free surgical margins are observed)
  • the number of pairwise comparisons that can be carried out is of course equal to the product of the numbers of independent samples in each category.
  • each of these pairwise comparisons is carried out and the same criterion for determining differential expression described above is used to select genes for inclusion into a second reference set. It is contemplated that in certain instances, especially, e.g., when the variance within a sample set is low, it will not be necessary to carry out all pairwise comparisons to select genes for inclusion into a set of genes according to the invention.
  • preferred numbers of different cell lines and samples per set used for calculating reference sets be in the range of 2 to 50 per set, or in the range of 2 to 25, or in the range of 2 to 10, or in the range of 3 to 5 per set. While not preferred, it also is contemplated to be within the scope of the present invention to use sets consisting of a single type of cell in one or more of the four sets of input cells used to calculate the first and second reference sets (i.e., tumor cell lines, control cell lines, first sample, and second sample). Direct statistical analysis using T-test and/or Mann-Whitney test for identification of genes differentially expressed in sets of biological samples that differ with respect to a classification is also applicable to the methods of the present invention. The average expression values for genes across the first and second sets of biological samples that differ with respect to a classification are used for calculation of fold expression changes (see below).
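The fold-change calculation and the Mann-Whitney comparison mentioned above can be sketched with the Python standard library. This is an illustration under stated assumptions: the fold change is taken as the ratio of the mean of the first set to the mean of the second set (the specification does not fix the direction), and only the Mann-Whitney U statistic is computed, not its p-value. The function names are hypothetical.

```python
from statistics import mean

def fold_change(first_values, second_values):
    """Ratio of average expression in the first set to that in the second set."""
    return mean(first_values) / mean(second_values)

def mann_whitney_u(first_values, second_values):
    """Mann-Whitney U statistic for the first set.

    Counts the pairs in which the first-set value exceeds the second-set
    value; ties contribute one half.
    """
    u = 0.0
    for x in first_values:
        for y in second_values:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

A U statistic equal to the product of the two set sizes (or equal to zero) means the two sets do not overlap at all; converting U to a p-value would require a statistics library or published tables.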
  • Gene sets and subsets of the invention are presented in Figures 3-7 and Table 3.
  • the methods of the invention are useful for studying any known cancer including but not limited to adrenal cancer, AIDS-related lymphoma, anal cancer, ataxia-telangiectasia, bladder cancer, brain tumors, breast cancer, cervical cancer, chronic lymphocytic leukemia, chronic myelogenous leukemia, colorectal cancer, craniopharyngioma, cutaneous T-cell lymphoma, endometrial and uterine cancer, esophageal cancer, Ewing's sarcoma, fallopian tube cancer, gallbladder cancer, gastric cancer, gestational trophoblastic disease, choriocarcinoma, Hairy cell leukemia, Hodgkin's disease, Kaposi's sarcoma, kidney cancer, laryngeal cancer, leukemia including acute lymphocytic leukemia and acute myelogenous leukemia, Li-Fraumeni syndrome, liver cancer, lung cancer, Hodgkin's
  • the methods of the invention are useful for studying the gene expression profiles that are predictive of prostate cancer, breast cancer and cancer metastasis.
  • Breast cancer is the most common cancer among women in North America and Western Europe and is the second leading cause of female cancer death in the United States. In the United States, age-adjusted breast cancer incidence rates have considerably increased during the last century. Approximately 40% of patients diagnosed with breast cancer have disease that has regional or distant metastases and, at present, there is no efficient curative therapy for breast cancer patients with advanced metastatic disease. Developing a treatment strategy appropriate for any individual with early stage disease is difficult and insufficient treatment leads to local disease extension and metastasis. Therefore, there is an urgent clinical need for novel diagnostic methods that would allow early identification of those breast cancer patients who are likely to develop metastatic disease and would require the most aggressive and advanced forms of therapy for increased chance of survival. The identification of those genetic changes that distinguish aggressive metastatic disease and predict metastatic behavior would, therefore, be a breakthrough. The methods of the present invention provide information that allows prognostication of aggressive metastatic disease.
  • Cancer cells have exceedingly low survival rates in the circulation (reviewed in Glinsky, G. V. 1993. Cell adhesion and metastasis: is the site specificity of cancer metastasis determined by leukocyte-endothelial cell recognition and adhesion? Crit. Rev. Onc./Hemat., 14: 229-278, incorporated herein by reference). Even if the bloodstream contains many cancer cells, there may be no clinical or pathohistological evidence of metastatic dissemination into the target organs (Williams, W. R. The theory of Metastasis. In The Natural History of Cancer. 1908; 442-448; Goldmann, E. 1907. The growth of malignant disease in man and the lower animals, with special reference to the vascular system. Proc. R.
  • the individual "average" cancer cell survives only a short time in the circulation.
  • the successful metastatic cancer cells are able to find a largely unknown survival and escape route.
  • Patients at high risk for metastatic disease could be better managed if gene expression patterns correlated with a clinical metastatic phenotype are identified.
  • the methods of the present invention identify such gene expression patterns. Patients' tumor samples can be tested to see whether the gene expression pattern is associated with an increased risk of metastasis, and if so, the patients can be treated with more aggressive therapies to lower the risk of metastasis.
  • the present invention provides for methods that allow identification of such gene expression patterns, and sample classification based on those patterns.
  • multivariate analysis is multivariate Cox analysis as described in Glinsky et al., 2005 J. Clin. Invest. 115:1503.
  • multivariate Cox analysis refers to Cox proportional hazard survival regression analysis as performed by using the program presented at the world wide web at http://members.aol.com/johnp71/prophaz.html, and as described in Glinsky et al., 2005, J. Clin. Investig. 115:1503.
  • the invention also provides for implementation of a weighted survival score analysis.
  • Weighted survival score analysis reflects the incremental statistical power of individual covariates as predictors of therapy outcome based on a multicomponent prognostic model. For example, microarray-based or Q-RT-PCR-derived gene expression values are normalized and log-transformed on a base 10 scale. The log-transformed normalized expression values for each data set are analyzed in a multivariate Cox proportional hazard regression model, with overall survival or event-free survival as the dependent variable. To calculate the survival/prognosis predictor score for each patient, the log-transformed normalized gene expression value measured for each gene is multiplied by a coefficient derived from the multivariate Cox proportional hazard regression analysis, for example a relative weight coefficient, as defined herein.
  • Final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the genes in the multivariate analysis.
  • the negative weighting values indicate that higher expression correlates with longer survival and favorable prognosis, whereas the positive score values indicate that higher expression correlates with poor outcome and shorter survival.
  • the weighted survival predictor model is based on a cumulative score of the weighted expression values of all of the genes of a set of genes.
  • the invention provides for an individual survival score for each member of a set of genes, calculated by multiplying the expression value or the logarithmically transformed expression value for each member of a set of genes by a relative weight coefficient or a correlation coefficient, as determined by multivariate Cox analysis.
  • the invention also provides for a survival score, wherein a survival score is the sum of the individual survival scores for each member of a set of genes.
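The weighted survival score described above can be written as a short Python sketch. It assumes that normalized expression values and the per-gene coefficients from a multivariate Cox analysis are already available as dictionaries; the gene names and coefficient values used in any call are placeholders, not values from the specification.

```python
from math import log10

def individual_scores(expression, weights):
    """Per-gene score: coefficient times log10 of the normalized expression value.

    expression: {gene: normalized expression value (> 0)}
    weights:    {gene: coefficient from a multivariate Cox analysis}
    """
    return {gene: weights[gene] * log10(expression[gene]) for gene in weights}

def survival_score(expression, weights):
    """Survival score: the sum of the individual per-gene scores."""
    return sum(individual_scores(expression, weights).values())
```

With a coefficient of 0.5 on a gene expressed at 100 and a coefficient of -1.0 on a gene expressed at 10, the two individual scores are 1.0 and -1.0 and the survival score is 0. Negative coefficients pull the score toward the favorable-prognosis end and positive coefficients toward the poor-prognosis end, as stated above.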
  • Survival analysis refers to a method of verifying that a set of genes or a subset of genes according to the invention is "predictive", as defined herein, of a particular phenotype of interest.
  • Survival analysis includes but is not limited to Kaplan-Meier survival analysis.
  • the Kaplan-Meier survival analysis is carried out using the Prism 4.0 software. Statistical significance of the difference between the survival curves for different groups of patients was assessed using Chi square and Logrank tests.
  • the Kaplan-Meier survival analysis is carried out using GraphPad Prism version 4.00 software (GraphPad Software).
  • the endpoint for survival analysis in prostate cancer is the biochemical recurrence defined by the serum prostate-specific antigen (PSA) increase after therapy.
  • Disease-free interval is defined as the time period between the date of radical prostatectomy (RP) and the date of PSA relapse (for the recurrence group) or the date of last follow-up (for the non-recurrence group).
  • Statistical significance of the difference between the survival curves for different groups of patients is assessed using χ² and log-rank tests.
  • the major mathematical complication with survival analysis is that you usually do not have the luxury of waiting until the very last subject has died of old age; you normally have to analyze the data while some subjects are still alive. Also, some subjects may have moved away, and may be lost to follow-up. In both cases, the subjects were known to have survived for some amount of time (up until the time the one performing the analysis last saw them). However, the one performing the analysis may not know how much longer a subject might ultimately have survived.
  • Several methods have been developed for using this "at least this long" information to prepare unbiased survival curve estimates, the most common being the Life Table method and the Kaplan-Meier method, as defined herein.
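How censored follow-up ("at least this long") enters a Kaplan-Meier estimate can be shown with a minimal Python sketch of the product-limit estimator. This is a generic textbook implementation, not the GraphPad Prism procedure referenced in the specification; the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times:  follow-up duration for each subject
    events: True if the event (e.g., relapse or death) was observed at that
            time, False if the subject was censored (alive or lost to follow-up)
    Returns a list of (time, survival_probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= removed
    return curve
```

A censored subject contributes to the number at risk up to the last time it was seen and then drops out, so its partial follow-up is used rather than discarded.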
  • the methods of the invention and the subset of genes of the invention are useful for identifying a subset of genes for use in predicting a phenotype of the invention, for example disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
  • the methods of the invention are also useful for predicting a phenotype in a subject.
  • the gene subsets of the invention are useful for predicting the interval to disease recurrence, distant metastasis, and death after therapy.
  • the methods of the invention are useful for identifying markers of malignant phenotypes for diagnostic and prognostic purposes, as well as for drug discovery purposes.
  • a subset of genes for use in predicting a phenotype in a subject can be generated as follows.
  • a set of expression values for a set of genes is obtained for a first and a second sample by measuring the level of expression in the first and second samples according to any method known in the art and described herein.
  • Genes that are differentially expressed are identified by comparing the level of expression in the first sample with the level of expression in the second sample. An expression value that is increased or decreased in the first sample as compared to the second sample is differentially expressed.
  • the significant prognosis predictors in univariate analysis were Cyclin B1, BUB1, HEC, and the 11-gene signature.
  • the analysis seems to indicate that individual genes demonstrate a variable performance across multiple outcome data sets and no single gene was identified that was uniformly predictive of the poor therapy outcome.
  • the most significant prostate cancer recurrence predictor was the model that included 11 covariates: the 11-gene signature; four individual genes (KI67, ANK3, FGFR2, CES1); and six clinico-pathological features (pre-RP Gleason sum, surgical margins, seminal vesicle invasion, age, and extra-capsular extension).
  • Final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the eleven genes in the multivariate analysis.
  • the negative weighting values imply that higher expression correlates with longer survival and favorable prognosis, whereas the positive score values indicate that higher expression correlates with poor outcome and shorter survival.
  • Application of the weighted survival predictor model based on a cumulative score of the weighted expression values of eleven genes confirmed the prognostic power of the identified 11-gene signature in stratification of prostate cancer patients into sub-groups with statistically distinct probability of relapse-free survival after radical prostatectomy (Figure 1).
  • Kaplan-Meier analysis indicates that breast cancer patients with tumors displaying a stem cell-like expression profile of the 11-gene signature have a significantly higher probability of developing distant metastases within 5 years after therapy and therefore can be identified as a poor prognosis sub-group (data not shown).
  • Median metastasis-free survival after therapy in the poor prognosis sub-group of breast cancer patients defined by the 11-gene signature was 26 months.
  • 84 % of patients in the poor prognosis sub-group were diagnosed with distant metastasis within 5 years after therapy (data not shown). In contrast, 62 % of patients in the good prognosis sub-group remained metastasis-free (data not shown).
  • the estimated hazard ratio for metastasis-free survival after therapy in the poor prognosis sub-group as compared with the good prognosis sub-group of patients defined by the 11-gene signature was 3.762 (95% confidence interval of ratio, 3.421 to 20.27; P < 0.0001).
  • expression pattern of the 11-gene MTTS/PNS signature is strongly predictive of a short post-diagnosis and post-treatment interval to distant metastases in early stage breast cancer patients. It was then determined whether expression analysis of the 11-gene signature would be informative for stratifying patients into sub-groups with distinct survival probability after therapy in the group of 125 patients diagnosed with lung adenocarcinoma (Bhattacharjee et al., 2001 Proc. Natl. Acad. Sci. USA 98:13790).
  • Figures 2A and 2B - 2D show the Kaplan-Meier survival curves for 97 breast cancer patients and ovarian cancer patients stratified into distinct sub-groups using weighted survival predictor score algorithm.
  • the Kaplan-Meier analysis shows that patients with tumors displaying a stem cell-like expression profile of the 11-gene signature have significantly higher risk of death after therapy and therefore can be defined as a poor prognosis sub-group (data not shown).
  • Median survival after therapy in the poor prognosis sub-group of lung adenocarcinoma patients defined by the 11-gene BMI-1-pathway signature was 15.2 months (data not shown).
  • the median survival after therapy in the good prognosis sub-group was 48.8 months. 100 % of patients in the poor prognosis subgroup died within 3 years after therapy.
  • Clinical Samples Expression profiling data of primary tumor samples obtained from 1122 cancer patients representing therapy outcome cohorts for 10 types of human cancer (Table 2) were analyzed in this study. Microarray analysis and associated clinical information for 32 clinical samples (23 primary prostate tumors and 9 distant metastatic lesions) utilized to delineate the expression profiles of human prostate cancer metastases were reported previously (LaTulippe et al., 2002 Cancer Res. 62:4499). Two clinical outcome sets comprising 21 (outcome set 1) and 79 (outcome set 2) samples were utilized for analysis of the association of the therapy outcome with distinct expression profiles of the 11-gene signature. Original gene expression profiles of the 21 clinical samples analyzed in this study were reported elsewhere (Singh et al., 2002 Cancer Cell 1:203). Primary gene expression data files of clinical samples as well as associated clinical information can be found on the world wide web at genome.wi.mit.edu/cancer/.
  • Prostate tumor tissues comprising the second clinical outcome set were obtained from 79 prostate cancer patients undergoing therapeutic or diagnostic procedures performed as part of routine clinical management at the Memorial Sloan-Kettering Cancer Center (MSKCC). Clinical and pathological features of the 79 prostate cancer cases comprising the validation outcome set are presented elsewhere (Glinsky et al., 2004 J. Clin. Invest. 113:913). Median follow-up after therapy in this cohort of patients was 70 months. Samples were snap-frozen in liquid nitrogen and stored at -80°C. Each sample was examined histologically using H&E-stained cryostat sections. Care was taken to remove nonneoplastic tissues from tumor samples. Cells of interest were manually dissected from the frozen block, trimming away other tissues. All of the studies were conducted under MSKCC Institutional Review Board-approved protocols.
  • LNCaP- and PC-3-derived cell lines were developed by consecutive serial orthotopic implantation, either from metastases to the lymph node (for the LN series), or reimplanted from the prostate (Pro series). This procedure generated cell variants with differing tumorigenicity, frequency and latency of regional lymph node metastasis (Glinsky et al., 2003 Mol. Carcinog. 37:209).
  • cell lines were grown in RPMI 1640 supplemented with 10% FBS and gentamycin (Gibco BRL) to 70-80% confluence and subjected to serum starvation as described (Glinsky et al., 2003 Mol. Carcinog. 37:209), or maintained in fresh complete media, supplemented with 10% FBS.
  • Anoikis assay Cells were harvested by 5-min digestion with 0.25% trypsin/0.02% EDTA (Irvine Scientific, Santa Ana, CA, USA), washed and resuspended in serum free medium. Cells at a concentration of 1.7 × 10⁵ cells/well in 1 ml of serum free medium were plated in 24-well ultra low attachment polystyrene plates (Corning Inc., Corning, NY, USA) and incubated at 37°C and 5% CO₂ overnight. Viability of cell cultures subjected to anoikis assays was > 95% in the Trypan blue dye exclusion test.
  • Apoptosis assay Apoptotic cells were identified and quantified using the Annexin V-FITC kit (BD Biosciences Pharmingen, world wide web at bdbisciences.com) per manufacturer instructions. The following controls were used to set up compensation and quadrants: 1) Unstained cells; 2) Cells stained with Annexin V-FITC (no PI); 3) Cells stained with PI (no Annexin V-FITC). Each measurement was carried out in quadruplicate and each experiment was repeated at least twice.
  • Annexin V-FITC positive cells were scored as early apoptotic cells; both Annexin V-FITC and PI positive cells were scored as late apoptotic cells; unstained Annexin V-FITC and PI negative cells were scored as viable or surviving cells. In selected experiments apoptotic cell death was documented using the TUNEL assay.
  • Orthotopic xenografts Orthotopic xenografts of human prostate PC-3 cells and sublines used in this study were developed by surgical orthotopic implantation as previously described (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). Briefly, 2 × 10⁶ cultured PC-3 cells, PC-3M or PC-3MLN4 sublines were injected subcutaneously into male athymic mice, and allowed to develop into firm palpable and visible tumors over the course of 2 - 4 weeks.
  • Intact tissue was harvested from a single subcutaneous tumor and surgically implanted in the ventral lateral lobes of the prostate gland in a series of six athymic mice per cell line subtype as described earlier (Glinsky et al., 2003 Mol. Carcinog. 37:209).
  • TRAMP transgenic adenocarcinoma of the mouse prostate
  • the TRAMP mice colony is based on a breeding pair of TRAMP mice kindly provided by Norman Greenberg (Baylor College of Medicine, Houston, TX). Standard PCR assay was carried out for monitoring the presence of the SV40 large T-antigen in new litters. Twenty-one PCR-confirmed male TRAMP mice were defined for microarray analysis carried out in this study.
  • RNA and mRNA extraction Cells were harvested in lysis buffer 2 hrs after the last media change at 70-80% confluence, and total RNA or mRNA was extracted using the RNeasy (Qiagen, Chatsworth, CA) or FastTract kits (Invitrogen, Carlsbad, CA). Cell lines were not split more than 5 times prior to RNA extraction, except where noted.
  • Affymetrix arrays The protocol for mRNA quality control and gene expression analysis was that recommended by Affymetrix (on the world wide web at affymetrix.com). In brief, approximately one microgram of mRNA was reverse transcribed with an oligo(dT) primer that has a T7 RNA polymerase promoter at the 5' end. Second strand synthesis was followed by cRNA production incorporating a biotinylated base. Hybridization to Affymetrix U95 Av2 arrays representing 12,625 transcripts overnight for 16 h was followed by washing and labeling using a fluorescently labeled antibody. The arrays were read and data processed using Affymetrix equipment and software as reported previously (LaTulippe et al, 2002, Cancer Res. 62:4736; Glinsky et al., J. Clin. Invest. 113:913; Glinsky et al., 2003 MoI. Carcinog. 37:209).
  • Affymetrix technology have been reported (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). 40-50% of the surveyed genes were called present by the Affymetrix Microarray Suite 5.0 software in these experiments. The concordance analysis of differential gene expression across the data sets was performed using Affymetrix MicroDB v. 3.0 and DMT v. 3.0 software as described earlier (LaTulippe et al., 2002, Cancer Res.
  • Microarray data were processed using the Affymetrix Microarray Suite v.5.0 software, and statistical analysis of expression data sets was performed using the Affymetrix MicroDB and Affymetrix DMT software.
  • The Pearson correlation coefficient for individual test samples and the appropriate reference standard was determined using Microsoft Excel and GraphPad Prism version 4.00 software. The significance of the overlap between the lists of stem cell-associated and prostate cancer-associated genes was calculated by using the hypergeometric distribution test (Tavazoie et al., 1999, Nat. Genet. 22:281). The analytical protocol of identification and validation of the 11-gene BMI-1-pathway signature is described below.
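The overlap test mentioned above can be sketched as an upper-tail hypergeometric probability. This is a minimal illustration, not the software used in the study; the function name and its four parameters are hypothetical, and the caller must supply the total number of genes on the array and the sizes of the two lists.

```python
from math import comb


def hypergeom_overlap_pvalue(population: int, list_a: int, list_b: int, overlap: int) -> float:
    """Probability of observing at least `overlap` shared genes by chance.

    Assumes list_b genes are drawn without replacement from `population`
    genes, of which list_a belong to the first list (illustrative names).
    """
    max_overlap = min(list_a, list_b)
    total = comb(population, list_b)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(list_a, k) * comb(population - list_a, list_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

A small p-value indicates that the two gene lists share more members than would be expected from random draws of the same sizes.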
  • The Multiple Experiments Viewer (MEV) software version 3.0.3 of the Institute for Genomic Research (TIGR) was used for Support Vector Machine (SVM) classification and terrain (TRN) clustering data analysis and visualization. Protocol of discovery and validation of the 11-gene BMI-1-pathway signature.
  • transcripts activated and suppressed in prostate cancer metastases would recapitulate the expression profile of the BMI-1-regulated genes in neural stem cells by comparing the sets of differentially regulated genes in search of unions/intersections of lists for both up- and down-regulated transcripts.
  • the primary criterion in transcript selection process should be the concordance of changes in expression rather than a magnitude of changes (e.g., fold change).
  • transcripts of interest would be expected to have a tightly controlled "rank order" of expression within a cluster of co-regulated genes reflecting a balance of up- and down-regulated mRNAs as a desired regulatory end-point in a cell.
  • this analytical step defined three large parent signatures (data not shown): MTTS signature comprising 868 up-regulated and 477 down-regulated transcripts; PNS signature comprising 885 up-regulated and 1088 down-regulated transcripts; and CNS signature comprising 769 up-regulated and 778 down-regulated transcripts.
  • Step 2. Sub-sets of transcripts exhibiting concordant expression changes in metastatic TRAMP tumor samples (MTTS signature) as well as PNS (PNS signature) and CNS (CNS signature) neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds were identified.
  • MTTS signature metastatic TRAMP tumor samples
  • PNS signature PNS neurospheres
  • CNS signature CNS neurospheres
  • transcripts were obtained by intersections of the two lists of up-regulated and the two lists of down-regulated genes.
  • Step 3. Small gene clusters were selected from sub-sets of genes exhibiting concordant changes in transcript abundance in metastatic TRAMP tumor samples and in PNS and CNS neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds. Expression profiles were presented as log10 average fold changes for each transcript and processed for visualization and Pearson correlation analysis using Microsoft Excel software. For the concordant differentially expressed genes, vectors of log10 average fold change were determined for both experimental settings, and the correlation between the two vectors was computed. Practical considerations essential for future development of genetic diagnostic tests prompted the selection, from the concordant gene sets, of small gene expression signatures comprising transcripts with a high level of expression correlation in metastatic cancer cells and stem cells.
  • The concordant list of differentially expressed genes was reduced by removing from the list the genes whose removal led to the largest increase in the correlation coefficient.
  • The reduction in the signature transcript number was terminated when further elimination of a transcript did not increase the value of the Pearson correlation coefficient. The cut-off criterion for signature reduction was arbitrarily set to exceed a Pearson correlation coefficient of 0.95 (P < 0.0001). Using this approach, a single candidate prognostic gene expression signature was selected for each intersection of the MTTS signature and parent stem cell signatures (data not shown).
  • three highly concordant small signatures were identified corresponding to three concordant sub-sets of genes defined in Step 2 (a set of 11 genes comprising 8 up-regulated and 3 down-regulated transcripts for PNS neurospheres, 11-gene MTTS/PNS signature; a set of 11 genes comprising 7 up-regulated and 4 down-regulated transcripts for CNS neurospheres, 11-gene MTTS/CNS signature; and a set of 14 genes comprising 8 up-regulated and 6 down-regulated transcripts, MTTS/PNS/CNS signature).
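The Step 3 reduction can be sketched as a greedy backward elimination on two vectors of log10 average fold changes. This is an illustrative reading of the procedure rather than the original Excel workflow; the function names and the `min_genes` guard are assumptions (without a floor, any two remaining genes trivially correlate at ±1).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reduce_signature(tumor, stem, min_genes=3):
    """Greedily drop the gene whose removal most increases the correlation.

    `tumor` and `stem` map gene -> log10 average fold change in the two
    settings. Stops when no removal increases r or `min_genes` is reached
    (the `min_genes` floor is an assumption of this sketch).
    """
    genes = list(tumor)
    r = pearson([tumor[g] for g in genes], [stem[g] for g in genes])
    while len(genes) > min_genes:
        best_r, best_gene = r, None
        for g in genes:
            rest = [h for h in genes if h != g]
            r_rest = pearson([tumor[h] for h in rest], [stem[h] for h in rest])
            if r_rest > best_r:
                best_r, best_gene = r_rest, g
        if best_gene is None:  # no single removal improves the correlation
            break
        genes.remove(best_gene)
        r = best_r
    return genes, r
```

In practice the returned correlation would be compared against the 0.95 cut-off stated in the text before a reduced signature is accepted.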
  • Step 4. The small signatures (one 11-gene signature for the PNS set, one 11-gene signature for the CNS set, and one 14-gene signature for the common PNS/CNS set) identified in Step 3 were tested for metastatic phenotype discriminative power (using one mouse prostate cancer data set and one human prostate cancer data set comprising primary and metastatic tumors) and therapy outcome classification performance (using human prostate cancer therapy outcome set 1). The three identified small signatures were evaluated for their ability to discriminate metastatic and primary prostate tumors in a TRAMP mouse model of prostate cancer, in clinical samples of 9 metastatic versus 23 primary prostate tumors, and in primary prostate tumors from 21 patients with distinct outcomes after therapy (8 recurrent and 13 non-recurrent samples).
  • Negative expression values were treated as missing data. Based on the expected correlation of expression profiles of identified gene clusters with stem cell-like expression profiles, the corresponding correlation coefficients calculated for individual samples were given the identifier of stem cell-resembling phenotype association indices (SPAIs). The prognostic power of the identified small signatures was evaluated based on their ability to discriminate metastatic versus primary tumors (criterion 1) and to segregate patients with recurrent and non-recurrent prostate tumors into distinct sub-groups (criterion 2). A single best performing small signature was selected for subsequent validation analysis (data not shown). Based on diagnostic and prognostic classification performance, a single best performing 11-gene MTTS/PNS signature was selected for further validation analysis (data not shown). Step 5.
  • The training set was used to select the prognosis discrimination cut-off value for a signature based on the highest level of statistical significance in patient stratification into poor and good prognosis groups as determined by the log-rank test (lowest P value and highest hazard ratio in the training set). Clinical samples having a Pearson correlation coefficient at or above the cut-off value were identified as having the poor prognosis signature. Clinical samples with a Pearson correlation coefficient below the cut-off value were identified as having the good prognosis signature. Each training set was used to estimate a threshold of the correlation coefficients before performing a survival analysis. The same discrimination cut-off value was then applied to evaluate the reproducibility of the prognostic performance in the test set of patients.
  • the model was applied to the entire outcome set using the same cut off threshold to confirm the classification performance.
  • the average gene expression vectors were computed for each gene and applied separately on the training, test, and the combined data sets.
  • the training and test sets were balanced with respect to the total number of patients, negative and positive therapy outcomes, and the length of survival.
  • For the breast cancer data set, the patients' distribution among training and test data sets described in the original publication (van 't Veer, L.J. et al., 2002, Nature 415:530) was maintained.
  • additional model training, development or optimization steps, with the exception of a prognostic cut off threshold selection in a training set were not carried out.
  • the same MTTS/PNS expression profile was consistently used throughout the study as a reference standard to quantify the Pearson correlation coefficients of the individual samples.
  • Step 7. The model performance was tested using various sample stratification approaches such as terrain (TRN) clustering (data not shown), support vector machine (SVM) classification (data not shown), and weighted survival score algorithm (Figures 1 and 2A).
  • TRN terrain
  • SVM support vector machine
  • Figures 1 and 2A weighted survival score algorithm
  • The therapy outcome predictive power of the 11-gene model in the prostate cancer setting was evaluated using a prognostic test based on an independent method of gene expression analysis, namely the quantitative reverse-transcription polymerase chain reaction (Q-RT-PCR) method (data not shown).
  • Q-RT-PCR quantitative reverse-transcription polymerase chain reaction
  • SPAIs stem cell-resembling phenotype association indices
  • a standard PNS neurosphere and TRAMP metastasis values were established (data not shown). They were used as uniform reference standards for measurements of Pearson correlation coefficients for clinical samples consistently throughout the study.
  • a degree of resemblance of the transcript abundance rank order within a gene cluster between a test sample and reference standard is measured by a Pearson correlation coefficient and designated as a phenotype association index (PAI).
  • Samples with stem cell-resembling expression profiles are expected to have positive values of Pearson correlation coefficients.
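The phenotype association index described above can be sketched as a Pearson correlation between a sample's expression vector and the reference standard over the signature genes, followed by the cut-off rule from the training step. This is a minimal sketch under stated assumptions: the function names are hypothetical, inputs are assumed to be already normalized and log-transformed, and missing values (for example, negative expression values treated as missing) are assumed to be passed as `None`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def phenotype_association_index(sample, reference):
    """PAI: correlation of a sample with the reference over shared genes.

    `sample` and `reference` map gene -> log-scale expression value;
    genes missing from the sample (absent or None) are skipped.
    """
    genes = [g for g in reference if sample.get(g) is not None]
    if len(genes) < 3:
        raise ValueError("need at least three measured signature genes")
    return pearson([sample[g] for g in genes], [reference[g] for g in genes])


def classify(pai, cutoff):
    """Samples at or above the training-set cut-off carry the poor prognosis signature."""
    return "poor prognosis" if pai >= cutoff else "good prognosis"
```

A positive index indicates a stem cell-resembling profile; the cut-off itself would come from the training set, as the protocol describes.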
  • Random co-occurrence test. A 10,000-permutation test was performed to check the likelihood that small 11-gene signatures derived from the large MTTS signature would display high discrimination power, and to assess the significance at the 0.1% level.
  • The sample stratification power of 10,000 permutations of small 11-gene signatures derived from the large 1345-gene MTTS signature was compared to that of the 11-gene MTTS/PNS signature.
  • Random concordant gene sets comprising ~200 transcripts were generated using a mouse transcriptome data set representing expression profiling data of ~12,000 transcripts across 45 normal tissues (Su et al., 2002 Proc. Natl. Acad. Sci. USA 99:4465). Inter- and intra-species array-to-array probe set matching was performed at 95% or greater identity level using the Affymetrix database (available on the world wide web at affymetrix.com). To assess the discrimination of random 11-gene signatures derived from the 1345-gene MTTS signature, a two-tailed t-test was carried out for the metastatic versus primary prostate cancer data set (32 samples) and the recurrent versus non-recurrent prostate cancer data set (21 samples).
  • The signatures were ranked based on p-values, and the ranking metrics of each random 11-gene signature were compared to the p-values of the 11-gene MTTS/PNS signature. 10,000 permutations were found to generate 7 random 11-gene signatures performing at the sample classification level of the 11-gene MTTS/PNS signature.
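The random co-occurrence test can be sketched as repeated random draws of small signatures from the parent gene list, counting how many perform at least as well as the observed signature. This is an illustrative sketch only: the function name, the `score` callable (higher is better, for example a negated t-test p-value), and the fixed seed are assumptions, not details from the source.

```python
import random


def random_signature_test(parent_genes, size, score, observed_score,
                          n_perm=10_000, seed=0):
    """Count random `size`-gene subsets scoring at least `observed_score`.

    `score` is a caller-supplied function (hypothetical) that maps a list
    of genes to a discrimination score, higher meaning better.
    Returns (hits, empirical p-value).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        subset = rng.sample(parent_genes, size)  # draw without replacement
        if score(subset) >= observed_score:
            hits += 1
    return hits, hits / n_perm
```

With the counts reported in the text, 7 matching signatures out of 10,000 draws correspond to an empirical probability of 0.0007, which is below the 0.1% level.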
  • Weighted survival predictor score algorithm. The weighted survival score analysis was implemented to reflect the incremental statistical power of the individual covariates as predictors of therapy outcome based on a multi-component prognostic model.
  • The microarray-based or Q-RT-PCR-derived gene expression values were normalized and log-transformed on a base 10 scale.
  • the log-transformed normalized expression values for each data set were analyzed in a multivariate Cox proportional hazards regression model, with overall survival or event-free survival as the dependent variable.
  • Final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the eleven genes in the multivariate analysis.
  • the negative weighting values indicate that higher expression correlates with longer survival and favorable prognosis, whereas the positive score values indicate that higher expression correlates with poor outcome and shorter survival.
  • the weighted survival predictor model is based on a cumulative score of the weighted expression values of eleven genes.
  • Relapse-free survival score = (-0.403 × Gbx2) + (1.2494 × KI67) + (-0.3105 × Cyclin B1) + (-0.1226 × BUB1) + (0.0077 × HEC) + (0.0369 × KIAA1063) + (-1.7493 × HCFC1) + (-1.1853 × RNF2) + (1.5242 × ANK3) + (-0.5628 × FGFR2) + (-0.4333 × CES1).
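The weighted score can be sketched as a dot product between the listed coefficients and log10-transformed expression values. This is a minimal sketch: the dictionary and function names are hypothetical, the gene labels follow the formula with apparent character-recognition errors corrected (an assumption), and inputs are assumed to be positive, already normalized expression values.

```python
from math import log10

# Coefficients as listed in the relapse-free survival score formula.
WEIGHTS = {
    "Gbx2": -0.403, "KI67": 1.2494, "Cyclin B1": -0.3105, "BUB1": -0.1226,
    "HEC": 0.0077, "KIAA1063": 0.0369, "HCFC1": -1.7493, "RNF2": -1.1853,
    "ANK3": 1.5242, "FGFR2": -0.5628, "CES1": -0.4333,
}


def survival_score(expression, weights=WEIGHTS):
    """Sum of weight * log10(normalized expression) over the signature genes.

    `expression` maps gene -> normalized, strictly positive expression value.
    """
    return sum(w * log10(expression[gene]) for gene, w in weights.items())
```

Because each term is a weight times a log-scale value, genes with negative weights lower the score as their expression rises, matching the interpretation of negative weights as favorable.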
  • BMI-1 siRNA experiments. The target siRNA SMART pools for BMI-1 and control luciferase siRNAs were purchased from Dharmacon Research, Inc. They were transfected into PC-3-32 human prostate carcinoma cells according to the manufacturer's protocols. Cell cultures were continuously monitored for growth and viability and assayed for mRNA expression levels of BMI-1 and a selected set of genes (Table 2 and Figure 2) using RT-PCR and Q-RT-PCR methods. Quantitative RT-PCR analysis. The real-time PCR method measures the accumulation of PCR products by a fluorescence detector system and allows for quantification of the amount of amplified PCR products in the log phase of the reaction.
  • mRNA messenger RNA
  • GPDH endogenous control gene
  • TTCCTCTTGTGCTCTTGCTGG-3' was used as the endogenous RNA and cDNA quantity normalization control.
  • cDNA prepared from primary in vitro cultures of normal human prostate epithelial cells (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209)
  • cDNA derived from the PC-3M human prostate carcinoma cell line (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog.
  • MCL mantle cell lymphoma
  • AML acute myeloid leukemia
  • RP radical prostatectomy
  • PSA prostate specific antigen
  • SM surgical margins
  • GLSN SUM Gleason sum
  • Sem Ves Inv seminal vesicle invasion
  • ECE extracapsular extension.

Abstract

The invention relates to methods of identifying gene subsets for use in predicting a phenotype in a subject. The invention also relates to methods of predicting a phenotype in a subject. The invention additionally relates to methods of validating a gene set. The invention further relates to subsets of genes and kits comprising subsets of genes for use in predicting a phenotype in a subject.

Description

METHODS OF IDENTIFICATION AND USE OF GENE SIGNATURES
This application claims the benefit of U.S. Provisional Application No. 60/721,875, filed on September 29, 2005, which is incorporated herein by reference in its entirety.
FIELD OF THE INVENTION
The invention relates to methods of identifying a set of genes that can predict a phenotype, and the use of these gene sets for predicting a phenotype of interest.
BACKGROUND
A growing number of genome-wide expression profiling studies provide experimental evidence indicating the presence of a transcriptionally distinct sub-type of human solid tumors manifesting a marked propensity toward metastatic dissemination, highly malignant clinical behavior, and a high probability of poor therapy outcome in cancer patients diagnosed with early stage carcinomas of various origins. These results are consistent with the idea that in a sub-set of human solid tumors, the acquisition of full metastatic potential, including an emergence and seeding of potent metastasis precursor cells, is a relatively early event in tumor progression. Collectively, these data suggest an early involvement in development of this transcriptionally defined sub-type of human carcinomas of a highly malignant combination of mutant alleles conferring the proclivity to metastasize and/or an engagement in transformation and tumor progression of unique non-conventional cellular targets such as stem cells and/or early progenitor cells.
Microarray analysis of genome-wide gene expression patterns of tumors holds significant promise to improve the diagnosis, risk stratification, and therapy outcome prediction in cancer patients. In addition to highly anticipated translational implications, microarray-based analysis of global gene expression profiles has revealed novel insights into molecular taxonomy and pathogenesis of many cancers by identifying molecularly distinct subtypes of cancer in disease groups that were viewed previously as homogenous diagnostic categories based on existing classical clinico-pathological classification models. A hypothesis-driven global gene expression profiling approach was successfully utilized to identify molecular signatures associated with activation of oncogenic pathways (Huang et al., 2003 Nature Genetics 34:226; Ellwood-Yen et al., 2003 Cancer Cell 4:223; Lee et al., 2004 Nat. Genet. 36:1306; Sweet-Cordero et al., 2005 Nat. Genet. 37:48), targeted genetic manipulations (Lamb et al., 2003 Cell 114:323), or cellular responses to physiological stimuli (Chang et al., 2004 PLoS Biol 2:E7; Chang et al., 2005 Proc. Natl. Acad. Sci. USA 102:3738) and to build robust transcriptional identifiers reliably recognizing the engagement of corresponding pathways within the high complexity patterns of gene expression in experimental and clinical tumor samples. The most recent hypothesis-driven gene expression profiling studies suggest the presence of a highly malignant subset of human cancers diagnosed in a wide range of organs and uniformly exhibiting a marked propensity toward metastatic dissemination as well as a high probability of unfavorable therapy outcome.
Unsupervised hierarchical clustering analysis of human breast tumors revealed remarkable stability of unique distinct gene expression patterns characteristic of tumors repeatedly sampled from the same individuals (Perou et al., 2000 Nature 406:747). One of the striking conclusions of these early observations was that gene expression patterns of tumor samples from the same individual were more similar to each other than either sample was to any other tumor sample within a data set. The epigenetic stability of individual molecular portraits of human breast tumors was confirmed when tumors were sampled before and after a 16-week course of doxorubicin chemotherapy as well as when primary tumors and lymph node metastases were compared (Perou et al., supra), suggesting that the unique make-up of genetic traits governing mRNA abundance levels of thousands of genes in tumors was maintained during tumor progression and remained stable during chemotherapy.
The next important insight came from the expression profiling study delineating a genetic signature of metastasis associated with gene expression patterns of distant metastatic lesions recovered from multiple organs of patients diagnosed with several types of common epithelial cancers (Ramaswamy et al., 2003 Nature Genetics 33:49). This study demonstrated that expression of a molecular signature of metastasis in primary solid tumors of seemingly various origins was almost uniformly associated with poor prognosis and short survival after therapy in patients with multiple types of cancer.
Several expression profiling studies identified transcriptionally distinct sub-sets of early-stage human carcinomas manifesting a marked propensity toward metastatic dissemination, highly malignant clinical behavior, and a poor survival after therapy (van't Veer et al., 2002 Nature 415:530; van de Vijver et al., 2002 N. Engl. J. Med. 347:1999; Glinsky et al., 2004 J. Clin Invest. 113:913; Glinsky et al., 2004 Clin. Cancer Res. 10:2272; Wang et al., 2005 Lancet 365:671). These expression profiling experiments identified genetic signatures that may be utilized as highly promising diagnostic and/or prognostic markers; however, they did not reveal any underlying genetic, molecular, or biological mechanistic insights.
Two new lines of evidence supporting this emerging genomic view of metastatic cancer were presented recently in hypothesis-driven global gene expression profiling studies (Chang et al., 2004, supra; Chang et al., 2005, supra; Glinsky et al., 2005 J. Clin Invest. 114: 1503). The starting point in these experiments was identification of gene expression signatures associated with experimentally-defined biologically important contexts such as wound healing (Chang et al., 2004, supra; Chang et al., 2005, supra) or self-renewal function of stem cells (Glinsky et al., 2005, supra). The gene expression signatures were then used to interrogate and interpret the gene expression profiles of human cancers.
In several types of common epithelial tumors such as breast, lung, and gastric cancers, expression of the wound-response signature in primary tumors predicted increased risk of metastasis and poor survival after therapy (Chang et al., 2004, supra; Chang et al., 2005, supra). These results strongly imply that primary tumors that more faithfully recapitulate the gene expression pattern of the wound-response signature and, perhaps, transmit a "presence of wound" signal, would also exhibit increased metastatic potential and are more likely to fail therapy. One of the possible consequences of a tumor's ability to transmit a "presence of wound" signal might be a propensity to attract and accumulate stem cells as a part of the physiological response to injury. Accumulation of normal stem cells in experimental tumors in vivo has been demonstrated in several studies (Aboody et al., 2000 Proc. Natl. Acad. Sci. USA 97:12846; Brown et al., Hum Gene Ther. 14:1777). One of the hallmark biological features of normal stem cells is the ability to fuse spontaneously in vitro and in vivo with other cell types, leading to formation of reprogrammed viable somatic cell hybrids (Wang et al., 2003 Nature 422:897; Vassilopoulos et al., 2003 Nature 422:901; Alvarez-Dolado et al., 2003 Nature 425:968-973; Weimann et al., 2003 Nature Cell Biology 5:959). Furthermore, the most recent studies demonstrated that committed myelomonocytic cells such as macrophages can produce functional epithelial cells by in vivo fusion (Willenbring et al., 2004 Nature Medicine 10:744), thus extending the number of cell types that might serve as hypothetically "eligible" fusion partners for tumor cells.
A mouse/human comparative translational genomics approach was utilized to identify an 11-gene signature distinguishing stem cells with normal self-renewal function versus stem cells with drastically diminished self-renewal ability due to the loss of the BMI-1 gene; this signature was then used to interrogate and interpret expression patterns of human cancers (Glinsky et al., 2005, supra). The 11-gene signature consistently displays a normal stem cell-like expression profile in distant metastatic lesions as revealed by the analysis of metastases and primary tumors from a transgenic mouse model of prostate cancer and cancer patients. To further validate these results, the prognostic power of the 11-gene signature was examined in several independent therapy outcome sets of clinical samples obtained from 1153 cancer patients diagnosed with multiple types of cancer, including five epithelial (prostate; breast; lung; ovarian; and bladder cancers) and five non-epithelial (lymphoma; mesothelioma; medulloblastoma; glioma; and acute myeloid leukemia, AML) malignancies. Kaplan-Meier analysis demonstrated that a normal stem cell-like expression profile of the 11-gene signature in primary tumors is a consistent powerful predictor of a short interval to disease recurrence, distant metastasis, and death after therapy in cancer patients diagnosed with eleven distinct types of cancer (Glinsky et al., 2005, supra). These data suggest the presence of a conserved BMI-1 oncogene-driven pathway, which is similarly engaged in both normal stem cells and a highly malignant subset of human cancers diagnosed in a wide range of organs and uniformly exhibiting a marked propensity toward metastatic dissemination as well as a therapy resistance phenotype.
There is a need in the art for a reliable method for assessing the clinical relevance of gene expression signatures derived by various approaches and based on different conceptual justifications. There is also a need in the art for identifying small informative gene sets derived from large gene expression signatures that retain their prognostic and/or diagnostic power. Such small gene sets would facilitate the development and clinical implementation of simple, inexpensive, and reliable genetic diagnostic and/or prognostic methods.
SUMMARY
The invention provides for a method of generating a subset of genes for use in predicting a phenotype in a subject.
In one embodiment the method comprises the steps of obtaining a set of expression values for a set of genes in a first sample and a second sample by measuring the level of expression in the two samples. A set of genes that are differentially expressed are identified by comparing the level of expression in the first sample with the level of expression in the second sample. An expression value that is increased or decreased in the first sample, as compared to the second sample is differentially expressed. A subset of genes for use in predicting a phenotype in a subject, wherein the subset is equal to or smaller than the set, is then identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed.
In another embodiment, the method further comprises the step of obtaining a relative weight coefficient for each member of the gene set.
In another embodiment, the method further comprises the steps of obtaining a relative weight coefficient for each member of the gene set; and multiplying the expression value by the relative weight coefficient to obtain an individual survival score for each member of the gene set. In another embodiment, the sum of the individual survival scores is calculated to obtain a survival score.
In another embodiment, the method includes the step of logarithmically transforming the expression value of each member of the gene set prior to performing the multivariate Cox analysis. In another embodiment, the method comprises the steps of logarithmically transforming the expression value of each member of the gene set; obtaining a relative weight coefficient for each member of the gene set; and multiplying the logarithmically transformed expression value by the relative weight coefficient to obtain an individual survival score for each member of the gene set. The sum of the individual survival scores is calculated to obtain a survival score.
The invention also provides for a method of generating a subset of genes for use in predicting a phenotype in a subject comprising the following steps. A set of expression values for a set of genes in a first sample and a second sample is obtained by measuring the level of expression in the first sample and the second sample. A set of genes that are differentially expressed is identified by comparing the level of expression in the first sample with the level of expression in the second sample. An expression value that is increased or decreased in the first sample, as compared to the second sample, is differentially expressed. A subset of genes for use in predicting a phenotype in a subject, wherein the subset is equal to or smaller than the set, is identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed. A relative weight coefficient is obtained for each member of the gene set. The expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes. The sum of the individual survival scores is calculated to obtain a survival score. Survival analysis is performed.
In one embodiment, the method may comprise the step of logarithmically transforming the expression value of each member of the gene set prior to performing the multivariate Cox analysis.
In one embodiment, the method of identifying a subset of genes comprises the additional steps of: identifying genes with a p value as determined by multivariate Cox analysis that is less than or equal to 0.25; obtaining a relative weight coefficient for each member of the gene set; multiplying the expression value of each member of the gene set by the relative weight coefficient to obtain an individual survival score for each member of the set of genes; calculating the sum of the individual survival scores to obtain a survival score; and performing survival analysis. In one embodiment, the steps of identifying a set of genes that are differentially expressed, identifying a subset of genes for use in predicting a phenotype by performing multivariate Cox analysis, obtaining a relative weight coefficient for each member of the gene set, obtaining an individual survival score for each member of the gene set and obtaining a survival score are repeated.
In another embodiment, the method further comprises the following steps. Genes with a p value as determined by the multivariate Cox analysis that is less than or equal to 0.25 are identified. Multivariate Cox analysis is performed on the set of genes identified, wherein a gene with a p-value that is less than or equal to 0.1 is included in the subset. A relative weight coefficient for each member of the gene set is obtained. The expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes. The sum of the individual survival scores is calculated to obtain a survival score. Survival analysis is performed.
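The staged selection described in this embodiment can be sketched as repeated filtering on p-values from successive multivariate Cox fits. This is an illustrative sketch only: `fit_cox_pvalues` is a hypothetical caller-supplied function that fits a multivariate Cox model on the given genes and returns a gene-to-p-value mapping; no particular statistics library is implied.

```python
def staged_cox_selection(genes, fit_cox_pvalues, thresholds=(0.25, 0.1)):
    """Keep genes that pass each p-value threshold in turn.

    The model is refitted on the surviving genes before each threshold is
    applied, so the p-values at the second stage come from the reduced model.
    """
    selected = list(genes)
    for threshold in thresholds:
        if not selected:  # nothing left to fit
            break
        pvalues = fit_cox_pvalues(selected)
        selected = [g for g in selected if pvalues[g] <= threshold]
    return selected
```

The surviving genes would then receive relative weight coefficients from the final fit and feed into the survival score and survival analysis steps.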
In one embodiment, the steps of identifying genes with a p value that is less than or equal to 0.1; obtaining a relative weight coefficient for each member of the gene set; obtaining an individual survival score for each member of the gene set; obtaining a survival score; and performing survival analysis are repeated. In certain embodiments, a gene with a p-value, as determined by multivariate Cox analysis, that is less than or equal to 0.25, that is less than or equal to 0.1, that is less than or equal to 0.075 or that is less than or equal to 0.05 is included in the subset. The method may include the step of performing survival analysis, for example, Kaplan-Meier analysis.
In certain embodiments, a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, that is less than or equal to 0.075 or that is less than or equal to 0.05 is included in the subset.
The method can be performed with any of the sets of genes identified in Figures 3-7 and Table 3.
In one embodiment, the subset of genes includes at least one gene of any of the subsets identified in Figures 3-7 and Table 3. The invention provides for a method of using a subset of genes to predict a phenotype in a subject comprising the following steps. A sample is isolated from a subject; and analyzed for expression of the subset of genes.
The phenotype is selected from the group consisting of disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
In one embodiment, the subset of genes is any one of the sets or subsets identified in Figures 3-7 and Table 3.
The invention also provides a method of determining the relevance of a set of genes comprising the following steps. A set of expression values for a set of genes in a first sample and a second sample is obtained by measuring the level of expression in the first sample and the second sample. A set of genes that are differentially expressed is identified by comparing the level of expression in the first sample with the level of expression in the second sample, wherein an expression value that is increased or decreased in the first sample, as compared to the second sample is differentially expressed. A subset of genes for use in predicting a phenotype in a subject, wherein the subset is equal to or smaller than the set, is identified by performing multivariate Cox analysis on the expression values for the set of genes which are differentially expressed. A relative weight coefficient is obtained for each member of the gene set. The expression value of each member of the gene set is multiplied by the relative weight coefficient to obtain an individual survival score for each member of the set of genes. The sum of the individual survival scores is calculated to obtain a survival score; and survival analysis is performed. The invention also provides for a subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
The invention also provides for a subset of genes for use in predicting a phenotype of a subject comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3. The invention also provides for a subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3, wherein the subset of genes is generated by the methods described herein.
The invention also provides for a composition comprising a set of probes that hybridize to at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
The invention also provides for a subset of genes generated by the methods described herein.
The invention also provides for a combination of gene subsets, including a combination comprising at least two of the subsets presented in Figures 3-7 and Table 3. In one embodiment, each subset of the combination comprises at least one gene of any of the subsets identified in Figures 3-7 and Table 3.
The invention also provides for a kit. In one embodiment, a kit comprises at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3. In another embodiment, a kit comprises a set of reagents for detecting the expression of at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3. In another embodiment, the kit comprises a set of probes that hybridize to at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3. In one embodiment, any one of the kits of the invention predicts the phenotype of a subject.
DEFINITIONS
As used herein, a "set of genes" refers to a group of genes. A "set of genes" according to the invention can be identified by any method now known or later developed to assess gene expression, including but not limited to measurements relating to the biological processes of nucleic acid amplification, transcription, RNA splicing, and translation. Thus, direct and indirect measures of gene copy number (e.g., as by fluorescence in situ hybridization or other type of quantitative hybridization measurement, or by quantitative PCR), transcript concentration (e.g., as by Northern blotting, expression array measurements or quantitative RT-PCR), and protein concentration (e.g., by quantitative 2-D gel electrophoresis, mass spectrometry, Western blotting, ELISA, or other method for determining protein concentration) are intended to be encompassed within the scope of the definition. In one embodiment, a "set of genes" refers to a group of genes that are differentially expressed in a first sample as compared to a second sample. As used herein, a "set of genes" refers to at least one gene, for example, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more genes.
As used herein, a "set" refers to at least one.
As used herein, "differentially expressed" refers to the existence of a difference in the expression level of a nucleic acid or protein as compared between two sample classes, for example a first sample and a second sample as defined herein. Differences in the expression levels of "differentially expressed" genes preferably are statistically significant. Preferably, there is a 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000-fold or more) increase or decrease in the expression levels of differentially expressed nucleic acid or protein. In one embodiment, there is at least a 5% (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) increase or decrease in the expression levels of differentially expressed nucleic acid or protein.
As used herein, "expression" refers to any one of RNA, cDNA, DNA, or protein expression.
"Expression values" refer to the amount or level of expression of a nucleic acid or protein according to the invention. Expression values are measured by any method known in the art and described herein.
As used herein, "increased" refers to 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000-fold or more) greater than. "Increased" also refers to at least 5% or more (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) greater than.
As used herein, "decreased" refers to 2-fold or more (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000-fold or more) less than. "Decreased" also refers to at least 5% or more (for example 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99, 100%) less than.
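For illustration only, the fold-change sense of "increased" and "decreased" may be expressed as a small Python function. The function name `expression_change`, the default threshold of 2.0 and the label "unchanged" are assumptions of this sketch; the alternative percentage criterion is not covered here.

```python
def expression_change(first: float, second: float, min_fold: float = 2.0) -> str:
    """Label the first sample's expression value relative to the second's."""
    if first <= 0 or second <= 0:
        raise ValueError("expression values must be positive to form a ratio")
    ratio = first / second
    if ratio >= min_fold:
        return "increased"   # at least min_fold greater than the second sample
    if ratio <= 1.0 / min_fold:
        return "decreased"   # at least min_fold less than the second sample
    return "unchanged"       # below the fold threshold in both directions
```

Under this sketch, a gene labeled "increased" or "decreased" would count as differentially expressed between the two samples at the chosen fold threshold.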
As used herein, a "subset of genes" refers to at least one gene of a "set of genes" as defined herein. A subset of genes is predictive of a particular phenotype, for example, disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
As used herein, "predictive" means that a set of genes or a subset of genes according to the invention is indicative of a particular phenotype of interest (for example disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure). A subset of genes according to the invention that is "predictive" of a particular phenotype correlates with a particular phenotype at least 10% or more, for example 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, 51, 52, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99 or 100%.
As used herein, a "phenotype" refers to any detectable characteristic of an organism. Preferably, a "phenotype" refers to disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure.
As used herein, "diagnosis" refers to a process of determining if an individual is afflicted with a disease or ailment.
"Prognosis" refers to a prediction of the probable occurrence and/or progression of a disease or ailment, as well as the likelihood of recovery from a disease or ailment, or the likelihood of ameliorating symptoms of a disease or ailment or the likelihood of reversing the effects of a disease or ailment. "Prognosis" is determined by monitoring the response of a patient to therapy.
As used herein, preferably a "first sample" and a "second sample" differ with respect to a phenotype, as defined herein. A "first sample" refers to a sample from a normal subject or individual, or a normal cell line.
An "individual" or "subject" includes a mammal, for example, human, mouse, rat, dog, cow, pig, sheep, etc. A "subject" includes both a patient and a normal individual.
As used herein, "patient" refers to a mammal who is diagnosed with a disease or ailment.
As used herein, "normal" refers to an individual who has not shown any disease or ailment symptoms or has not been diagnosed by a medical doctor.
A "second sample" refers to a sample from a patient or an unclassified individual, or an animal model for a disease of interest. A "second sample" also refers to a sample from a cell line that is a model for a disease of interest, for example a tumor cell line.
"Tumor" is to be construed broadly to refer to any and all types of solid and diffuse malignant neoplasias including but not limited to sarcomas, carcinomas, leukemias, lymphomas, etc., and includes by way of example, but not limitation, tumors found within prostate, breast, colon, lung, and ovarian tissues. A "tumor cell line" refers to a transformed cell line derived from a tumor sample. Usually, a "tumor cell line" is capable of generating a tumor upon explant into an appropriate host. A "tumor cell line" usually retains, in vitro, properties in common with the tumor from which it is derived, including, e.g., loss of differentiation or loss of contact inhibition, and will undergo essentially unlimited cell divisions in vitro.
A "control cell line" refers to a non-transformed, usually primary culture of a normally differentiated cell type. In the practice of the invention, it is preferable to use a "control cell line" and a "tumor cell line" that are related with respect to the tissue of origin, to improve the likelihood that observed gene expression differences or differences in RNA or protein levels, are related to gene expression changes underlying the transformation from control cell to tumor.
An "unclassified sample" refers to a sample for which classification is obtained by applying the methods of the present invention. An "unclassified sample" may be one that has been classified previously using the methods of the present invention, or through the use of other molecular biological or pathohistological analyses. Alternatively, an "unclassified sample" may be one on which no classification has been carried out prior to the use of the sample for classification by the methods of the present invention.
In a preferred embodiment, the fold expression change or differential expression data are logarithmically transformed. As used herein, "logarithmically transformed" means, for example, log10 transformed.
As used herein, "multivariate analysis" refers to any method of determining the incremental, statistical power of the members of a set of genes to predict a phenotype of interest. Methods of "multivariate analysis" useful according to the invention include but are not limited to multivariate Cox analysis. As used herein, "multivariate Cox analysis" refers to Cox proportional hazard survival regression analysis as performed by using the program presented at the world wide web at http://members.aol.com/johnp71/prophaz.html, and as described in Glinsky et al., 2005, J. Clin. Investig. 115:1503.
As used herein, "survival analysis" refers to a method of verifying that a set of genes or a subset of genes according to the invention is "predictive", as defined herein, of a particular phenotype of interest. "Survival analysis" takes the survival times of a group of subjects (usually with some kind of medical condition) and generates a survival curve, which shows how many of the members remain alive over time. Survival time is usually defined as the length of the interval between diagnosis and death, although other "start" events (such as surgery instead of diagnosis), and other "end" events (such as recurrence instead of death) are sometimes used.
Survival is often influenced by one or more factors, called "predictors" or "covariates", which may be categorical (such as the kind of treatment a patient received) or continuous (such as the patient's age, weight, or the dosage of a drug). For simple situations involving a single factor with just two values (such as drug vs placebo), there are methods for comparing the survival curves for the two groups of subjects. For more complicated situations, a special kind of regression that allows for assessment of the effect of each predictor on the shape of the survival curve is required.
A "baseline" survival curve is the survival curve of a hypothetical "completely average" subject, that is, someone for whom each predictor variable is equal to the average value of that variable for the entire set of subjects in the study. This baseline survival curve does not have to have any particular formula representation; it can have any shape whatever, as long as it starts at 1.0 at time 0 and descends steadily with increasing survival time.
The baseline survival curve is then systematically "flexed" up or down by each of the predictor variables, while still keeping its general shape. The proportional hazards method (for example Cox Multivariate analysis) computes a "coefficient", or "relative weight coefficient", for each predictor variable that indicates the direction and degree of flexing that the predictor has on the survival curve. Zero means that a variable has no effect on the curve; it is not a predictor at all. A positive coefficient indicates that larger values of the variable are associated with greater mortality. Knowing these coefficients, a "customized" survival curve for any particular combination of predictor values is constructed. More importantly, the method provides a measure of the sampling error associated with each predictor's coefficient. This allows for assessment of which variables' coefficients are significantly different from zero; that is, which variables are significantly related to survival.
Multivariate Cox analysis is used to generate a "relative weight coefficient". As used herein, a "relative weight coefficient" is a value that reflects the predictive value of each gene comprising a gene set of the invention. Multivariate Cox analysis computes a "relative weight coefficient" for each predictor variable (for example, each gene of a gene set) that indicates the direction and degree of flexing that the predictor has on a survival curve. Zero means that a variable has no effect on the curve and is not a predictor at all. A positive coefficient indicates that larger values of the variable are associated with greater mortality. Knowing these "relative weight coefficients", a survival curve can be constructed for any combination of predictor values.
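As a minimal sketch of how a "customized" survival curve may be constructed from the coefficients, the standard proportional hazards relation S(t|x) = S0(t) raised to the power exp(sum of b_i * (x_i - mean_i)) can be applied to a baseline curve defined at the average predictor values. The function name `customized_survival` and its parameter names are assumptions of this sketch; the coefficients and baseline curve are taken as given rather than fitted here.

```python
import math
from typing import List, Sequence


def customized_survival(baseline: Sequence[float],
                        coefficients: Sequence[float],
                        values: Sequence[float],
                        means: Sequence[float]) -> List[float]:
    """Flex a baseline survival curve for one combination of predictor values.

    baseline     -- S0(t) sampled at successive times, for the average subject
    coefficients -- relative weight coefficient for each predictor
    values       -- this subject's value for each predictor
    means        -- study-wide average of each predictor
    """
    linear = sum(b * (x - m) for b, x, m in zip(coefficients, values, means))
    relative_risk = math.exp(linear)
    # A relative risk above 1 pulls the curve down; below 1 lifts it.
    return [s0 ** relative_risk for s0 in baseline]
```

With all predictor values equal to their means the function returns the baseline unchanged, and a zero coefficient leaves the curve unaffected by that predictor, consistent with the description above.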
As used herein, a "correlation coefficient" means a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is perfect linear relationship with positive slope between the two variables, there is a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, there is a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables.
Any one of a number of commonly used correlation coefficients may be used, including correlation coefficients generated for linear and non-linear regression lines through the data. Representative correlation coefficients include the correlation coefficient, ρx,y, that ranges between -1 and +1, such as is generated by Microsoft Excel's CORREL function; the Pearson product moment correlation coefficient, r, that also ranges between -1 and +1 and that reflects the extent of a linear relationship between two data sets, such as is generated by Microsoft Excel's PEARSON function; or the square of the Pearson product moment correlation coefficient, r², through data points in known y's and known x's, such as is generated by Microsoft Excel's RSQ function. The r² value can be interpreted as the proportion of the variance in y attributable to the variance in x.
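For readers working outside a spreadsheet, the Pearson product moment correlation coefficient r and its square may be computed as in the following Python sketch. The function names `pearson_r` and `r_squared` are assumptions of this sketch and are not part of the specification.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product moment correlation coefficient of two data sets."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length data sets with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant data set")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Square of the Pearson coefficient."""
    return pearson_r(xs, ys) ** 2
```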
In one embodiment, a correlation coefficient, ρx,y, is greater than or equal to 0.8, or is greater than or equal to 0.9, or is greater than or equal to 0.95, or is greater than or equal to 0.995. One of ordinary skill can readily work out equivalent values for other types of transformations (e.g. natural log transformations) and other types of correlation coefficients either mathematically, or empirically using samples of known classification.
In a refinement of this preferred embodiment, the magnitude of the correlation coefficient can be used as a threshold for classification. The larger the magnitude of the correlation coefficient, the greater the confidence that the classification is accurate. As one of ordinary skill readily will appreciate, the appropriate threshold can be determined through the use of test data that seek to classify samples of known classification using the methods of the present invention. The threshold is adjusted so that a desired level of accuracy (e.g., greater than about 70%, or greater than about 80%, or greater than about 90%, or greater than about 95%, or greater than about 99% accuracy) is obtained. This accuracy refers to the likelihood that an assigned classification is correct. Of course, the tradeoff for the higher confidence is an increase in the fraction of samples that are unable to be classified according to the method. That is, the increase in confidence comes at the cost of a loss in sensitivity.
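The tradeoff between accuracy and the fraction of samples classified may be examined on test data of known classification with a sketch such as the following. It assumes, purely for illustration, a binary rule in which a positive correlation coefficient assigns a sample to one class and a negative coefficient to the other, and in which samples whose coefficient magnitude falls below the threshold are left unclassified; the function name `evaluate_threshold` is likewise an assumption.

```python
from typing import Optional, Sequence, Tuple


def evaluate_threshold(correlations: Sequence[float],
                       true_labels: Sequence[bool],
                       threshold: float) -> Tuple[Optional[float], float]:
    """Return (accuracy among classified samples, fraction of samples classified)."""
    if not correlations or len(correlations) != len(true_labels):
        raise ValueError("need equal-length, non-empty inputs")
    # Only samples whose |r| reaches the threshold receive a classification.
    calls = [(r > 0) == label
             for r, label in zip(correlations, true_labels)
             if abs(r) >= threshold]
    fraction = len(calls) / len(correlations)
    accuracy = sum(calls) / len(calls) if calls else None
    return accuracy, fraction
```

Evaluating the function over a range of thresholds shows where the desired accuracy is first reached and how many samples remain classifiable at that point.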
According to one embodiment of the invention, the expression value, or logarithmically transformed expression value for each member of a set of genes is multiplied by a "relative weight coefficient", as defined herein and as determined by multivariate Cox analysis, to provide an "individual survival score" for each member of a set of genes.
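By way of example, the individual survival scores and their sum (the survival score recited in the embodiments above) may be computed as in the following Python sketch. The function name `survival_score`, the dictionary-based inputs and the gene names used in the usage note are assumptions made for illustration; the weights are taken as already obtained from multivariate Cox analysis.

```python
import math
from typing import Mapping


def survival_score(expression: Mapping[str, float],
                   weights: Mapping[str, float],
                   log_transform: bool = True) -> float:
    """Sum of (relative weight coefficient x expression value) over the gene set."""
    total = 0.0
    for gene, weight in weights.items():
        value = expression[gene]
        if log_transform:
            value = math.log10(value)  # requires a positive expression value
        total += weight * value        # individual survival score for this gene
    return total
```

For instance, with hypothetical expression values of 100 and 10 for two genes and weights of 0.5 and -1.0, the log10-transformed individual scores are 1.0 and -1.0, giving a survival score of 0.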
As used herein, a "survival score" refers to the sum of the individual survival scores for each member of a set of genes of the invention.
"Survival analysis" includes but is not limited to Kaplan-Meier Survival Analysis. In one embodiment, Kaplan-Meier survival analysis is carried out using GraphPad Prism version 4.00 software (GraphPad Software) or as described in Glinsky et al., 2005, supra. Statistical significance of the difference between the survival curves for different groups of patients is assessed using Chi square and Logrank tests.
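For illustration, the Kaplan-Meier product-limit estimate underlying such survival curves may be computed as in the sketch below. This is not the GraphPad Prism implementation; the function name `kaplan_meier` and the input layout (one follow-up time and one event flag per subject, with a false flag denoting a censored observation) are assumptions, and the Logrank comparison between groups is not shown.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored subjects both leave the risk set
    return curve
```

Applying the function separately to patient groups defined by their survival scores yields the stratified curves that are then compared statistically.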
A p-value according to the invention is less than or equal to 0.25, preferably less than or equal to 0.1 and more preferably, less than or equal to 0.075, for example, 0.075, 0.070, 0.065, 0.060, 0.055, 0.050, etc., and most preferably less than or equal to 0.05, for example, 0.05, 0.045, 0.040, 0.035, 0.020, 0.010, etc. A "p-value" as used herein refers to a p-value generated for a set of genes by multivariate Cox analysis. A "p-value" as used herein also refers to a p-value for each member of a set of genes. A "p-value" also refers to a p-value derived from Kaplan-Meier analysis, as defined herein. A "p-value" of the invention is useful for determining if a set of genes or a subset of genes of the invention is predictive of a phenotype.
A "combination of gene sets" refers to at least two gene sets according to the invention. A "combination of gene subsets" refers to at least two gene subsets according to the invention. As used herein, the term "probe" refers to a labeled oligonucleotide which forms a duplex structure with a gene in a gene set or gene subset of the invention, due to complementarity of at least one sequence in the probe with a sequence in the gene. Probes useful for the formation of a cleavage structure according to the invention are between about 17-40 nucleotides in length, preferably about 17-30 nucleotides in length and more preferably about 17-25 nucleotides in length.
As used herein, a "primer" or an "oligonucleotide primer" refers to a single stranded DNA or RNA molecule that is hybridizable to a gene in a gene set or gene subset of the invention and primes enzymatic synthesis of a second nucleic acid strand. Oligonucleotide primers useful according to the invention are between about 10 to 100 nucleotides in length, preferably about 17-50 nucleotides in length and more preferably about 17-45 nucleotides in length.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1. Classification of prostate cancer patients into sub-groups with distinct therapy outcome based on expression profile of the 11-gene MTTS/PNS signature. Figure 1 shows the Kaplan-Meier survival curves for 79 prostate cancer patients stratified into distinct sub-groups using a weighted survival predictor score algorithm.
Figure 2. Classification of patients diagnosed with four different types of epithelial cancer into sub-groups with distinct therapy outcome based on expression profile of the 11-gene MTTS/PNS signature. Figures 2A-D show the Kaplan-Meier survival curves for breast cancer patients and ovarian cancer patients stratified into distinct sub-groups using a weighted survival predictor score algorithm.
Figure 3A-3Q-1. Identification and analysis of cyclin D1 gene signatures.
Figure 4A-4C-1. Identification and analysis of Myc gene signatures.
Figure 5A-5V-2. Identification and analysis of 100 most variable loci gene signatures.
Figure 6A-6K-1. Identification and analysis of 14q32 regulon gene signatures.
Figure 7A-7R-5. Identification and analysis of Suz12 gene signatures.
DETAILED DESCRIPTION
Tumors can be extremely heterogeneous due to genomic instability that leads to continuously emerging phenotypic diversity, clonal evolution, and clonal selection that occurs during malignant progression. Associated with the phenotypic diversity of cancer cells are significant changes in gene expression due to mutations. However, not all mutations and differences in gene expression are crucial or even relevant to the malignant phenotype. It is important to identify expression changes that are highly relevant and characteristic of malignant phenotypes and progression pathways (Hanahan, D., Weinberg, R. A. The hallmarks of cancer. Cell. 2000. 100: 57-70, incorporated herein by reference.). The invention provides methods for identifying expression changes that are highly correlated with, and predictive of certain clinically relevant features of malignant phenotypes and progression pathways.
I. Obtaining Expression Values
Expression values for any member of a gene set or subset according to the invention can be obtained by any method now known or later developed to assess gene expression, including but not limited to measurements relating to the biological processes of nucleic acid amplification, transcription, RNA splicing, and translation. Direct and indirect measures of gene copy number (e.g., as by fluorescence in situ hybridization or other type of quantitative hybridization measurement, or by quantitative PCR), transcript concentration (e.g., by Northern blotting, expression array measurements or quantitative RT-PCR), and protein concentration (e.g., by quantitative 2-D gel electrophoresis, mass spectrometry, Western blotting, ELISA, or other method for determining protein concentration) are intended to be encompassed within the scope of the definition.
Tissue Processing for mRNA and RNA Isolation
Fresh frozen orthotopic tumor is examined by use of hematoxylin and eosin stained frozen sections. Typically, orthotopic tumors of all sublines exhibit similar morphology consisting of sheets of monotonous closely packed tumor cells with little evidence of differentiation interrupted by only occasional zones of largely stromal components, vascular lakes, or lymphocytic infiltrates. Fragments of tumor judged free of these non-epithelial clusters are used for mRNA preparation. Frozen tissue (1-3 mm x 1-3 mm) is submerged in liquid nitrogen in a ceramic mortar and ground to powder. The frozen tissue powder is dissolved and immediately processed for mRNA isolation using a FastTract kit for mRNA extraction (Invitrogen, Carlsbad, Calif.) according to the manufacturer's instructions.
RNA and mRNA Extraction
For gene expression analysis, cells are harvested in lysis buffer 2 hrs after the last media change at 70-80% confluence and total RNA or mRNA is extracted using the RNeasy (Qiagen, Chatsworth, Calif.) or FastTract kits (Invitrogen, Carlsbad, Calif.). Cell lines are not split more than 5 times prior to RNA extraction, except where noted.
Affymetrix Arrays
The protocol for mRNA quality control and gene expression analysis is as recommended by Affymetrix (http://www.affymetrix.com). In brief, approximately one microgram of mRNA is reverse transcribed with an oligo(dT) primer that has a T7 RNA polymerase promoter at the 5' end. Second strand synthesis is followed by cRNA production incorporating a biotinylated base. Hybridization to Affymetrix U95Av2 arrays representing 12,625 transcripts overnight for 16 h is followed by washing and labeling using a fluorescently labeled antibody. The arrays are read and data processed using Affymetrix equipment and software as reported previously (LaTulippe et al., 2002, Cancer Res. 62:4499; Glinsky et al., 2003 Molecular Carcinogenesis 37:209).
Data Analysis
Detailed protocols for data analysis and documentation of the sensitivity, reproducibility and other aspects of the quantitative statistical microarray analysis using Affymetrix technology have been reported (LaTulippe et al., 2002, Cancer Res. 62:4499; Glinsky et al., 2003 Molecular Carcinogenesis 37:209). Analysis of gene expression is performed using Affymetrix MicroDB v.3.0 and DMT v.3.0 software or the Affymetrix Microarray Suite v.5.0 software. Statistical analysis of expression data set is performed using the Affymetrix MicroDB and Affymetrix DMT software. In certain embodiments, Pearson correlation coefficients are determined using the Microsoft Excel software as described in the signature discovery protocol.
Q-RT-PCR analysis
The real-time PCR method measures the accumulation of PCR products with a fluorescence detector system and allows for quantification of the amount of amplified PCR products in the log phase of the reaction. Total RNA is extracted using RNeasy Mini Kit (QIAGEN) according to the manufacturer's instructions. A measure of 1 μg (tumor samples), or 2 μg and 4 μg (independent preparations of reference cDNA samples), of total RNA is then used as a template for cDNA synthesis with Superscript II (Invitrogen Corp.). Q-RT-PCR primer sequences are selected for each cDNA with the aid of Primer Express software (Applied Biosystems). PCR amplification is performed with gene-specific primers.
Q-RT-PCR reactions and measurements are performed with SYBR Green and ROX (Applied Biosystems) as a passive reference, using the ABI 7900HT Sequence Detection System (Applied Biosystems). Conditions for the PCR are, for example, as follows: 1 cycle of 10 minutes at 95°C; and 40 cycles of 0.20 minutes at 94°C, 0.20 minutes at 60°C, and 0.30 minutes at 72°C. The results are normalized to the relative amount of expression of an endogenous control gene, for example, GAPDH.
II. Gene Sets
In one embodiment, the methods of the invention use gene expression data from a set of tumor cell lines and compare those data with gene expression data from a set of control cell lines to identify those genes that are differentially expressed in the tumor cell lines as compared to the control cell lines. In preferred embodiments, each of these sets includes more than a single member, although it is contemplated to be within the scope of the present invention to practice embodiments in which either or both of the set of tumor cell lines and the set of control cell lines includes only one member. The identified genes are referred to as a set of expressed genes. Preferably, the control cell line and the tumor cell lines are related insofar as the control cell lines represent physiologically normal cells from the tissue or organ from which the tumor represented by the tumor cell lines arose. For example, if the tumor cell lines are derived from a prostate tumor, the control cell lines preferably are primary cultures of normal prostate epithelial cells. In the preferred embodiments, more than one tumor cell line and more than one control cell line is used to generate the set of genes so as to reduce the number of genes in the set by eliminating those genes that are not consistently differentially expressed between the tumor and control cell lines.
In other embodiments, the method may be practiced using only one tumor cell line and one control cell line, and identifying the set of genes differentially expressed between the tumor cell line and the control cell line. However, by carrying out a series of comparisons between multiple control cell lines and multiple tumor cell lines, the set is more likely to contain only those genes that are consistently differentially expressed between the normal and tumor classes of cell lines (i.e., a gene is included within the set if its expression level is always higher or lower in each of the tumor cell lines examined as compared to each of the control cell lines examined).
In yet another embodiment, the methods of the invention are practiced without the use of cell lines, using instead data derived only from clinical samples. In a similar manner, the methods of the invention may be practiced using only data derived from cell lines.
In an embodiment in which a set of genes is derived, for example, using data obtained from three separate control cell lines and six separate tumor cell lines, for each gene considered for inclusion within the set of genes, pairwise comparisons are carried out for each of the 3x6, or 18, pairwise combinations between control cell lines and tumor cell lines. A candidate gene will be included in the set if each of the 18 pairwise comparisons reveals the gene to be consistently differentially expressed (i.e., gene expression always is higher in the control cell line or always higher in the tumor cell line for each of the 18 pairwise comparisons). As one of ordinary skill readily will appreciate, it may sometimes be necessary to scale the datasets prior to carrying out the pairwise comparisons. Such scaling may be routinely implemented in the analysis software provided by commercial suppliers of expression arrays or array readers (such as, e.g., Affymetrix, Santa Clara, Calif.). For a general discussion of data scaling and differential gene expression analysis, see, e.g., Affymetrix Microarray Suite 4.0 User Guide, Affymetrix, Santa Clara, Calif., incorporated herein by reference.
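The pairwise consistency criterion described above can be illustrated with a minimal sketch. The gene names, cell line labels and expression values below are hypothetical, the data are assumed to be already scaled, and the function names are introduced only for illustration; this is not the analysis software referenced in the text.

```python
from itertools import product


def consistent_direction(control_values, tumor_values):
    """Classify one gene across all control x tumor pairs.

    Returns "up" if the gene is higher in every tumor line than in every
    control line, "down" if it is lower in every pair, otherwise None.
    """
    pairs = list(product(control_values, tumor_values))
    if all(t > c for c, t in pairs):
        return "up"
    if all(t < c for c, t in pairs):
        return "down"
    return None


def select_consistent_genes(expression, control_lines, tumor_lines):
    """Keep only genes that pass every pairwise comparison."""
    selected = {}
    for gene, by_line in expression.items():
        direction = consistent_direction(
            [by_line[c] for c in control_lines],
            [by_line[t] for t in tumor_lines],
        )
        if direction is not None:
            selected[gene] = direction
    return selected


# Hypothetical scaled expression values: 3 control lines, 6 tumor lines.
controls = ["N1", "N2", "N3"]
tumors = ["T1", "T2", "T3", "T4", "T5", "T6"]
expression = {
    "GENE_A": dict(zip(controls + tumors, [1, 2, 1, 5, 6, 7, 5, 8, 9])),
    "GENE_B": dict(zip(controls + tumors, [9, 8, 9, 2, 3, 1, 2, 4, 3])),
    "GENE_C": dict(zip(controls + tumors, [4, 5, 4, 6, 3, 7, 5, 8, 2])),
}

gene_set = select_consistent_genes(expression, controls, tumors)
n_comparisons = len(controls) * len(tumors)  # 3 x 6 = 18 pairwise comparisons
```

In this sketch GENE_C is excluded because at least one of the 18 comparisons contradicts the others, which is the behavior the screening criterion calls for.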
A set of genes according to the invention is therefore a set of genes that have met a screening criterion requiring that the genes be differentially expressed between at least tumor and control cell lines. This criterion reflects the hypothesis that differences in the tumor and control cell phenotypes are driven, at least in part, by differences in gene expression patterns in the tumor and control cells.
Because the tumor and control cell lines have at some point been cultured in vitro, their gene expression patterns likely will not exactly correspond with the expression patterns of their counterparts grown in vivo. Consequently, the methods of the invention may use additional steps to establish a set of expressed genes that are differentially expressed in cells of biological samples that differ with respect to a classification. The classification may be an outcome predictor or cellular phenotype or any type of classification that may be used for classifying biological samples. The classification may be binary (i.e., for two mutually exclusive classes such as, invasive/non-invasive, metastatic/non-metastatic, etc.), or may be continuously or discretely variable (i.e., a classification that can assume more than two values such as, e.g., Gleason scores, survival odds, etc.) The only requirement is that the classified trait must be something that can be observed and characterized by the assignment of a variable or other type of identifier so that samples belonging to the same class may be grouped together during the analysis.
A set of expressed genes may also be obtained following essentially the same techniques described above, except sets of samples obtained from in vivo sources are used instead of sets of cell lines. In embodiments of the invention directed toward tumor analysis, classification or prognostication, the sample sets preferably consist of tumor samples obtained from patients that are analyzed without any intervening tissue culturing steps so that the gene expression patterns reflect as closely as possible the pattern within cells growing in their undisturbed, in vivo environment. The goal is to obtain a reference set that includes genes differentially expressed between samples belonging to different classifications. As is the case with the first reference set, it is preferable to include several independent samples within a classified set and to carry out a plurality of pairwise comparisons to identify differentially expressed genes for inclusion into the second reference set.
For example, assume the classification of interest is invasiveness (e.g., turning on whether tumor-free surgical margins are observed), it is preferable to use as the sample sets a number of invasive samples and a number of non-invasive samples. The number of pairwise comparisons that can be carried out is of course equal to the product of the numbers of independent samples in each category. Ideally, each of these pairwise comparisons is carried out and the same criterion for determining differential expression described above is used to select genes for inclusion into a second reference set. It is contemplated that in certain instances, especially, e.g., when the variance within a sample set is low, it will not be necessary to carry out all pairwise comparisons to select genes for inclusion into a set of genes according to the invention. In practice, one of ordinary skill can readily determine whether it is advantageous to carry out all pairwise comparisons, or fewer than all pairwise comparisons by examining the convergence behavior of the reference sets as additional comparisons are carried out. If the sets apparently converge prior to completion of all possible pairwise comparisons, then the added benefit of exhaustive comparison may be small and so can be avoided. Similar principles drive the selection of the numbers of cell lines and cell samples used to derive a set of genes according to the invention as apply to the study of other cell and molecular biological phenomena. One of ordinary skill readily will appreciate that the accuracy of the reference sets can increase as more cell lines and samples are used so that statistical noise is minimized. It currently is contemplated that preferred numbers of different cell lines and samples per set used for calculating reference sets be in the range of 2 to 50 per set, or in the range of 2 to 25, or in the range of 2 to 10, or in the range of 3 to 5 per set. 
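One possible way to examine the convergence behavior mentioned above is sketched below. The stopping rule (the candidate set remaining unchanged for a fixed number of consecutive comparisons, here called `patience`) is an assumption made for illustration and is not specified in the text; sample labels and expression values are likewise hypothetical.

```python
from itertools import product


def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def converge_reference_set(expression, class_a, class_b, patience=3):
    """Refine the candidate set one pairwise comparison at a time.

    A gene stays in the set only while every comparison so far shows the
    same direction of change. Iteration stops early once the set has been
    unchanged for `patience` consecutive comparisons (assumed rule).
    Returns the surviving genes with their direction and the set size
    recorded after each comparison.
    """
    current = None
    sizes = []
    unchanged = 0
    for a, b in product(class_a, class_b):
        signs = {g: sign(v[a] - v[b]) for g, v in expression.items()}
        if current is None:
            updated = {g: s for g, s in signs.items() if s != 0}
        else:
            updated = {g: s for g, s in current.items() if signs[g] == s}
        unchanged = unchanged + 1 if updated == current else 0
        current = updated
        sizes.append(len(current))
        if unchanged >= patience:
            break
    return current, sizes


# Hypothetical data: 3 invasive and 3 non-invasive samples (9 possible pairs).
invasive = ["I1", "I2", "I3"]
non_invasive = ["N1", "N2", "N3"]
expression = {
    "G1": {"I1": 9, "I2": 8, "I3": 10, "N1": 2, "N2": 3, "N3": 1},
    "G2": {"I1": 1, "I2": 2, "I3": 1, "N1": 5, "N2": 6, "N3": 7},
    "G3": {"I1": 5, "I2": 5, "I3": 5, "N1": 4, "N2": 6, "N3": 4},
}

genes, sizes = converge_reference_set(expression, invasive, non_invasive)
```

With these values the set stabilizes after five of the nine possible comparisons, illustrating how exhaustive comparison may add little once the set has converged.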
While not preferred, it also is contemplated to be within the scope of the present invention to use sets consisting of a single type of cell in one or more of the four sets of input cells used to calculate the first and second reference sets (i.e., tumor cell lines, control cell lines, first sample, and second sample). Direct statistical analysis using T-test and/or Mann-Whitney test for identification of genes differentially expressed in sets of biological samples that differ with respect to a classification is also applicable to the methods of the present invention. The average expression values for genes across the first and second sets of biological samples that differ with respect to a classification are used for calculation of fold expression changes (see below).
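As a minimal sketch of the calculations named above, the following computes a fold expression change from the average expression values of two sample sets, together with the Mann-Whitney U statistic. The fold change is assumed here to be the ratio of the two averages; the expression values are hypothetical, and converting U (or a t statistic) into a p-value would require a statistics package, which is outside this sketch.

```python
from statistics import mean


def fold_change(first, second):
    """Ratio of average expression in the first set to the second set.

    Assumes fold change is defined as a simple ratio of averages.
    """
    return mean(first) / mean(second)


def mann_whitney_u(first, second):
    """Mann-Whitney U statistic for the first set.

    Counts the (x, y) pairs in which x exceeds y; ties contribute 0.5.
    """
    u = 0.0
    for x in first:
        for y in second:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical expression values for one gene in two classified sample sets.
first_set = [9.0, 12.0, 15.0]
second_set = [2.0, 3.0, 4.0, 3.0]

fc = fold_change(first_set, second_set)
u = mann_whitney_u(first_set, second_set)
```

A U value equal to the product of the two set sizes means every sample in the first set exceeds every sample in the second, which corresponds to the all-pairs consistency criterion used elsewhere in this description.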
Gene sets and subsets of the invention are presented in Figures 3-7 and Table 3. Each gene of a gene set or subset presented herein is identified by at least one of a Genbank Accession number, a Unigene number, an Image clone ID (which can be used to identify the corresponding Genbank number on the world wide web at image.llnl.gov/image/IQ/bin/buildIQ?categoryChoice=clone~Clone) or an Affymetrix probe set ID.
III. Diseases Useful According To The Invention
The methods of the invention are useful for studying any known cancer including but not limited to adrenal cancer, AIDS-related lymphoma, anal cancer, ataxia-telangiectasia, bladder cancer, brain tumors, breast cancer, cervical cancer, chronic lymphocytic leukemia, chronic myelogenous leukemia, colorectal cancer, craniopharyngioma, cutaneous T-cell lymphoma, endometrial and uterine cancer, esophageal cancer, Ewing's sarcoma, fallopian tube cancer, gallbladder cancer, gastric cancer, gestational trophoblastic disease, choriocarcinoma, Hairy cell leukemia, Hodgkin's disease, Kaposi's sarcoma, kidney cancer, laryngeal cancer, leukemia including acute lymphocytic leukemia and acute myelogenous leukemia, Li-Fraumeni syndrome, liver cancer, lung cancer, Hodgkin's lymphoma, Non-Hodgkin's lymphoma, medulloblastoma, melanoma, mesothelioma, metastases, myelomas, myeloproliferative disorders, neuroblastoma, Non-Hodgkin's disease, non-small cell lung cancer, oropharyngeal cancers, osteosarcoma, ovarian cancer, pancreatic cancer, parathyroid cancer, penile cancer, pituitary cancer, prostate cancer, retinoblastoma, rhabdomyosarcoma and other soft-tissue sarcomas, osteosarcoma, small intestine cancer, small-cell lung cancer, testicular cancer, thymoma, thyroid cancer, urethral cancer, vaginal cancer, vulvar cancer and Wilms' tumor.
In preferred embodiments, the methods of the invention are useful for studying the gene expression profiles that are predictive of prostate cancer, breast cancer and cancer metastasis.
Prostate Cancer
As many as 50% of men aged 70 years and over have microscopic foci of prostate cancer without clinical evidence of disease (Trump, D. L., Robertson, C. N., Holland, J. F., Frei, E., Bast, R. C, Kufe, D. W., Morton, D. L., and Weishselbaum, R. R., Neoplasms of the prostate. In: D. L. Trump, C. N. Robertson, J. F. Holland, E. Frei, R. C. Bast, D. W. Kufe, D. L. Morton, and R. R. Weishselbaum (eds.), Cancer Med, Vol. 3, pp. 1562-86. Philadelphia: Lea & Febiger, 1993.). Although some prostate cancers remain indolent and confined to the gland, other prostate cancers behave more aggressively and metastasize if not adequately treated. Prostate cancer is the second most lethal neoplasia in males after lung cancer.
Because of widespread screening programs utilizing serum PSA values, many more cases of early stage disease are being diagnosed.
Unfortunately, the only potentially curative therapy for prostate cancer consists of radical prostatectomy or other local therapies such as external irradiation, implanted irradiation seeds, or cryotherapy. The use of prostatectomy has increased in step with the amount of diagnosed early stage prostate cancer. SEER data indicate an increase in prostatectomies from 17.4 per 100,000 in 1988 to 54.6 per 100,000 in 1992. Insufficient treatment leads to local disease extension and metastasis. Current measures, such as Gleason scores, do not reliably indicate whether a tumor is aggressive or indolent. Thus, developing a treatment strategy appropriate for any individual is difficult. The recognition of those genetic changes that portend metastatic prostate cancer would, therefore, be a breakthrough. The methods of the present invention readily identify such genetic changes.
Breast Cancer
Breast cancer is the most common cancer among women in North America and Western Europe and is the second leading cause of female cancer death in the United States. In the United States, age-adjusted breast cancer incidence rates have increased considerably during the last century. Approximately 40% of patients diagnosed with breast cancer have disease that has regional or distant metastases and, at present, there is no efficient curative therapy for breast cancer patients with advanced metastatic disease. Developing a treatment strategy appropriate for any individual with early stage disease is difficult, and insufficient treatment leads to local disease extension and metastasis. Therefore, there is an urgent clinical need for novel diagnostic methods that would allow early identification of those breast cancer patients who are likely to develop metastatic disease and would require the most aggressive and advanced forms of therapy for increased chance of survival. The identification of those genetic changes that distinguish aggressive metastatic disease and predict metastatic behavior would, therefore, be a breakthrough. The methods of the present invention provide information that allows prognostication of aggressive metastatic disease.
Recent gene expression analysis of human tumor samples employing cDNA microarray technology underscores the difficulties in identification of the cellular origin of differentially expressed transcripts in clinical samples due to the remarkable cellular heterogeneity and variability in cellular compositions of human tumors (Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C, Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caliguri, M. A., Bloomfield, C. D., Lander, E. S. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286: 531-537; Perou C M, Jeffrey S S, van de Rijn M, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA. 1999. 96:9212-9217; Perou C M, Sorlie T, Eisen M B, et al. Molecular portrait of human breast tumors. Nature. 2000. 406:747-752, incorporated herein by reference). However, a cDNA microarray analysis of gene expression in melanoma cell lines of distinct metastatic potential was successfully employed for identification of RhoC as an essential gene for the acquisition of metastatic phenotype by melanoma cells (Clark, E A, Golub T R, Lander E S, Hynes R O. Genomic analysis of metastasis reveals an essential role for RhoC. Nature 2000. 406:532-535, incorporated herein by reference). Established human cancer cell lines were utilized for parallel comparisons of the alterations in DNA copy number and gene expression associated with human breast cancer (Pollack, J. R., Perou, C. M., Alizadeh, A. A., Eisen, M. B., Pergamenschikov, A., Williams, C. F., Jeffrey, S. S., Botstein, D., Brown, P. O. Genome-wide analysis of DNA-copy number changes using cDNA microarrays. Nature Genetics. 1999. 23: 41-46; Forozan, F., Mahlamaki, E. H., Monni, O., Chen, Y., Veldman, R., Jiang, Y., Gooden, G. C, Ethier, S. P., Kallioniemi, A., Kallioniemi, O-P. Comparative genomic hybridization analysis of 38 breast cancer cell lines: a basis for interpreting complementary DNA microarray data. Cancer Res. 2000. 60: 4519-4525, incorporated herein by reference). Thus, model systems are a reasonable source of gene candidates to be studied in the much more heterogeneous environment of real human tumors.
Analysis of gene expression in normal and neoplastic ovarian human tissues using methods of the present invention revealed that high malignant potential ovarian cancers exhibited a gene expression profile somewhat similar to the ovarian cancer cell lines (Welsh, J. B., Zarrinkar, P. P., Sapinoso, L. M., Kern, S. G., Behling, C. A., Monk, B. J., Lockhart, D. J., Burger, R. A., Hampton, G. M. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci USA. 2001. 98:1176-1181, incorporated herein by reference), further validating the complementary gene expression analysis approach utilizing selected established cancer cell lines and clinical samples.
Metastasis
Cancer cells have exceedingly low survival rates in the circulation (reviewed in Glinsky, G. V. 1993. Cell adhesion and metastasis: is the site specificity of cancer metastasis determined by leukocyte-endothelial cell recognition and adhesion? Crit. Rev. Onc./Hemat., 14: 229-278, incorporated herein by reference). Even if the bloodstream contains many cancer cells, there may be no clinical or pathohistological evidence of metastatic dissemination into the target organs (Williams, W. R. The theory of Metastasis. In The Natural History of Cancer. 1908; 442-448; Goldmann, E. 1907. The growth of malignant disease in man and the lower animals, with special reference to the vascular system. Proc. R. Soc. Med., 1: 1-13; Schmidt, M. B. In Die Verbreitungswege der Karzinome und die Beziehung generalisierter Sarkome zu den leukämischen Neubildungen. Fischer, Jena, 1903, incorporated herein by reference). The levels of metastatic efficiency at the intramicrovascular (postintravasation) phase of metastatic dissemination were shown to be only 0.2% and 0.003% in high and low metastatic variants of B16 melanoma cells, respectively, injected at a concentration of 10^5 cells into the tail veins of laboratory mice (Weiss, L. 1990. Metastatic inefficiency. Adv. Cancer Res., 54: 159-211; Weiss, L., Mayhew, E., Glaves-Rapp, D., Holmes, J. C. 1982. Metastatic inefficiency in mice bearing B16 melanomas. Br. J. Cancer, 45: 44-53, incorporated herein by reference). Cancer cells in the circulation first undergo a rapid phase of intramicrovascular cell death, which is completed in <5 minutes and accounts for 85% of arrested cancer cells. This is followed by a slow phase of cell death, which accounts for the vast majority of the remainder (Weiss, L. 1988. Biomechanical destruction of cancer cells in the heart: a rate regulator of hematogenous metastasis. Invas. Metastasis, 8: 228-237; Weiss, L., Orr, F. W., Honn, K. V. 1988. Interactions of cancer cells with the microvasculature during metastasis. FASEB J., 2: 12-21; Weiss, L., Harlos, J. P., Elkin, G. 1989. Mechanism of mechanical trauma to Ehrlich ascites tumor cells in vitro and its relationship to rapid intravascular death during metastasis. Int. J. Cancer, 44: 143-148, incorporated herein by reference).
For example, the number of tumor cells in the lungs declined very rapidly after intravenous injection, i.e., 90-99% had disappeared after 24 hours (Hewitt, H. B., Blake, A. 1975. Quantitative studies of translymphonodal passage of tumor cells naturally disseminating from a nonimmunogenic murine squamous carcinoma. Br. J. Cancer, 31: 25-35; Fidler, I. J. 1970. Metastasis: quantitative analysis of distribution and fate of tumor emboli labeled with 125I-5-iodo-2'-deoxyuridine. J. Natl. Cancer Inst., 45: 773-782; Proctor, J. W. 1976. Rat sarcoma model supports both soil seed and mechanical theories of metastatic spread. Br. J. Cancer, 34: 651-654; Proctor, J. W., Auclair, B. G., Rudenstam, C. M. 1976. The distribution and fate of blood-born 125IudR-labeled tumor cells in immune syngeneic rats. Int. J. Cancer, 18: 255-262; Weston, B. J., Carter, R. L., Eastry, G. C, Connell, D. I., Davies, A. J. C. 1974. The growth and metastasis of an allografted lymphoma in normal, deprived and reconstituted mice. Int. J. Cancer, 14: 176-185; Kodama, M., Kodama, T. 1975. Enhancing effect of hydrocortisone on hematogenous metastasis of Ehrlich ascites tumor in mice. Cancer Res., 35: 1015-1021, incorporated herein by reference) and after 3 days generally less than 1% remained (Fidler, I. J. 1970. Metastasis: quantitative analysis of distribution and fate of tumor emboli labeled with 125I-5-iodo-2'-deoxyuridine. J. Natl. Cancer Inst., 45: 773-782; Weston, B. J., Carter, R. L., Eastry, G. C, Connell, D. I., Davies, A. J. C. 1974. The growth and metastasis of an allografted lymphoma in normal, deprived and reconstituted mice. Int. J. Cancer, 14: 176-185; Kodama, M., Kodama, T. 1975. Enhancing effect of hydrocortisone on hematogenous metastasis of Ehrlich ascites tumor in mice. Cancer Res., 35: 1015-1021, incorporated herein by reference). This decline is due to a rapid degeneration of cancer cells (Fidler, I. J. 1970. Metastasis: quantitative analysis of distribution and fate of tumor emboli labeled with 125I-5-iodo-2'-deoxyuridine. J. Natl. Cancer Inst., 45: 773-782; Roos, E., Dingemans, K. P. 1979. Mechanisms of metastasis. Biochim. Biophys. Acta, 560: 135-166, incorporated herein by reference). Therefore, the individual "average" cancer cell survives only a short time in the circulation. The successful metastatic cancer cells are able to find a largely unknown survival and escape route. Patients at high risk for metastatic disease could be better managed if gene expression patterns correlated with a clinical metastatic phenotype are identified. The methods of the present invention identify such gene expression patterns. Patients' tumor samples can be tested to see whether the gene expression pattern is associated with an increased risk of metastasis, and if so, the patients can be treated with more aggressive therapies to lower the risk of metastasis. As explained in greater detail below, the present invention provides for methods that allow identification of such gene expression patterns, and sample classification based on those patterns.
IV. Multivariate Analysis and Weighted Survival Predictor Score Analysis
The invention provides for identifying a subset of genes for use in predicting a phenotype in a subject by multivariate analysis.
In one embodiment, multivariate analysis is multivariate Cox analysis as described in Glinsky et al., 2005, J. Clin. Invest. 115:1503.
As used herein, "multivariate Cox analysis" refers to Cox proportional hazard survival regression analysis as performed by using the program presented at the world wide web at http://members.aol.com/johnp71/prophaz.html, and as described in Glinsky et al., 2005, J. Clin. Invest. 115:1503.
The invention also provides for implementation of a weighted survival score analysis.
Weighted survival score analysis reflects the incremental statistical power of individual covariates as predictors of therapy outcome based on a multicomponent prognostic model. For example, microarray-based or Q-RT-PCR-derived gene expression values are normalized and log-transformed on a base 10 scale. The log-transformed normalized expression values for each data set are analyzed in a multivariate Cox proportional hazard regression model, with overall survival or event-free survival as the dependent variable. To calculate the survival/prognosis predictor score for each patient, the log-transformed normalized gene expression value measured for each gene is multiplied by a coefficient derived from the multivariate Cox proportional hazard regression analysis, for example a relative weight coefficient, as defined herein. The final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the genes in the multivariate analysis. Negative weighting values indicate that higher expression correlates with longer survival and favorable prognosis, whereas positive weighting values indicate that higher expression correlates with poor outcome and shorter survival. Thus, the weighted survival predictor model is based on a cumulative score of the weighted expression values of all of the genes of a set of genes.
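The score calculation described above may be sketched as follows. The gene names, normalized expression values and weight coefficients are hypothetical placeholders; in practice the coefficients would come from a multivariate Cox proportional hazard regression fitted elsewhere, which this sketch does not perform.

```python
import math


def survival_predictor_score(normalized_expression, weights):
    """Sum of (weight x log10 normalized expression) over the gene set.

    `weights` stands in for the per-gene coefficients derived from a
    multivariate Cox regression (hypothetical values in this sketch).
    """
    return sum(
        weight * math.log10(normalized_expression[gene])
        for gene, weight in weights.items()
    )


# Hypothetical weights: a positive weight means higher expression goes with
# poorer outcome, a negative weight with longer survival.
weights = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical normalized expression values for one patient.
patient = {"GENE_A": 100.0, "GENE_B": 10.0}

score = survival_predictor_score(patient, weights)
# 0.5 * log10(100) + (-0.25) * log10(10) = 1.0 - 0.25
```

Patients can then be ranked or grouped by this cumulative score; any cut-off used for grouping would need to be chosen and validated separately.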
V. Survival Scores
The invention provides for an individual survival score for each member of a set of genes, calculated by multiplying the expression value or the logarithmically transformed expression value for each member of a set of genes by a relative weight coefficient or a correlation coefficient, as determined by multivariate Cox analysis. The invention also provides for a survival score, wherein a survival score is the sum of the individual survival scores for each member of a set of genes.
VI. Survival Analysis
Survival analysis refers to a method of verifying that a set of genes or a subset of genes according to the invention is "predictive", as defined herein, of a particular phenotype of interest.
Survival analysis includes but is not limited to Kaplan-Meier survival analysis.
In one embodiment, the Kaplan-Meier survival analysis is carried out using the Prism 4.0 software. Statistical significance of the difference between the survival curves for different groups of patients was assessed using Chi square and Logrank tests.
In another embodiment, the Kaplan-Meier survival analysis is carried out using GraphPad Prism version 4.00 software (GraphPad Software). The endpoint for survival analysis in prostate cancer is the biochemical recurrence defined by the serum prostate-specific antigen (PSA) increase after therapy. Disease-free interval is defined as the time period between the date of radical prostatectomy (RP) and the date of PSA relapse (for the recurrence group) or the date of last follow-up (for the non-recurrence group). Statistical significance of the difference between the survival curves for different groups of patients is assessed using chi-square and log-rank tests. To evaluate the incremental statistical power of the individual covariates as predictors of therapy outcome and unfavorable prognosis, both univariate and multivariate Cox proportional hazard survival analysis can be performed.
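The definition of the disease-free interval given above can be expressed as a short sketch. The function name, the use of days as the unit, and the example dates are assumptions made for illustration only.

```python
from datetime import date


def disease_free_interval(rp_date, psa_relapse_date=None, last_follow_up=None):
    """Return (interval_in_days, relapse_observed) for one patient.

    Recurrence group: interval runs from radical prostatectomy (RP) to PSA
    relapse. Non-recurrence group: interval runs from RP to last follow-up
    and the observation is treated as censored (relapse_observed is False).
    """
    relapse_observed = psa_relapse_date is not None
    end = psa_relapse_date if relapse_observed else last_follow_up
    if end is None:
        raise ValueError("need a PSA relapse date or a last follow-up date")
    return (end - rp_date).days, relapse_observed


# Hypothetical patients.
recurrent = disease_free_interval(date(2001, 1, 1), psa_relapse_date=date(2002, 1, 1))
non_recurrent = disease_free_interval(date(2001, 1, 1), last_follow_up=date(2001, 1, 31))
```

The resulting (interval, event) pairs are the usual input format for Kaplan-Meier and Cox analyses.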
The major mathematical complication with survival analysis is that one usually does not have the luxury of waiting until the very last subject has died of old age; the data normally have to be analyzed while some subjects are still alive. Also, some subjects may have moved away, and may be lost to follow-up. In both cases, the subjects were known to have survived for some amount of time (up until the time the one performing the analysis last saw them). However, the one performing the analysis may not know how much longer a subject might ultimately have survived. Several methods have been developed for using this "at least this long" information to prepare unbiased survival curve estimates, the most common being the Life Table method and the method of Kaplan and Meier Analysis, as defined herein.
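To illustrate how the "at least this long" (censored) observations enter the estimate, the following is a minimal sketch of the Kaplan-Meier product-limit calculation. It is not the GraphPad Prism implementation; the follow-up times and event flags are hypothetical.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival estimate.

    `observations` is a list of (time, event_observed) pairs, where
    event_observed is False for censored subjects (still alive or lost to
    follow-up). Returns a list of (time, survival_probability) steps, one
    per time at which at least one event occurred.
    """
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in observations}):
        events = sum(1 for time, seen in observations if time == t and seen)
        leaving = sum(1 for time, _ in observations if time == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering the curve.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (months); False marks a censored subject.
data = [(5, True), (8, False), (12, True), (12, False), (20, True)]
curve = kaplan_meier(data)
# Steps: 5 -> 4/5, 12 -> 4/5 * 2/3, 20 -> 0
```

Censored subjects contribute to the number at risk up to the time they were last seen, which is how the method uses their partial information without biasing the curve.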
VII. Use
The methods of the invention and the subset of genes of the invention are useful for identifying a subset of genes for use in predicting a phenotype of the invention, for example disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non-metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non-relapse, therapy failure and cure. The methods of the invention are also useful for predicting a phenotype in a subject.
For example, the gene subsets of the invention are useful for predicting the interval to disease recurrence, distant metastasis, and death after therapy.
The methods of the invention are useful for identifying markers of malignant phenotypes for diagnostic and prognostic purposes, as well as for drug discovery purposes.
VIII. Kits
The gene sets of the invention may be assembled into kits for use in predicting a phenotype in a subject. Kits according to this aspect of the invention comprise a carrier, such as a box, carton, tube or the like, having in close confinement therein one or more container means, such as vials, tubes, ampules, bottles and the like. The kits of the invention may also comprise (in the same or separate containers) one or more suitable buffers, one or more primers or probes, for example any of the probes identified in Figures 3-7 or Table 3, any probe that specifically binds to any of the genes presented in Figures 3-7 or Table 3, or any other reagents described for the present invention.
IX. EXAMPLES
Example 1 Generation of a Subset of Genes for Use in Predicting a Phenotype in a Subject
A subset of genes for use in predicting a phenotype in a subject can be generated as follows. A set of expression values for a set of genes are obtained for a first and a second sample by measuring the level of expression in the first and second samples according to any method known in the art and described herein.
Genes that are differentially expressed are identified by comparing the level of expression in the first sample with the level of expression in the second sample. An expression value that is increased or decreased in the first sample as compared to the second sample is differentially expressed.
A subset of genes for use in predicting a phenotype in a subject is identified by performing multivariate Cox analysis on the expression values for the set of differentially expressed genes. Multivariate Cox analysis is used to derive a relative weight coefficient for each member of the gene set. An individual survival score for each member of the gene set is calculated by multiplying the expression value by the relative weight coefficient for each member of the gene set. The sum of the individual survival scores is calculated to provide a survival score for the set of genes, or the subset of genes. A gene with a p-value less than or equal to a predetermined value, wherein the p-value is determined by multivariate Cox analysis, is included in the subset.
Example 2 Validation of the 11-gene Death From Cancer Signature
This example shows the application of the weighted survival score algorithm to the 11-gene death from cancer signature (Glinsky et al., JCI, 2005, 115:1503) for calculation of individual prognostic indices predicting the therapy outcome and survival of cancer patients. Figure 1 shows the Kaplan-Meier survival curves for 79 prostate cancer patients stratified into distinct sub-groups using a weighted survival predictor score algorithm. Figures 2A and 2B-2D show the Kaplan-Meier survival curves for breast cancer patients and ovarian cancer patients stratified into distinct sub-groups using a weighted survival predictor score algorithm.
Cox proportional hazards survival regression analysis
To ascertain the incremental statistical power of the individual covariates as predictors of therapy outcome and unfavorable prognosis, both univariate and multivariate Cox proportional hazard survival analyses (Table 4) were performed. Several individual gene members of the 11-gene signature, such as KI67 and Cyclin B1, have been described previously as significant predictors of cancer prognosis and may reflect a correlation between proliferative fraction and poor therapy outcome as has been shown recently for the lymphoma survival predictor signature. However, the analysis appears to indicate that the 11-gene signature is a more uniform therapy outcome predictor across the multiple data sets compared to the individual genes (see below) and, perhaps, is a better "integrator" and "sensor" of the biological diversity across the spectrum of human cancers.
Both univariate and multivariate Cox proportional hazard survival analyses were performed to compare the prognostic performance of the entire stemness signature and individual genes (Table 4A). In the univariate analysis, the prognostic performance of the KI67 expression as a predictor of therapy outcome varied in different outcome data sets. It was highly significant in the prostate cancer therapy outcome set 2 (MSKCC data set); however, it showed only a trend toward statistical significance in the prostate cancer outcome set 1 (P = 0.1; MIT data set) and breast cancer outcome data set (P = 0.0533). In prostate cancer, the significant prognosis predictors in univariate Cox regression analysis were KI67, ANK3, FGFR2, CES1, and the 11-gene MTTS/PNS signature. In breast cancer, the significant prognosis predictors in univariate analysis were Cyclin B1, BUB1, HEC, and the 11-gene signature. Thus, the analysis seems to indicate that individual genes demonstrate a variable performance across multiple outcome data sets and no single gene was identified that was uniformly predictive of the poor therapy outcome. In the multivariate analysis (Table 4), the most significant prostate cancer recurrence predictor was the model that included 11 covariates (the 11-gene signature; four individual genes (KI67; ANK3; FGFR2; CES1); and six clinico-pathological features (pre RP Gleason sum; surgical margins; seminal vesicle invasion; age; and extra-capsular extension)). Interestingly, several covariates such as the 11-gene signature, KI67, CES1, pre RP PSA level, surgical margins, and extra-capsular extension remained statistically significant prognostic markers in the multivariate analysis (Table 4B).
Thus, while prognostic performance of individual gene members of the 11-gene signature varied greatly in different outcome data sets, the identified 11-gene signature seems to perform as the most consistent predictor of poor therapy outcome across multiple independent outcome data sets comprising over 1,000 clinical samples and representing ten distinct types of human cancer. Yet the statistically best-performing multivariate cancer type-specific model seems to require a combination of calls based on expression levels of individual genes, a gene expression signature, and clinico-pathological covariates (Table 4). An alternative statistical metric was used to further evaluate the prognostic power of the genes comprising the 11-gene signature. The weighted survival score analysis was implemented to reflect the incremental statistical power of the individual covariates as predictors of therapy outcome based on a multi-component prognostic model (Figure 1). Figure 1 shows the Kaplan-Meier survival curves for 79 prostate cancer patients stratified into distinct sub-groups using weighted survival predictor score algorithm.
Final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the eleven genes in the multivariate analysis. The negative weighting values imply that higher expression correlates with longer survival and favorable prognosis, whereas the positive score values indicate that higher expression correlates with poor outcome and shorter survival. Application of the weighted survival predictor model based on a cumulative score of the weighted expression values of eleven genes confirmed the prognostic power of the identified 11-gene signature in stratification of prostate cancer patients into sub-groups with statistically distinct probability of relapse-free survival after radical prostatectomy (Figure 1).
Expression of the 11-gene MTTS/PNS signature predicts metastatic recurrence and poor survival after therapy in breast cancer and lung adenocarcinoma patients diagnosed with an early stage disease.
BMI-1 expression was previously implicated in human breast and lung cancers (Vonlanthen et al., 2001 Br. J. Cancer 84:1372; Dimri et al., 2002 Cancer Res. 62:4736; Raaphorst et al., 2003 Neoplasia 5:481), suggesting that activation of BMI-1-associated pathway(s) might be relevant for these types of carcinomas as well. It was determined if measurements of expression of the 11-gene MTTS/PNS signature would be informative in the prediction of the patient's prognosis in the group of 97 young women diagnosed with sporadic lymph-node-negative early stage breast cancer (this group comprises 46 patients who developed distant metastases within 5 years and 51 patients who continued to be disease-free at least 5 years after therapy; they constitute clinically defined poor prognosis and good prognosis groups, correspondingly) and analyzed in the recent expression profiling study of the early stage breast cancer (van't Veer et al., 2002 Nature 415:530).
Kaplan-Meier analysis indicates that breast cancer patients with tumors displaying a stem cell-like expression profile of the 11-gene signature have a significantly higher probability of developing distant metastases within 5 years after therapy and therefore can be identified as a poor prognosis sub-group (data not shown). Median metastasis-free survival after therapy in the poor prognosis sub-group of breast cancer patients defined by the 11-gene signature was 26 months. 84% of patients in the poor prognosis sub-group were diagnosed with distant metastasis within 5 years after therapy (data not shown). In contrast, 62% of patients in the good prognosis sub-group remained metastasis-free (data not shown). The estimated hazard ratio for metastasis-free survival after therapy in the poor prognosis sub-group as compared with the good prognosis sub-group of patients defined by the 11-gene signature was 3.762 (95% confidence interval of ratio, 3.421 to 20.27; P < 0.0001). Thus, the expression pattern of the 11-gene MTTS/PNS signature is strongly predictive of a short post-diagnosis and post-treatment interval to distant metastases in early stage breast cancer patients. It was determined if expression analysis of the 11-gene signature would be informative in patient stratification into sub-groups with distinct survival probability after therapy in the group of 125 patients diagnosed with lung adenocarcinoma (Bhattacharjee et al., 2001 Proc. Natl. Acad. Sci. USA 98:13790).
Figures 2A and 2B - 2D show the Kaplan-Meier survival curves for 97 breast cancer patients and ovarian cancer patients stratified into distinct sub-groups using weighted survival predictor score algorithm.
Similar to the prostate and breast cancer patients, the Kaplan-Meier analysis shows that patients with tumors displaying a stem cell-like expression profile of the 11-gene signature have significantly higher risk of death after therapy and therefore can be defined as a poor prognosis sub-group (data not shown). Median survival after therapy in the poor prognosis sub-group of lung adenocarcinoma patients defined by the 11-gene BMI-1-pathway signature was 15.2 months (data not shown). In contrast, the median survival after therapy in the good prognosis sub-group was 48.8 months. 100% of patients in the poor prognosis sub-group died within 3 years after therapy. Conversely, 58% of patients in the good prognosis sub-group remained alive (data not shown). The estimated hazard ratio for death after therapy in the poor prognosis sub-group as compared with the good prognosis sub-group of patients defined by the 11-gene signature was 3.589 (95% confidence interval of ratio, 2.910 to 46.67; P = 0.0005).
It was determined whether the 11-gene MTTS/PNS signature would be useful in defining sub-groups of patients diagnosed with an early stage lung adenocarcinoma and having a statistically significant difference in the survival probability after therapy. In the group of patients diagnosed with the stage IA lung adenocarcinoma (data not shown), the median survival after therapy in the poor prognosis sub-group defined by the 11-gene signature was 49.6 months. 53 % of patients in the poor prognosis sub-group died within 5 years after therapy. In contrast, 92 % of patients remained alive in the good prognosis subgroup (data not shown). The estimated hazard ratio for death after therapy in the poor prognosis sub-group as compared with the good prognosis sub-group of patients defined by the 11-gene signature was 8.909 (95% confidence interval of ratio, 1.418 to 13.12; P = 0.01).
Based on this analysis we concluded that detection of a stem cell-like expression profile of the 11-gene MTTS/PNS signature in primary tumors from patients diagnosed with the early stage prostate, breast, and lung carcinomas is associated with a high propensity toward metastatic dissemination and significantly higher risk of poor therapy outcome. Interestingly, therapy outcome in cancer patients diagnosed with other types of epithelial cancers such as ovarian and bladder cancers seems to manifest similar association with distinct patterns of expression of the 11-gene signature (Figures 2A-D and data not shown).
Methods
Clinical Samples. Expression profiling data of primary tumor samples obtained from 1122 cancer patients representing therapy outcome cohorts for 10 types of human cancer (Table 2) were analyzed in this study. Microarray analysis and associated clinical information for 32 clinical samples (23 primary prostate tumors and 9 distant metastatic lesions) utilized to delineate the expression profiles of human prostate cancer metastases were reported previously (LaTulippe et al., 2002 Cancer Res. 62:4499). Two clinical outcome sets comprising 21 (outcome set 1) and 79 (outcome set 2) samples were utilized for analysis of the association of the therapy outcome with distinct expression profiles of the 11-gene signature. Original gene expression profiles of the 21 clinical samples analyzed in this study were reported elsewhere (Singh et al., 2002 Cancer Cell 1:203). Primary gene expression data files of clinical samples as well as associated clinical information can be found on the world wide web at genome.wi.mit.edu/cancer/.
Prostate tumor tissues comprising the second clinical outcome set were obtained from 79 prostate cancer patients undergoing therapeutic or diagnostic procedures performed as part of routine clinical management at the Memorial Sloan-Kettering Cancer Center (MSKCC). Clinical and pathological features of the 79 prostate cancer cases comprising the validation outcome set are presented elsewhere (Glinsky et al., 2004 J. Clin. Invest. 113:913). Median follow-up after therapy in this cohort of patients was 70 months. Samples were snap-frozen in liquid nitrogen and stored at -80°C. Each sample was examined histologically using H&E-stained cryostat sections. Care was taken to remove nonneoplastic tissues from tumor samples. Cells of interest were manually dissected from the frozen block, trimming away other tissues. All of the studies were conducted under MSKCC Institutional Review Board-approved protocols.
Expression analysis data for tumor samples obtained from 125 lung adenocarcinoma patients analyzed in this study as well as associated clinical information were reported elsewhere (Bhattacharjee et al., 2001 Proc. Natl. Acad. Sci. USA 98:13790). Original work describing gene expression profiles of the early stage breast cancer set of 97 clinical samples analyzed in this study was reported elsewhere (van 't Veer et al., 2002 supra). Primary gene expression data files of clinical samples as well as associated clinical information can be found on the world wide web at rii.com/publications/2002/vantveer.htm. To date our analysis includes 1122 therapy outcome samples from patients diagnosed with ten distinct types of cancers (Table 2): prostate cancer (100 patients); breast cancer (97 patients); lung adenocarcinoma (211 patients); ovarian cancer (50 patients); bladder cancer (31 patients); diffuse large B-cell lymphoma (DLBCL, 298 patients); mantle cell lymphoma (MCL, 92 patients); mesothelioma (17 patients); medulloblastoma (60 patients); glioma (50 patients); acute myeloid leukemia (AML, 116 patients).
Cell Culture. Cell lines used in this study were previously described (Glinsky et al., 2003 Mol. Carcinog. 37:209). The LNCaP- and PC-3-derived cell lines were developed by consecutive serial orthotopic implantation, either from metastases to the lymph node (for the LN series), or reimplanted from the prostate (Pro series). This procedure generated cell variants with differing tumorigenicity, frequency and latency of regional lymph node metastasis (Glinsky et al., 2003 Mol. Carcinog. 37:209). Except where noted, cell lines were grown in RPMI 1640 supplemented with 10% FBS and gentamycin (Gibco BRL) to 70-80% confluence and subjected to serum starvation as described (Glinsky et al., 2003 Mol. Carcinog. 37:209), or maintained in fresh complete media, supplemented with 10% FBS.
Anoikis assay. Cells were harvested by 5-min digestion with 0.25% trypsin/0.02% EDTA (Irvine Scientific, Santa Ana, CA, USA), washed and resuspended in serum free medium. Cells at a concentration of 1.7 × 10^5 cells/well in 1 ml of serum free medium were plated in 24-well ultra low attachment polystyrene plates (Corning Inc., Corning, NY, USA) and incubated at 37°C and 5% CO2 overnight. Viability of cell cultures subjected to anoikis assays was > 95% in the Trypan blue dye exclusion test.
Apoptosis assay. Apoptotic cells were identified and quantified using the Annexin V-FITC kit (BD Biosciences Pharmingen, world wide web at bdbisciences.com) per manufacturer instructions. The following controls were used to set up compensation and quadrants: 1) Unstained cells; 2) Cells stained with Annexin V-FITC (no PI); 3) Cells stained with PI (no Annexin V-FITC). Each measurement was carried out in quadruplicate and each experiment was repeated at least twice. Annexin V-FITC positive cells were scored as early apoptotic cells; both Annexin V-FITC and PI positive cells were scored as late apoptotic cells; unstained Annexin V-FITC and PI negative cells were scored as viable or surviving cells. In selected experiments apoptotic cell death was documented using the TUNEL assay.
Flow cytometry. Cells were washed in cold phosphate-buffered saline (PBS) and stained according to manufacturer's instructions using the Annexin V-FITC Apoptosis Detection Kit (BD Biosciences, San Jose, CA, USA). Flow analysis was performed on a FACS Calibur instrument (BD Biosciences, San Jose, CA, USA). Cell Quest Software was used for data acquisition and analysis. All measurements were performed under the same instrument setting, analyzing 10^3 to 10^4 cells per sample.
Orthotopic xenografts. Orthotopic xenografts of human prostate PC-3 cells and sublines used in this study were developed by surgical orthotopic implantation as previously described (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). Briefly, 2 × 10^6 cultured PC-3 cells, PC-3M or PC-3MLN4 sublines were injected subcutaneously into male athymic mice, and allowed to develop into firm palpable and visible tumors over the course of 2 - 4 weeks. Intact tissue was harvested from a single subcutaneous tumor and surgically implanted in the ventral lateral lobes of the prostate gland in a series of six athymic mice per cell line subtype as described earlier (Glinsky et al., 2003 Mol. Carcinog. 37:209).
Transgenic mouse model of prostate cancer. A breeding colony of TRAMP (transgenic adenocarcinoma of the mouse prostate) mice is maintained on a C57BL/6 background in the Animal Care Facility at the Sidney Kimmel Cancer Center (Baron et al., 2003 Oncogene 22:4194). The TRAMP mouse colony is based on a breeding pair of TRAMP mice kindly provided by Norman Greenberg (Baylor College of Medicine, Houston, TX). A standard PCR assay was carried out for monitoring the presence of the SV40 large T-antigen in new litters. Twenty-one PCR-confirmed male TRAMP mice were designated for the microarray analysis carried out in this study. Animals were killed at different ages according to the established time course of the disease progression (Gingrich et al., 1996 Cancer Res. 56:4096); prostates as well as primary and metastatic tumors were immediately removed and snap frozen in liquid nitrogen. Prostate tissues from age-matched wild-type C57BL/6 mice served as control samples in a microarray analysis of the TRAMP model of prostate cancer. Necropsies with gross microscopic examination were carried out. All procedures were performed under IACUC approved protocols following Standard Operating Procedures in accordance with the NIH Guide for the Care and Use of Laboratory Animals.
Tissue processing for mRNA and RNA isolation. Fresh frozen orthotopic and transgenic primary tumors, metastases, and mouse prostates were examined by use of hematoxylin and eosin stained frozen sections. Orthotopic tumors of all sublines exhibited similar morphology consisting of sheets of monotonous closely packed tumor cells with little evidence of differentiation interrupted by only occasional zones of largely stromal components, vascular lakes, or lymphocytic infiltrates. Fragments of tumor judged free of these non-epithelial clusters were used for mRNA preparation. Frozen tissue (1 - 3 mm x 1 - 3 mm) was submerged in liquid nitrogen in a ceramic mortar and ground to powder. The frozen tissue powder was dissolved and immediately processed for mRNA isolation using a FastTract kit for mRNA extraction (Invitrogen, Carlsbad, CA, see above) according to the manufacturer's instructions.
RNA and mRNA extraction. For gene expression analysis, cells were harvested in lysis buffer 2 hrs after the last media change at 70-80% confluence and total RNA or mRNA was extracted using the RNeasy (Qiagen, Chatsworth, CA) or FastTract kits (Invitrogen, Carlsbad, CA). Cell lines were not split more than 5 times prior to RNA extraction, except where noted.
Affymetrix arrays. The protocol for mRNA quality control and gene expression analysis was that recommended by Affymetrix (on the world wide web at affymetrix.com). In brief, approximately one microgram of mRNA was reverse transcribed with an oligo(dT) primer that has a T7 RNA polymerase promoter at the 5' end. Second strand synthesis was followed by cRNA production incorporating a biotinylated base. Hybridization to Affymetrix U95Av2 arrays representing 12,625 transcripts overnight for 16 h was followed by washing and labeling using a fluorescently labeled antibody. The arrays were read and data processed using Affymetrix equipment and software as reported previously (LaTulippe et al., 2002 Cancer Res. 62:4499; Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209).
Data analysis. Detailed protocols for data analysis and documentation of the sensitivity, reproducibility and other aspects of the quantitative statistical microarray analysis using Affymetrix technology have been reported (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). 40-50% of the surveyed genes were called present by the Affymetrix Microarray Suite 5.0 software in these experiments. The concordance analysis of differential gene expression across the data sets was performed using Affymetrix MicroDB v. 3.0 and DMT v. 3.0 software as described earlier (LaTulippe et al., 2002 Cancer Res. 62:4499; Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). The microarray data were processed using the Affymetrix Microarray Suite v. 5.0 software, and statistical analysis of expression data sets was performed using the Affymetrix MicroDB and Affymetrix DMT software. The Pearson correlation coefficient for individual test samples and the appropriate reference standard was determined using the Microsoft Excel and the GraphPad Prism version 4.00 software. The significance of the overlap between the lists of stem cell-associated and prostate cancer-associated genes was calculated by using the hypergeometric distribution test (Tavazoie et al., 1999, Nat. Genet. 22:281). The analytical protocol of identification and validation of the 11-gene BMI-1-pathway signature is described below. The Multiple Experiments Viewer (MEV) software version 3.0.3 of the Institute for Genomic Research (TIGR) was used for Support Vector Machine (SVM) classification and terrain (TRN) clustering algorithm data analysis and visualization.
Protocol of discovery and validation of the 11-gene BMI-1-pathway signature.
It was determined if expression profiles of transcripts activated and suppressed in prostate cancer metastases would recapitulate the expression profile of the BMI-1-regulated genes in neural stem cells by comparing the sets of differentially regulated genes in search for union/intersections of lists for both up- and down-regulated transcripts. Thus, according to this model the primary criterion in transcript selection process should be the concordance of changes in expression rather than a magnitude of changes (e.g., fold change). One of the predictions of this model is that transcripts of interest would be expected to have a tightly controlled "rank order" of expression within a cluster of co-regulated genes reflecting a balance of up- and down-regulated mRNAs as a desired regulatory end-point in a cell. A degree of resemblance of the transcript abundance rank order within a gene cluster between a test sample and reference standard is measured by a Pearson correlation coefficient and designated as a phenotype association index (PAI). Samples with stem cell-resembling expression profiles (stem cell-like PAI or SPAI) are expected to have positive values of Pearson correlation coefficients. Detailed prognostic signature identification and validation protocol is described (Data not shown).
Step 1. Sets of differentially regulated transcripts were independently identified for distant metastatic lesions and primary prostate tumors versus age-matched control samples in a transgenic TRAMP mouse model of metastatic prostate cancer (MTTS signature) as well as PNS (PNS signature) and CNS (CNS signature) neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds using the Affymetrix microarray processing and statistical analysis software package (Affymetrix MAS 5.0; MicroDB™ Ver 3.0 and DMT 3.0 software) as described herein and in previous publications (Glinsky et al., 2004, J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). Transcripts with negative signal intensity values in both experimental and control sets were eliminated from further consideration. At least two-fold changes of the mRNA abundance levels in experimental versus control samples for both up-regulated and down-regulated genes were required for inclusion in the lists of differentially regulated transcripts. Fold expression changes of the mRNA abundance levels for each transcript were calculated as ratios of the average intensity values for a given transcript in experimental versus control samples for both up-regulated and down-regulated genes and log10 transformed for further analysis. Thus, this analytical step defined three large parent signatures (Data not shown): MTTS signature comprising 868 up-regulated and 477 down-regulated transcripts; PNS signature comprising 885 up-regulated and 1088 down-regulated transcripts; and CNS signature comprising 769 up-regulated and 778 down-regulated transcripts.
Step 2. Sub-sets of transcripts exhibiting concordant expression changes in metastatic TRAMP tumor samples (MTTS signature) as well as PNS (PNS signature) and CNS (CNS signature) neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds were identified. Concordant lists of transcripts were obtained by intersections of the two lists of up-regulated and the two lists of down-regulated genes. Thus, two concordant sub-sets of transcripts were identified corresponding to each binary comparison of metastatic TRAMP tumors and neural stem cell samples in a state of PNS and CNS neurospheres (141 up-regulated and 58 down-regulated transcripts for PNS neurospheres (r = 0.7593; P < 0.0001) and 40 up-regulated and 24 down-regulated for CNS neurospheres (r = 0.7679; P < 0.0001)). A third concordant subset of 27 genes comprising 15 up-regulated and 12 down-regulated transcripts was selected for intersection common for all three signatures (r = 0.8002; P < 0.0001).
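The list intersections of Step 2 can be sketched as plain set operations. This is only an illustrative sketch: the function name and the transcript identifiers below are hypothetical and are not taken from the study, which used the Affymetrix MicroDB and DMT software for this step.

```python
def concordant_subsets(up_a, down_a, up_b, down_b):
    """Return the transcripts that change in the same direction in both signatures.

    Up-regulated lists are intersected with each other, and down-regulated
    lists with each other, so direction of change is preserved.
    """
    return set(up_a) & set(up_b), set(down_a) & set(down_b)


# Hypothetical transcript identifiers, for illustration only.
mtts_up, mtts_down = {"t1", "t2", "t3"}, {"t7", "t8"}
pns_up, pns_down = {"t2", "t3", "t4"}, {"t8", "t9"}
cns_up, cns_down = {"t3", "t5"}, {"t8"}

# Binary comparison: MTTS versus PNS.
pns_concordant_up, pns_concordant_down = concordant_subsets(
    mtts_up, mtts_down, pns_up, pns_down
)

# Intersection common to all three signatures.
common_up, common_down = concordant_subsets(
    pns_concordant_up, pns_concordant_down, cns_up, cns_down
)
```

A transcript that is up-regulated in one signature and down-regulated in the other never enters either concordant list, which matches the concordance criterion described above.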
Step 3. Selection of small gene clusters was performed from sub-sets of genes exhibiting concordant changes of transcript abundance behavior in metastatic TRAMP tumor samples and PNS and CNS neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds. Expression profiles were presented as log10 average fold changes for each transcript and processed for visualization and Pearson correlation analysis using Microsoft Excel software. For the concordant differentially expressed genes, vectors of log10 average fold change were determined for both experimental settings and the correlation between the two vectors was computed. Practical considerations essential for future development of genetic diagnostic tests prompted selection, from the concordant gene sets, of small gene expression signatures comprising transcripts with a high level of expression correlation in metastatic cancer cells and stem cells. The concordant list of differentially expressed genes was reduced by removing from the list genes whose removal led to the largest increase in the correlation coefficient. The reduction in the signature transcript number was terminated when further elimination of a transcript did not increase the value of the Pearson correlation coefficient. Cut-off criterion for signature reduction was arbitrarily set to exceed a Pearson correlation coefficient of 0.95 (P < 0.0001). Using this approach a single candidate prognostic gene expression signature was selected for each intersection of the MTTS signature and parent stem cell signatures (data not shown).
Thus, three highly concordant small signatures were identified corresponding to three concordant sub-sets of genes defined in the Step 2 (a set of 11 genes comprising 8 up-regulated and 3 down-regulated transcripts for PNS neurospheres, 11-gene MTTS/PNS signature; a set of 11 genes comprising 7 up-regulated and 4 down-regulated transcripts for CNS neurospheres, 11-gene MTTS/CNS signature; and a set of 14 genes comprising 8 up-regulated and 6 down-regulated transcripts, MTTS/PNS/CNS signature).
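The signature reduction of Step 3 reads as a greedy backward elimination, and a minimal sketch under that reading follows. The function names, the `min_size` guard, and the toy vectors are assumptions made for illustration; the study itself performed the correlation analysis in Microsoft Excel.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reduce_signature(genes, x, y, min_size=3):
    """Greedy backward elimination on two log10 fold-change vectors.

    At each round the gene whose removal gives the largest increase in r is
    dropped; the loop stops when no removal increases r. `min_size` is a
    hypothetical guard against shrinking to a trivially correlated pair.
    """
    idx = list(range(len(genes)))
    r = pearson([x[i] for i in idx], [y[i] for i in idx])
    while len(idx) > min_size:
        best_r, best_i = r, None
        for i in idx:
            rest = [j for j in idx if j != i]
            r_i = pearson([x[j] for j in rest], [y[j] for j in rest])
            if r_i > best_r:
                best_r, best_i = r_i, i
        if best_i is None:
            break
        idx.remove(best_i)
        r = best_r
    return [genes[i] for i in idx], r


# Toy data: gene "e" is discordant between the two settings.
genes = ["a", "b", "c", "d", "e"]
metastasis_lfc = [1.0, 2.0, 3.0, 4.0, 5.0]
stem_cell_lfc = [1.0, 2.0, 3.0, 4.0, -5.0]
kept, r_final = reduce_signature(genes, metastasis_lfc, stem_cell_lfc)
```

The 0.95 cut-off mentioned in the text could be applied afterwards by checking the returned coefficient against it.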
Step 4. The small signatures (one 11-gene signature for the PNS set, one 11-gene signature for the CNS set, and one 14-gene signature for common PNS/CNS set) identified in Step 3 were tested for metastatic phenotype discriminative power (using one mouse prostate cancer data set and one human prostate cancer data set comprising primary and metastatic tumors) and therapy outcome classification performance (using human prostate cancer therapy outcome set 1). Three identified small signatures were evaluated for their ability to discriminate metastatic and primary prostate tumors in a TRAMP mouse model of prostate cancer, clinical samples of 9 metastatic versus 23 primary prostate tumors as well as primary prostate tumors from 21 patients with distinct outcome after the therapy (8 recurrent and 13 non-recurrent samples). To assess a potential diagnostic and prognostic relevance of small signatures, a Pearson correlation coefficient was calculated for each individual tumor sample by comparing the expression profiles of individual samples to the reference expression profile in either PNS or CNS neurospheres in BMI-1+/+ versus BMI-1-/- backgrounds. Fold expression changes in individual clinical samples were calculated for each gene as a ratio of the expression value in a given sample to the average expression value of the gene across the entire data set of clinical samples. For each data set, the vector X̄ of average gene expression was computed, and then the vector of ratios, R = X/X̄ (the expression vector X of a sample divided gene by gene by X̄), was computed for each sample. The relative expression vectors R were log10-transformed and correlated with the fixed vector of gene expression arising from Step 3. Negative expression values were treated as missing data.
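The per-sample calculation in Step 4 can be sketched as follows. The names `spai`, `dataset_mean` and `reference`, and the numeric values, are hypothetical. Treating zero as missing in addition to negative values is an assumption made here because the logarithm is undefined at zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spai(sample, dataset_mean, reference):
    """Correlate log10(sample / dataset mean) with a fixed reference profile.

    `sample` and `dataset_mean` hold expression values per signature gene;
    `reference` holds the log10 fold changes of the reference standard.
    Genes with non-positive values are skipped as missing data.
    """
    pairs = [
        (math.log10(s / m), ref)
        for s, m, ref in zip(sample, dataset_mean, reference)
        if s > 0 and m > 0
    ]
    log_ratios, refs = zip(*pairs)
    return pearson(log_ratios, refs)


# Hypothetical four-gene example.
dataset_mean = [100.0, 100.0, 100.0, 100.0]
reference = [0.3, -0.3, 0.0, 0.6]
stem_like = spai([200.0, 50.0, 100.0, 400.0], dataset_mean, reference)
opposite = spai([50.0, 200.0, 100.0, 25.0], dataset_mean, reference)
with_missing = spai([200.0, -5.0, 100.0, 400.0], dataset_mean, reference)
```

A sample whose relative expression follows the reference profile gives a coefficient near +1, and a sample with the inverse pattern gives a coefficient near -1.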
Based on expected correlation of expression profiles of identified gene clusters with stem cell-like expression profiles, the corresponding correlation coefficients calculated for individual samples were given the identifier of the stem cell-resembling phenotype association indices (SPAIs). The prognostic power of identified small signatures was evaluated based on their ability to discriminate the metastatic versus primary tumors (criterion 1) and to segregate the patients with recurrent and non-recurrent prostate tumors into distinct sub-groups (criterion 2). A single best performing small signature was selected for subsequent validation analysis (data not shown). Based on diagnostic and prognostic classification performance, a single best performing 11-gene MTTS/PNS signature was selected for further validation analysis (Data not shown).
Step 5. To assess the incremental statistical power of the individual genetic and clinical covariates as predictors of therapy outcome and unfavorable prognosis in prostate cancer patients, both univariate and multivariate Cox proportional hazard survival analyses were performed (Table 4).
Step 6. To validate a survival prediction model based on the 11-gene MTTS/PNS signature, the prognostic performance of the model in the multiple independent therapy outcome data sets representing five epithelial and five non-epithelial cancers was tested. The patients were divided within individual cohorts into a training set, which was used for the cut-off threshold selection and to test the model, and a test set, which was used to evaluate the reproducibility of the classification performance. The training set was used to select the prognosis discrimination cut-off value for a signature based on the highest level of statistical significance in patient stratification into poor and good prognosis groups as determined by the log-rank test (lowest P value and highest hazard ratio in the training set).
Clinical samples having the Pearson correlation coefficient at or higher than the cut-off value were identified as having the poor prognosis signature. Clinical samples with the Pearson correlation coefficient below the cut-off value were identified as having the good prognosis signature. Each training set was used to estimate a threshold of the correlation coefficients before performing a survival analysis. The same discrimination cut off value was then applied to evaluate the reproducibility of the prognostic performance in the test set of patients. Lastly, the model was applied to the entire outcome set using the same cut off threshold to confirm the classification performance. The average gene expression vectors were computed for each gene and applied separately on the training, test, and the combined data sets. The training and test sets were balanced with respect to the total number of patients, negative and positive therapy outcomes, and the length of survival. For breast cancer data set, the patients' distribution among training and test data sets described in the original publication (van 't Veer, LJ. et al., 2002, Nature 415:530) were maintained. At this stage of the analysis, additional model training, development or optimization steps, with the exception of a prognostic cut off threshold selection in a training set were not carried out. The same MTTS/PNS expression profile was consistently used throughout the study as a reference standard to quantify the Pearson correlation coefficients of the individual samples.
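The cut-off rule of Step 6 can be illustrated with a short sketch. The sample names, correlation values and cut-off below are hypothetical; in the protocol the cut-off would be chosen on the training set by the log-rank criterion, which this sketch does not implement.

```python
def classify_by_cutoff(correlation_by_sample, cutoff):
    """Label a sample 'poor' when its correlation coefficient is at or above
    the cut-off, and 'good' when it is below the cut-off."""
    return {
        sample: ("poor" if r >= cutoff else "good")
        for sample, r in correlation_by_sample.items()
    }


# Hypothetical coefficients, for illustration only.
training = {"T1": 0.62, "T2": -0.15, "T3": 0.30}
test = {"V1": 0.30, "V2": 0.29}

cutoff = 0.30  # assumed value; selected once on the training set
training_calls = classify_by_cutoff(training, cutoff)
test_calls = classify_by_cutoff(test, cutoff)  # same cut-off, no re-tuning
```

Reusing the single training-set cut-off on the test set and on the combined set is what makes the test-set result a check of reproducibility rather than a second fit.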
Step 7. The model performance was tested using various sample stratification approaches such as terrain (TRN) clustering (data not shown), support vector machine (SVM) classification (data not shown), and the weighted survival score algorithm (Figures 1 and 2A). The therapy outcome predictive power of the 11-gene model in the prostate cancer setting was evaluated using a prognostic test based on an independent method of gene expression analysis, namely the quantitative reverse-transcription polymerase chain reaction (Q-RT-PCR) method (data not shown).
SPAI Index. Definition of the Pearson correlation coefficient as a phenotype association index (stem cell-resembling phenotype association indices (SPAIs)) is based on highly concordant behavior of the 11-gene signature between neural stem cells in the state of PNS neurospheres and prostate cancer metastasis (r = 0.9897; P < 0.0001, data not shown). Standard PNS neurosphere and TRAMP metastasis values were established (data not shown). They were used as uniform reference standards for measurements of Pearson correlation coefficients for clinical samples consistently throughout the study. A degree of resemblance of the transcript abundance rank order within a gene cluster between a test sample and reference standard is measured by a Pearson correlation coefficient and designated as a phenotype association index (PAI). Samples with stem cell-resembling expression profiles (stem cell-like PAI or SPAI) are expected to have positive values of Pearson correlation coefficients.
Random co-occurrence test. A 10,000-permutation test was performed to check the likelihood that small 11-gene signatures derived from the large MTTS signature would display high discrimination power, assessing the significance at the 0.1% level. The sample stratification power of 10,000 permutations of small 11-gene signatures derived from the large 1345-gene MTTS signature was compared to the 11-gene MTTS/PNS signature. The classification performance cut-off p-values were established by applying a two-tailed T-test to the 11-gene MTTS/PNS signature (p = 0.0005 for the metastasis versus primary prostate cancer data set and p = 0.026 for the recurrent versus non-recurrent prostate cancer data set). Random concordant gene sets comprising ~200 transcripts were generated using a mouse transcriptome data set representing expression profiling data of ~12,000 transcripts across 45 normal tissues (Su et al., 2002 Proc. Natl. Acad. Sci. USA 99:4465). Inter- and intra-species array to array probe set match was performed at 95% or greater identity level using the Affymetrix data base (available on the world wide web at affymetrix.com). To assess discrimination of random 11-gene signatures derived from the 1345-gene MTTS signature, a two-tailed T-test was carried out for the metastatic versus primary prostate cancer data set (32 samples) and the recurrent versus non-recurrent prostate cancer data set (21 samples). The signatures were ranked based on p-values, and ranking metrics of each random 11-gene signature were compared to the 11-gene MTTS/PNS signature p-values. 10,000 permutations were found to generate 7 random 11-gene signatures performing at the sample classification level of the 11-gene MTTS/PNS signature.
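The counting logic of the random co-occurrence test can be sketched as below. The function name, the seed, and the toy scoring function are assumptions for illustration; in the study the score of a signature was a two-tailed T-test p-value on the clinical data sets, which is not reproduced here. With the reported 7 matching signatures out of 10,000, the empirical fraction is 7/10,000 = 0.07%, which is below the 0.1% level.

```python
import random


def random_signature_test(pool, size, score, observed, n_perm=10_000, seed=0):
    """Draw `n_perm` random signatures of `size` genes from `pool` and count
    those whose score is at least as good (here: as low) as `observed`.

    Returns the count and the empirical fraction.
    """
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_perm) if score(rng.sample(pool, size)) <= observed
    )
    return hits, hits / n_perm


def toy_score(signature):
    """Hypothetical stand-in for a p-value: the sum of the drawn gene indices."""
    return sum(signature)


parent_signature = list(range(1345))  # indices standing in for 1345 transcripts

# No random draw can score below zero, so an observed score of -1 is never matched.
hits_none, fraction_none = random_signature_test(
    parent_signature, 11, toy_score, observed=-1, n_perm=1000
)

# Every random draw matches an observed score that is trivially easy to reach.
hits_all, fraction_all = random_signature_test(
    parent_signature, 11, toy_score, observed=10**9, n_perm=1000
)

reported_fraction = 7 / 10_000  # value reported in the text
```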
Weighted survival predictor score algorithm. The weighted survival score analysis was implemented to reflect the incremental statistical power of the individual covariates as predictors of therapy outcome based on a multi-component prognostic model. The microarray-based or Q-RT-PCR-derived gene expression values were normalized and log-transformed on a base 10 scale. The log-transformed normalized expression values for each data set were analyzed in a multivariate Cox proportional hazards regression model, with overall survival or event-free survival as the dependent variable. To calculate the survival/prognosis predictor score for each patient, the log-transformed normalized gene expression value measured for each gene was multiplied by a coefficient derived from the multivariate Cox proportional hazard regression analysis. Final survival predictor score comprises a sum of scores for individual genes and reflects the relative contribution of each of the eleven genes in the multivariate analysis. The negative weighting values indicate that higher expression correlates with longer survival and favorable prognosis, whereas the positive score values indicate that higher expression correlates with poor outcome and shorter survival. Thus, the weighted survival predictor model is based on a cumulative score of the weighted expression values of eleven genes. For example, the following equation describes the relapse-free survival predictor score for prostate cancer patients (Table 4): relapse-free survival score = (-0.403 × Gbx2) + (1.2494 × KI67) + (-0.3105 × Cyclin B1) + (-0.1226 × BUB1) + (0.0077 × HEC) + (0.0369 × KIAA1063) + (-1.7493 × HCFC1) + (-1.1853 × RNF2) + (1.5242 × ANK3) + (-0.5628 × FGFR2) + (-0.4333 × CES1).
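The prostate cancer equation above can be evaluated with a few lines of code. The coefficients are those quoted in the text; the function name and the example expression values are hypothetical, and the inputs are assumed to be already normalized and log10-transformed as the paragraph describes.

```python
# Cox regression coefficients quoted in the text for prostate cancer.
COEFFICIENTS = {
    "Gbx2": -0.403,
    "KI67": 1.2494,
    "Cyclin B1": -0.3105,
    "BUB1": -0.1226,
    "HEC": 0.0077,
    "KIAA1063": 0.0369,
    "HCFC1": -1.7493,
    "RNF2": -1.1853,
    "ANK3": 1.5242,
    "FGFR2": -0.5628,
    "CES1": -0.4333,
}


def relapse_free_survival_score(log_expression):
    """Sum of coefficient × log10 normalized expression over the eleven genes.

    Raises KeyError if a gene is missing from `log_expression`.
    """
    return sum(
        weight * log_expression[gene] for gene, weight in COEFFICIENTS.items()
    )


# Hypothetical patients, for illustration only.
baseline = {gene: 0.0 for gene in COEFFICIENTS}
high_ki67 = dict(baseline, KI67=1.0)   # tenfold higher KI67 on the log10 scale
all_tenfold = {gene: 1.0 for gene in COEFFICIENTS}

score_baseline = relapse_free_survival_score(baseline)
score_high_ki67 = relapse_free_survival_score(high_ki67)
score_all_tenfold = relapse_free_survival_score(all_tenfold)
```

Because KI67 carries a positive weight, raising its expression alone raises the score, in line with the statement that positive weights mark genes whose higher expression correlates with poor outcome.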
BMI-1 siRNA experiments. The target siRNA SMART pools for BMI-1 and control luciferase siRNAs were purchased from Dharmacon Research, Inc. They were transfected into PC-3-32 human prostate carcinoma cells according to the manufacturer's protocols. Cell cultures were continuously monitored for growth and viability and assayed for mRNA expression levels of BMI-1 and a selected set of genes (Table 2 and Figure 2) using RT-PCR and Q-RT-PCR methods.
Quantitative RT-PCR analysis. The real-time PCR method measures the accumulation of PCR products with a fluorescence detector system and allows quantification of the amount of amplified PCR product in the log phase of the reaction. Total RNA was extracted using the RNeasy mini-kit (Qiagen, Valencia, CA, USA) following the manufacturer's instructions. A measure of 1 μg (tumor samples), or 2 μg and 4 μg (independent preparations of reference cDNA samples), of total RNA was then used as a template for cDNA synthesis with SuperScript II (Invitrogen, Carlsbad, CA, USA). Q-PCR primer sequences were selected for each cDNA with the aid of Primer Express™ software (Applied Biosystems, Foster City, CA, USA). PCR amplification was performed with gene-specific primers (sequences presented in supplemental material for Glinsky et al., 2005 J. Clin. Invest. 115:1503).
Q-PCR reactions and measurements were performed with SYBR Green, using ROX as a passive reference, on the ABI 7900 HT Sequence Detection System (Applied Biosystems, Foster City, CA, USA). Conditions for the PCR were as follows: one cycle of 10 min at 95°C; 40 cycles of 0.20 min at 94°C, 0.20 min at 60°C and 0.30 min at 72°C. The results were normalized to the relative amount of expression of the endogenous control gene GAPDH.
Expression of messenger RNA (mRNA) for eleven genes and an endogenous control gene (GAPDH) was measured by real-time PCR on an ABI PRISM 7900 HT Sequence Detection System (Applied Biosystems) in twenty specimens of primary prostate cancer obtained from patients with documented PSA recurrence within five years after RP and from patients who remained disease-free for at least five years after RP (ten patients in each group). For each gene, at least two sets of primers were tested and the set-up with the highest amplification efficiency was selected for the assay used in this study. Specificity of the assay for mRNA measurements was confirmed by the absence of the expected PCR products when genomic DNA was used as a template. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH: 5'-CCCTCAACGACCACTTTGTCA-3' and 5'-TTCCTCTTGTGCTCTTGCTGG-3') was used as the endogenous RNA and cDNA quantity normalization control. For calibration and generation of standard curves, several reference cDNAs were used: cDNA prepared from primary in vitro cultures of normal human prostate epithelial cells (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209), cDNA derived from the PC-3M human prostate carcinoma cell line (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209), and cDNA prepared from normal human prostate (Glinsky et al., 2004 J. Clin. Invest. 113:913; Glinsky et al., 2003 Mol. Carcinog. 37:209). Expression of all genes was assessed in two independent experiments using reference cDNAs to control for variations among different Q-RT-PCR experiments. Prior to statistical analysis, the normalized gene expression values were log-transformed (on a base 10 scale), similarly to the transformation of the array-based gene expression data.
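The normalization and log-transformation step can be sketched in Python as follows. The exact normalization arithmetic is not spelled out in the text, so this sketch assumes that each gene's relative quantity (as read from a standard curve) is divided by the GAPDH quantity of the same sample before the base 10 logarithm is taken. The function and parameter names are hypothetical.

```python
import math


def normalized_log_expression(target_quantity, gapdh_quantity):
    """Return log10 of a target gene quantity normalized to GAPDH.

    Assumption: both inputs are positive relative quantities for the same
    sample, e.g. interpolated from a reference cDNA standard curve.
    """
    if target_quantity <= 0 or gapdh_quantity <= 0:
        raise ValueError("quantities must be positive to take a logarithm")
    return math.log10(target_quantity / gapdh_quantity)
```

For example, a target quantity of 100 measured alongside a GAPDH quantity of 10 gives a normalized ratio of 10 and a log-transformed value of 1.0.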
Survival analysis. Kaplan-Meier survival analysis was carried out using the GraphPad Prism version 4.00 software (GraphPad Software, San Diego, CA; available on the world wide web at graphpad.com). The end point for survival analysis in prostate cancer was the biochemical recurrence defined by the serum PSA increase after therapy. Disease-free interval (DFI) was defined as the time period between the date of radical prostatectomy (RP) and the date of PSA relapse (recurrence group) or date of last follow-up (non-recurrence group). Statistical significance of the difference between the survival curves for different groups of patients was assessed using Chi square and Log-rank tests. To evaluate the incremental statistical power of the individual covariates as predictors of therapy outcome and unfavorable prognosis, both univariate and multivariate Cox proportional hazard survival analyses were performed.
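The Kaplan-Meier estimate itself is simple to state, and the following Python sketch shows the calculation under the definitions above. It is an illustration only and does not reproduce the GraphPad Prism implementation. Here `durations` holds each patient's disease-free interval, and `events` is True for a PSA relapse and False for a patient censored at last follow-up; both names are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        time = pairs[i][0]
        relapses = 0
        removed = 0
        # Group all patients sharing the same time point.
        while i < len(pairs) and pairs[i][0] == time:
            if pairs[i][1]:
                relapses += 1
            removed += 1
            i += 1
        if relapses:
            # Product-limit step: fraction of at-risk patients who did not relapse.
            survival *= 1 - relapses / at_risk
            curve.append((time, survival))
        # Relapsed and censored patients both leave the risk set.
        at_risk -= removed
    return curve
```

Censored patients contribute to the number at risk until their last follow-up but never lower the curve directly, which is why the non-recurrence group is handled through the date of last follow-up rather than being dropped.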
Table 1. Q-RT-PCR analysis of the BMI-1 mRNA expression in human prostate carcinoma cell lines
Figure imgf000048_0001
1 Normalized average expression value from four measurements
2 Two-tailed T-test compared to the NPEC
NPEC, normal human prostate epithelial cells
Table 2. Cancer types and number of cancer patients in the therapy outcome sets analyzed in this study
Figure imgf000049_0001
MCL, mantle cell lymphoma; AML, acute myeloid leukemia.
Table 3. The 11-gene signature associated with poor prognosis of cancer patients diagnosed with multiple types of cancer
Figure imgf000049_0002
Figure imgf000050_0001
Legend: The UniGene IDs were updated to correspond to the UniGene cluster IDs in build 183.
Table 4A. Cox Proportional Hazard Survival Regression Analysis
Figure imgf000050_0002
Figure imgf000051_0001
RP, radical prostatectomy; PSA, prostate specific antigen; SM, surgical margins; GLSN SUM, Gleason sum; Sem Ves Inv, seminal vesicle invasion; ECE, extracapsular extension.
EXAMPLE 3
Cyclin D1 Signature
The methods of the invention as described in Examples 1 and 2 were used to validate the cyclin D1 signature presented in Lamb et al., 2003, Cell 114:323, incorporated by reference herein, and to identify subsets of genes useful for predicting a phenotype in a subject. Results are presented in Figures 3A-3Q-1.
EXAMPLE 4
Myc Signature
The methods of the invention as described in Examples 1 and 2 were used to validate the myc signatures presented in Ellwood-Yen et al., 2003, Cancer Cell 4:223, incorporated by reference herein, and to identify subsets of genes useful for predicting a phenotype in a subject. Results are presented in Figures 4A-4C-1.
EXAMPLE 5
Most Variable-Gene Signature
The methods of the invention as described in Examples 1 and 2 were used to validate the most variable gene signature presented in Cheung et al., 2003, Nature Genetics 33:422, incorporated by reference herein in its entirety, and to identify a subset of genes useful for predicting a phenotype in a subject. Results are presented in Figures 5A-5V-2.
EXAMPLE 6
14q32Regulon Signature
The methods of the invention as described in Examples 1 and 2 were used to validate the 14q32Regulon signature presented in Morley et al., 2004, Nature 430:743, incorporated by reference herein, and to identify subsets of genes useful for predicting a phenotype in a subject. Results are presented in Figures 6A-6K-1.
EXAMPLE 7
Suz12 Signature
The methods of the invention as described in Examples 1 and 2 were used to validate the Suz12 signatures presented in Kirmizis et al., 2004, Genes and Development 18:15922, incorporated by reference herein, and to identify subsets of genes useful for predicting a phenotype in a subject. Results are presented in Figures 7A-7R-5.
All patents, patent applications and published references cited herein are hereby incorporated by reference in their entirety. While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims

CLAIMS
What is claimed is:
1. A method of generating a subset of genes for use in predicting a phenotype in a subject, comprising the steps of: a) obtaining a set of expression values for a set of genes in a first sample and a second sample by measuring the level of expression in said first sample and said second sample; b) identifying a set of genes that are differentially expressed by comparing said level of expression in said first sample with said level of expression in said second sample, wherein an expression value that is increased or decreased in said first sample, as compared to said second sample is differentially expressed; c) identifying a subset of genes for use in predicting a phenotype in a subject, wherein said subset is equal to or smaller than said set, by performing multivariate Cox analysis on said expression values for said set of genes which are differentially expressed.
2. The method of claim 1, further comprising the step of obtaining a relative weight coefficient for each member of said gene set.
3. The method of claim 1, further comprising the steps of: d) obtaining a relative weight coefficient for each member of said gene set; and e) multiplying said expression value by said relative weight coefficient to obtain an individual survival score for each member of said gene set.
4. The method of claim 3, wherein the sum of said individual survival scores is calculated to obtain a survival score.
5. The method of claim 1, further comprising the step of logarithmically transforming the expression value of each member of said gene set prior to performing said multivariate Cox analysis.
6. The method of claim 1, further comprising the steps of: b') logarithmically transforming the expression value of each member of said gene set; d) obtaining a relative weight coefficient for each member of said gene set; and e) multiplying said logarithmically transformed expression value by said relative weight coefficient to obtain an individual survival score for each member of said gene set.
7. The method of claim 6, wherein the sum of said individual survival scores is calculated to obtain a survival score.
8. The method of claim 1, wherein a gene with a p-value, as determined by multivariate Cox analysis, that is less than or equal to 0.25 is included in said subset.
9. The method of claim 1, wherein a gene with a p-value, as determined by multivariate Cox analysis, that is less than or equal to 0.1 is included in said subset.
10. The method of claim 1, wherein a gene with a p-value, as determined by multivariate Cox analysis, that is less than or equal to 0.075 is included in said subset.
11. The method of claim 1, wherein a gene with a p-value, as determined by multivariate Cox analysis, that is less than or equal to 0.05 is included in said subset.
12. The method of claim 1, further comprising the step of performing survival analysis.
13. The method of claim 12, wherein said survival analysis is Kaplan-Meier analysis.
14. The method of claim 13, wherein a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, is included in said subset.
15. The method of claim 13, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.075 is included in said subset.
16. The method of claim 13, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.05 is included in said subset.
17. A method of generating a subset of genes for use in predicting a phenotype in a subject comprising the steps of: a) obtaining a set of expression values for a set of genes in a first sample and a second sample by measuring the level of expression in said first sample and said second sample; b) identifying a set of genes that are differentially expressed by comparing said level of expression in said first sample with said level of expression in said second sample, wherein an expression value that is increased or decreased in said first sample, as compared to said second sample is differentially expressed; c) identifying a subset of genes for use in predicting a phenotype in a subject, wherein said subset is equal to or smaller than said set, by performing multivariate Cox analysis on said expression values for said set of genes which are differentially expressed; d) obtaining a relative weight coefficient for each member of said gene set; e) multiplying the expression value of each member of said gene set by said relative weight coefficient to obtain an individual survival score for each member of said set of genes; f) calculating the sum of said individual survival scores to obtain a survival score; and g) performing survival analysis.
18. The method of claim 17, wherein said survival analysis is Kaplan-Meier survival analysis.
19. The method of claim 18, wherein a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, is included in said subset.
20. The method of claim 18, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.075 is included in said subset.
21. The method of claim 18, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.05 is included in said subset.
22. The method of claim 17, further comprising the step of logarithmically transforming the expression value of each member of said gene set prior to performing said multivariate Cox analysis.
23. A method of identifying a subset of genes comprising the steps of: a) performing the method of claim 17; b) identifying genes with a p-value as determined by said multivariate Cox analysis that is less than or equal to 0.25; c) obtaining a relative weight coefficient for each member of said gene set; d) multiplying the expression value of each member of said gene set by said relative weight coefficient to obtain an individual survival score for each member of said set of genes; e) calculating the sum of said individual survival scores to obtain a survival score; and f) performing survival analysis.
24. The method of claim 23, wherein a gene with a p-value determined by multivariate Cox analysis that is less than or equal to 0.1 is included in said subset.
25. The method of claim 23, wherein a gene with a p-value determined by multivariate Cox analysis that is less than or equal to 0.075 is included in said subset.
26. The method of claim 23, wherein a gene with a p-value determined by multivariate Cox analysis that is less than or equal to 0.05 is included in said subset.
27. The method of claim 23, wherein said survival analysis is Kaplan-Meier analysis.
28. The method of claim 27, wherein a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, is included in said subset.
29. The method of claim 27, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.075 is included in said subset.
30. The method of claim 27, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.05 is included in said subset.
31. The method of claim 23, wherein steps b through f are repeated.
32. A method of identifying a subset of genes comprising the steps of: a) performing the method of claim 17; b) identifying genes with a p value as determined by said multivariate Cox analysis that is less than or equal to 0.25; c) performing multivariate Cox analysis on said set of genes identified in step b, wherein a gene with a p-value that is less than or equal to 0.1 is included in said subset; d) obtaining a relative weight coefficient for each member of said gene set; e) multiplying the expression value of each member of said gene set by said relative weight coefficient to obtain an individual survival score for each member of said set of genes; f) calculating the sum of said individual survival scores to obtain a survival score; and g) performing survival analysis.
33. The method of claim 32, wherein steps b-g are repeated.
34. The method of claim 32, wherein said survival analysis is Kaplan-Meier analysis.
35. The method of claim 34, wherein a gene with a p-value, as determined by Kaplan-Meier analysis, that is less than or equal to 0.1, is included in said subset.
36. The method of claim 34, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.075 is included in said subset.
37. The method of claim 34, wherein a gene with a p-value as determined by Kaplan-Meier analysis that is less than or equal to 0.05 is included in said subset.
38. The method of any one of claims 1, 17, 23 and 32, wherein said set of genes includes any of said sets identified in Figures 3-7 and Table 3.
39. The method of any one of claims 1, 17, 23 and 32, wherein said subset of genes includes at least one gene of any of said subsets identified in Figures 3-7 and Table 3.
40. A method of using a subset of genes to predict a phenotype in a subject comprising the steps of: a) isolating a sample from said subject; and b) analyzing said sample for expression of at least one member of said subset of genes.
41. The method of claim 40, wherein said phenotype is selected from the group consisting of disease outcome, diagnosis of a particular disease of interest, prognosis of a particular disease of interest, recurrence, non-recurrence, invasiveness, non-invasiveness, metastatic, non- metastatic, localized, organ confined, tumor grade, Gleason score, survival prognosis, lymph node status, tumor stage, degree of differentiation, age, hormone receptor status, PSA level, histologic type, and disease free survival, disease progression, remission, biochemical recurrence, metastatic recurrence, local recurrence, response to therapy, disease relapse, non- relapse, therapy failure and cure.
42. The method of claim 40, wherein said subset of genes is any one of said sets or subsets identified in Figures 3-7 and Table 3.
43. A method of determining the relevance of a set of genes comprising the steps of: a) obtaining a set of expression values for a set of genes in a first sample and a second sample by measuring the level of expression in said first sample and said second sample; b) identifying a set of genes that are differentially expressed by comparing said level of expression in said first sample with said level of expression in said second sample, wherein an expression value that is increased or decreased in said first sample, as compared to said second sample is differentially expressed; c) identifying a subset of genes for use in predicting a phenotype in a subject, wherein said subset is equal to or smaller than said set, by performing multivariate Cox analysis on said expression values for said set of genes which are differentially expressed; d) obtaining a relative weight coefficient for each member of said gene set; e) multiplying the expression value of each member of said gene set by said relative weight coefficient to obtain an individual survival score for each member of said set of genes; f) calculating the sum of said individual survival scores to obtain a survival score; and g) performing survival analysis.
44. A subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
45. A subset of genes for use in predicting a phenotype of a subject comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
46. A subset of genes comprising at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3, wherein said subset of genes is generated by the method of any one of claims 1, 17, 23 or 32.
47. A composition comprising a set of probes that hybridize to at least two of the genes presented in any one of the gene sets presented in Figures 3-7 and Table 3.
48. A subset of genes generated by the method of any one of claims 1, 17, 23 or 32.
49. A combination of gene subsets, wherein said combination comprises at least two of the subsets presented in Figures 3-7 and Table 3.
50. The combination of gene subsets of claim 46, wherein each subset of said combination comprises at least one gene of any of said subsets identified in Figures 3-7 and Table 3.
51. A kit comprising at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
52. A kit comprising a set of reagents for detecting the expression of at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
53. The kit of claim 52, wherein the kit comprises a set of probes that hybridize to at least two of the genes presented in any one of the gene sets or subsets presented in Figures 3-7 and Table 3.
54. The kit of claim 52 or 53, wherein said kit predicts the phenotype of a subject.
PCT/US2006/037916 2005-09-29 2006-09-29 Methods of identification and use of gene signatures WO2007041238A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US72187505P 2005-09-29 2005-09-29
US60/721,875 2005-09-29

Publications (3)

Publication Number Publication Date
WO2007041238A2 WO2007041238A2 (en) 2007-04-12
WO2007041238A9 true WO2007041238A9 (en) 2007-06-07
WO2007041238A3 WO2007041238A3 (en) 2009-04-30

Family

ID=37906713

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/037916 WO2007041238A2 (en) 2005-09-29 2006-09-29 Methods of identification and use of gene signatures

Country Status (1)

Country Link
WO (1) WO2007041238A2 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06815706

Country of ref document: EP

Kind code of ref document: A2