WO2014130617A1

WO2014130617A1 - Method of predicting breast cancer prognosis

Info

Publication number: WO2014130617A1
Application number: PCT/US2014/017279
Authority: WO
Inventors: Michael R. CRAGER; Kunbin Qu; George Andrew Watson; Samuel Levy
Original assignee: Genomic Health, Inc.
Priority date: 2013-02-22
Filing date: 2014-02-20
Publication date: 2014-08-28

Abstract

The present invention relates to biomarkers associated with breast cancer prognosis. These biomarkers include coding transcripts and their expression products, as well as non-coding transcripts, and are useful for predicting the likelihood of breast cancer recurrence in a breast cancer patient.

Description

METHOD OF PREDICTING BREAST CANCER PROGNOSIS

FIELD OF THE INVENTION

[0001] The present invention relates to biomarkers associated with breast cancer prognosis. These biomarkers include coding transcripts and their expression products, as well as non-coding transcripts, and are useful for predicting the likelihood of breast cancer recurrence in a breast cancer patient.

INTRODUCTION

[0002] For over a decade, technologies such as DNA microarray and reverse transcription polymerase chain reaction (RT-PCR) have demonstrated that levels of certain RNA transcripts ("gene expression profiles") relate to patient stratification and disease outcomes, especially in a variety of cancers. Several validated and now widely used clinical tests make use of gene expression profiling, such as the Oncotype DX® RT-PCR test, which measures the levels of 21 biomarker RNAs in archival formalin-fixed paraffin-embedded (FFPE) tissue. The Oncotype DX® RT-PCR test predicts the risk of recurrence of early estrogen receptor (ER)-positive breast cancer, as well as the likelihood of response to chemotherapy, and is now used to guide treatment decisions for about half of ER+ breast cancer patients in the U.S.

[0003] However, RT-PCR is constrained by the number of transcripts and sequence complexity that can be interrogated, especially given the limited amount of patient FFPE RNA available from many tumor specimens. Recent major advances in DNA sequencing ("next generation sequencing") provide massively parallel throughput and data volumes that eclipse the nucleic acid information content possible with other technologies, such as RT-PCR. Next generation sequencing of cellular RNA (RNA-seq) makes feasible unprecedented extensive genome analysis of groups of individuals, including analyses of sequence differences, polymorphisms, mutations, copy number variations, epigenetic variations and transcript abundance.

SUMMARY

[0004] A multiplexed, whole genome sequencing methodology was used to enable whole transcriptome breast cancer biomarker discovery using low amounts of FFPE tissue. The present invention provides biomarkers that associate, positively or negatively, with a particular clinical outcome in breast cancer in a way that adds value for prediction of outcome over and above an existing multi-gene assay, the Oncotype DX Recurrence Score. These biomarkers are listed in Tables 2-7. For example, the clinical outcome could be no cancer recurrence or cancer recurrence. The clinical outcome may be defined by clinical endpoints, such as disease or recurrence free survival, metastasis free survival, overall survival, etc.

[0005] The present invention accommodates the use of archived paraffin-embedded biopsy material for assay of all markers in the set, and therefore is compatible with the most widely available type of biopsy material. It is also compatible with other methods of tumor tissue harvest, for example, core biopsy or fine needle aspiration.

[0006] In one aspect, the present invention provides a method of predicting a likelihood of long-term survival without recurrence of breast cancer in a breast cancer patient. The method comprises determining a level of one or more RNA transcripts, or its expression product, in a breast cancer tumor sample obtained from the patient. The RNA transcript or its expression product may be selected from Table 2. The likelihood of long-term survival without breast cancer recurrence is then predicted based on the negative or positive correlation of the RNA transcript or its expression product with increased likelihood of long-term survival without breast cancer recurrence. An RNA transcript is negatively correlated with increased long-term survival without recurrence of breast cancer if its direction of association is marked "+" in Table 2, and is positively correlated with increased long-term survival without recurrence of breast cancer if its direction of association is marked "-" in Table 2.

[0007] In yet another aspect, the present invention provides a method of predicting a likelihood of long-term survival without recurrence of breast cancer in a breast cancer patient by determining a level of one or more non-coding sequences in a breast cancer tissue sample obtained from the patient. In one embodiment, the non-coding sequence is one or more intronic RNAs selected from Table 4. In a further embodiment, the non-coding sequence is one or more intergenic sequences selected from Table 6. The likelihood of long-term survival without breast cancer recurrence is predicted based on the negative or positive correlation of the non-coding sequence with increased likelihood of long-term survival without breast cancer recurrence. A non-coding sequence is negatively correlated with increased long-term survival without recurrence of breast cancer if its direction of association is marked "+" in Tables 4 and 6, and is positively correlated with increased long-term survival without recurrence of breast cancer if its direction of association is marked "-" in Tables 4 and 6.

[0008] Any of the above methods may utilize a combination of coding and non-coding RNA transcripts for predicting breast cancer prognosis. Moreover, any of the above methods may be performed by whole transcriptome sequencing, reverse transcription polymerase chain reaction (RT-PCR), or by array. Other methods known in the art may be used. In an embodiment of the invention, the breast cancer tumor sample is a fixed, wax-embedded tissue sample or a fine needle biopsy sample. In another embodiment, the level of the RNA transcript, or its expression product, or the level of the non-coding sequence may be normalized.

[0009] In an embodiment of the invention, a likelihood score (e.g., a score predicting a likelihood of long-term survival without breast cancer recurrence) can be calculated based on the level or normalized level of the coding RNA transcript, or an expression product thereof, and/or non-coding RNA transcript. A score may be calculated using weighted values based on the level or normalized level of the coding RNA transcript (or expression product thereof) and/or the non-coding RNA transcript, and its contribution to clinical outcome, such as long-term survival without breast cancer recurrence.
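As an illustrative sketch only, a likelihood score of the kind described in paragraph [0009] could be formed as a weighted sum of normalized levels. The function name, transcript names, and weights below are hypothetical and are not taken from this disclosure; the actual weighting scheme is not specified here.

```python
def likelihood_score(levels, weights):
    """Weighted sum of (normalized) transcript levels.

    levels  -- dict mapping transcript name to its normalized level
    weights -- dict mapping transcript name to its weight (hypothetical values)
    """
    missing = set(weights) - set(levels)
    if missing:
        raise KeyError(f"no level measured for: {sorted(missing)}")
    return sum(weights[name] * levels[name] for name in weights)


# Hypothetical example values, for illustration only.
example_levels = {"transcript_A": 2.0, "transcript_B": 1.0}
example_weights = {"transcript_A": 0.5, "transcript_B": -0.25}
score = likelihood_score(example_levels, example_weights)
```

In this sketch a negative weight stands for a transcript whose higher level lowers the score; whether a higher score indicates better or worse prognosis depends on how the score is defined.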

DETAILED DESCRIPTION

[0010] Before the present invention and specific exemplary embodiments of the invention are described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

[0011] Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

[0012] As used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "an RNA transcript" includes a plurality of such RNA transcripts.

[0013] Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. For example, Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, NY 1994), provide one skilled in the art with a general guide to many of the terms used in the present application.

[0014] Additionally, the practice of the present invention will employ, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, and biochemistry, which are within the skill of the art. Such techniques are explained fully in the literature, such as, "Molecular Cloning: A Laboratory Manual", 2nd edition (Sambrook et al., 1989); "Oligonucleotide Synthesis" (M.J. Gait, ed., 1984); "Animal Cell Culture" (R.I. Freshney, ed., 1987); "Methods in Enzymology" (Academic Press, Inc.); "Handbook of Experimental Immunology", 4th edition (D.M. Weir & C.C. Blackwell, eds., Blackwell Science Inc., 1987); "Gene Transfer Vectors for Mammalian Cells" (J.M. Miller & M.P. Calos, eds., 1987); "Current Protocols in Molecular Biology" (F.M. Ausubel et al., eds., 1987); and "PCR: The Polymerase Chain Reaction", (Mullis et al., eds., 1994).

[0015] The terms "cancer" and "cancerous" refer to or describe the physiological condition in mammals that is typically characterized by unregulated cell growth. An example of a cancer is breast cancer.

[0016] The term "correlates" or "correlating" as used herein refers to a statistical association between instances of two events, where events may include numbers, data sets, and the like. For example, when the events involve numbers, a positive correlation (also referred to herein as a "direct correlation") means that as one increases, the other increases as well. A negative correlation (also referred to herein as an "inverse correlation") means that as one increases, the other decreases. The present invention provides coding and non-coding RNA transcripts, or expression products thereof, the levels of which are correlated with a particular outcome measure, such as between the level of an RNA transcript and the likelihood of long-term survival without breast cancer recurrence. For example, the increased level of an RNA transcript may be positively correlated with a likelihood of a good clinical outcome for the patient, such as an increased likelihood of long-term survival without recurrence and/or a positive response to a chemotherapy, and the like. Such a positive correlation may be demonstrated statistically in various ways, e.g., by a low hazard ratio. In another example, the increased level of an RNA transcript may be negatively correlated with a likelihood of good clinical outcome for the patient. In this case, for example, the patient may have a decreased likelihood of long-term survival without recurrence of the cancer and/or a positive response to a chemotherapy, and the like. Such a negative correlation indicates that the patient likely has a poor prognosis or will respond poorly to a chemotherapy, and this may be demonstrated statistically in various ways, e.g., by a high hazard ratio.
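The relationship between hazard ratio and direction of correlation described in paragraph [0016] can be sketched as follows. This is a minimal illustration that assumes a log-hazard coefficient (called `beta` here, a hypothetical name) estimated for recurrence as the event, and that uses the conventional reference value of 1 to separate "low" from "high" hazard ratios; the disclosure itself does not state a cut-off.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio per unit increase in level, given a log-hazard coefficient."""
    return math.exp(beta)


def correlation_with_good_outcome(beta):
    """Classify direction relative to a good outcome (e.g., recurrence-free survival)."""
    hr = hazard_ratio(beta)
    if hr < 1.0:
        return "positive"  # higher level, lower hazard of recurrence
    if hr > 1.0:
        return "negative"  # higher level, higher hazard of recurrence
    return "none"
```

In practice a confidence interval or test statistic would accompany the point estimate; this sketch classifies on the point estimate alone.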

[0017] As used herein, the term "exon" refers to any segment of an interrupted gene that is represented in the mature RNA product (B. Lewin. Genes IV Cell Press, Cambridge Mass. 1990). As used herein, the terms "intron" and "intronic sequence" refer to any non-coding region found within genes.

[0018] The term "expression product" as used herein refers to an expression product of a coding RNA transcript. Thus, the term refers to a polypeptide or protein.

[0019] As used herein, the term "intergenic region" refers to a stretch of DNA or RNA sequences that is located between RefSeq identified genes. Intergenic regions are different from intragenic regions (or "introns"), which are non-coding regions that are found between exons within genes. An intergenic region may be comprised of one or more "intergenic sequences." As shown in the Examples below, five intergenic sequences were found to add prognostic information to the approximate Recurrence Score and correlate to long-term survival without breast cancer recurrence. Two of those intergenic sequences were newly identified. The intergenic sequences are readily available from publicly available information. For example, the UCSC Genome Browser available at http://genome.ucsc.edu/cgi-bin/hgGateway allows inputting of the coordinates, such as the chromosome number and the start and stop positions on the chromosome shown in Tables 6 and 7, to produce an output comprising that sequence.
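As a small, hedged sketch of the lookup described in paragraph [0019], a chromosome number with start and stop positions can be formatted as a `chrN:start-stop` position string, which is the form commonly accepted by genome browser search boxes. The function name and the coordinates used below are placeholders, not values from Tables 6 and 7, and the genome assembly must match the one used for those tables.

```python
def ucsc_position(chromosome, start, stop):
    """Format coordinates as a 'chrN:start-stop' position string."""
    if start > stop:
        raise ValueError("start must not exceed stop")
    chrom = str(chromosome)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return f"{chrom}:{start}-{stop}"


# Placeholder coordinates, not taken from Tables 6 and 7.
position = ucsc_position(1, 1000, 2000)
```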

[0020] As used herein, the terms "long intergenic non-coding RNAs" and "lincRNAs" are used interchangeably and refer to non-coding transcripts that are typically longer than 200 nucleotides.

[0021] The term "level" as used herein refers to qualitative or quantitative determination of the number of copies of a coding or non-coding RNA transcript or a polypeptide/protein. An RNA transcript or a polypeptide/protein exhibits an "increased level" when the level of the RNA transcript or polypeptide/protein is higher in a first sample, such as in a clinically relevant subpopulation of patients (e.g., patients who have experienced cancer recurrence), than in a second sample, such as in a related subpopulation (e.g., patients who did not experience cancer recurrence). In the context of an analysis of a level of an RNA transcript or a polypeptide/protein in a tumor sample obtained from an individual patient, an RNA transcript or polypeptide/protein exhibits "increased level" when the level of the RNA transcript or polypeptide/protein in the subject trends toward, or more closely approximates, the level characteristic of a clinically relevant subpopulation of patients.

[0022] Thus, for example, when the RNA transcript analyzed is an RNA transcript that shows an increased level in subjects that experienced long-term survival without cancer recurrence as compared to subjects that did not experience long-term survival without cancer recurrence, then an "increased" level of a given RNA transcript can be described as being positively correlated with a likelihood of long-term survival without cancer recurrence. If the level of the RNA transcript in an individual patient being assessed trends toward a level characteristic of a subject who experienced long-term survival without cancer recurrence, the level of the RNA transcript supports a determination that the individual patient is more likely to experience long-term survival without cancer recurrence. If the level of the RNA transcript in the individual patient trends toward a level characteristic of a subject who experienced cancer recurrence, then the level of the RNA transcript supports a determination that the individual patient is more likely to experience cancer recurrence.

[0023] The term "likelihood score" is an arithmetically or mathematically calculated numerical value for aiding in simplifying or disclosing or informing the analysis of more complex quantitative information, such as the correlation of certain levels of the disclosed RNA transcripts, their expression products, or gene networks to a likelihood of a certain clinical outcome in a breast cancer patient, such as likelihood of long-term survival without breast cancer recurrence. A likelihood score may be determined by the application of a specific algorithm. The algorithm used to calculate the likelihood score may group the RNA transcripts, or their expression products, into gene networks. A likelihood score may be determined for a gene network by determining the level of one or more RNA transcripts, or an expression product thereof, and weighting their contributions to a certain clinical outcome such as recurrence. A likelihood score may also be determined for a patient. In an embodiment, a likelihood score is a recurrence score, wherein an increase in the recurrence score negatively correlates with an increased likelihood of long-term survival without breast cancer recurrence. In other words, an increase in the recurrence score correlates with bad prognosis. Examples of methods for determining the likelihood score or recurrence score are disclosed in U.S. Patent No. 7,526,387.

[0024] The term "long-term" survival as used herein refers to survival for at least 3 years. In other embodiments, it may refer to survival for at least 5 years, or for at least 10 years following surgery or other treatment.

[0025] As used herein, the term "normalized" with regard to a coding or non-coding RNA transcript, or an expression product of the coding RNA transcript, refers to the level of the RNA transcript, or its expression product, relative to the mean levels of transcript/product of a set of reference RNA transcripts, or their expression products. The reference RNA transcripts, or their expression products, are based on their minimal variation across patients, tissues, or treatments. Alternatively, the coding or non-coding RNA transcript, or its expression product, may be normalized to the totality of tested RNA transcripts, or a subset of such tested RNA transcripts.
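A minimal sketch of the normalization described in paragraph [0025] follows. It assumes that levels are already on a logarithmic scale, so that "relative to the mean" is implemented as subtraction of the reference mean; that choice, the function name, and the example values are assumptions made for illustration and are not specified in this disclosure.

```python
from statistics import mean


def normalize_level(target_level, reference_levels):
    """Return the target level relative to the mean reference level.

    Assumes log-scale inputs, so "relative to" is a subtraction.
    """
    if not reference_levels:
        raise ValueError("at least one reference level is required")
    return target_level - mean(reference_levels)


# Hypothetical log-scale values, for illustration only.
normalized = normalize_level(8.0, [10.0, 11.0, 12.0])
```

The same function covers the alternative in which the reference set is the totality, or a subset, of the tested transcripts: only the contents of `reference_levels` change.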

[0026] As used herein, the term "pathology" of cancer includes all phenomena that compromise the well-being of the patient. This includes, without limitation, abnormal or uncontrollable cell growth, metastasis, interference with the normal functioning of neighboring cells, release of cytokines or other secretory products at abnormal levels, suppression or aggravation of inflammatory or immunological response, neoplasia, premalignancy, malignancy, and invasion of surrounding or distant tissues or organs, such as lymph nodes.

[0027] A "patient response" may be assessed using any endpoint indicating a benefit to the patient, including, without limitation, (1) inhibition, to some extent, of tumor growth, including slowing down and complete growth arrest; (2) reduction in the number of tumor cells; (3) reduction in tumor size; (4) inhibition (i.e., reduction, slowing down or complete stopping) of tumor cell infiltration into adjacent peripheral organs and/or tissues; (5) inhibition (i.e. reduction, slowing down or complete stopping) of metastasis; (6) enhancement of anti-tumor immune response, which may, but does not have to, result in the regression or rejection of the tumor; (7) relief, to some extent, of one or more symptoms associated with the cancer; (8) increase in the length of survival following treatment; and/or (9) decreased mortality at a given point of time following treatment.

[0028] The term "prognosis" as used herein, refers to the prediction of the likelihood of cancer-attributable death or progression, including recurrence, metastatic spread, and drug resistance, of neoplastic disease, such as breast cancer. The term "prediction" is used herein to refer to the likelihood that a patient will respond either favorably or unfavorably to a drug or set of drugs, and also the extent of those responses, or that a patient will survive, following surgical removal of the primary tumor and/or chemotherapy for a certain period of time without cancer recurrence. The methods of the present invention can be used clinically to make treatment decisions by choosing the most appropriate treatment modalities for any particular patient. The methods of the present invention are tools in predicting if a patient is likely to respond favorably to a treatment regimen, such as surgical intervention, chemotherapy with a given drug or drug combination, and/or radiation therapy, or whether long-term survival of the patient without cancer recurrence is likely, following surgery and/or termination of chemotherapy or other treatment modalities.

[0029] The term "breast cancer prognostic biomarker" refers to an RNA transcript, or an expression product thereof, intronic RNA, lincRNA, and/or intergenic sequence, found to be associated with long term survival without breast cancer recurrence as disclosed herein.

[0030] The term "reference" RNA transcript or an expression product thereof, as used herein, refers to an RNA transcript or an expression product thereof, whose level can be used to compare the level of an RNA transcript or its expression product in a test sample. In an embodiment of the invention, reference RNA transcripts include housekeeping genes, such as beta-globin, alcohol dehydrogenase, or any other RNA transcript, the level or expression of which does not vary depending on the disease status of the cell containing the RNA transcript or its expression product. In another embodiment, all of the assayed RNA transcripts, or their expression products, or a subset thereof, may serve as reference RNA transcripts or reference RNA expression products.

[0031] As used herein, the term "RefSeq RNA" refers to an RNA that can be found in the Reference Sequence (RefSeq) database, a collection of publicly available nucleotide sequences and their protein products built by the National Center for Biotechnology Information (NCBI). The RefSeq database provides an annotated, non-redundant record for each natural biological molecule (i.e., DNA, RNA or protein) included in the database. Thus, a sequence of a RefSeq RNA is well-known and can be found in the RefSeq database at http://www.ncbi.nlm.nih.gov/RefSeq/. See also Pruitt et al., Nucl. Acids Res. 33(Supp 1):D501-D504 (2005). Accession numbers for each RefSeq, which include accession numbers for any alternative splice forms, are provided in Tables 2 and 3. The intronic sequences for a RefSeq are also publicly available. Nonetheless, the coordinates for each intronic sequence are listed in Tables 4 and 5. Tables 4 and 5 also provide accession numbers for any alternative splice forms. Therefore, the sequence of each RNA in Tables 2-5 is readily available from publicly available sources.

[0032] As used herein, the term "RNA transcript" refers to the RNA transcription product of DNA and includes coding and non-coding RNA transcripts. RNA transcripts include, for example, mRNA, an unspliced RNA, a splice variant mRNA, a microRNA, fragmented RNA, long intergenic non-coding RNAs (lincRNAs), intergenic RNA sequences, and intronic RNAs.

[0033] The terms "subject," "individual," and "patient" are used interchangeably herein to refer to a mammal being assessed for treatment and/or being treated. In an embodiment, the mammal is a human. The terms "subject," "individual," and "patient" thus encompass individuals having cancer (e.g., breast cancer), including those who have undergone or are candidates for resection (surgery) to remove cancerous tissue.

[0034] As used herein, the term "surgery" applies to surgical methods undertaken for removal of cancerous tissue, including mastectomy, lumpectomy, lymph node removal, sentinel lymph node dissection, prophylactic mastectomy, prophylactic ovary removal, cryotherapy, and tumor biopsy. The tumor samples used for the methods of the present invention may have been obtained from any of these methods.

[0035] The term "tumor" as used herein, refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

[0036] The term "tumor sample" as used herein refers to a sample comprising tumor material obtained from a cancer patient. The term encompasses tumor tissue samples, for example, tissue obtained by surgical resection and tissue obtained by biopsy, such as for example, a core biopsy or a fine needle biopsy. In a particular embodiment, the tumor sample is a fixed, wax-embedded tissue sample, such as a formalin-fixed, paraffin-embedded tissue sample. Additionally, the term "tumor sample" encompasses a sample comprising tumor cells obtained from sites other than the primary tumor, e.g., circulating tumor cells. The term also encompasses cells that are the progeny of the patient's tumor cells, e.g., cell culture samples derived from primary tumor cells or circulating tumor cells. The term further encompasses samples that may comprise protein or nucleic acid material shed from tumor cells in vivo, e.g., bone marrow, blood, plasma, serum, and the like. The term also encompasses samples that have been enriched for tumor cells or otherwise manipulated after their procurement and samples comprising polynucleotides and/or polypeptides that are obtained from a patient's tumor material.

[0037] As used herein, "whole transcriptome sequencing" refers to the use of high throughput sequencing technologies to sequence the entire transcriptome in order to get information about a sample's RNA content. Whole transcriptome sequencing can be done with a variety of platforms, for example, the Genome Analyzer (Illumina, Inc., San Diego, CA), the SOLiD™ Sequencing System (Life Technologies, Carlsbad, CA), Ion Torrent (Life Technologies, Carlsbad, CA), and GS FLX and GS Junior Systems (454 Life Sciences, Roche, Branford, CT). However, any platform useful for whole transcriptome sequencing may be used.

[0038] The term "RNA-Seq" or "transcriptome sequencing" refers to sequencing performed on RNA (or cDNA) instead of DNA, where typically, the primary goal is to measure expression levels, detect fusion transcripts, alternative splicing, and other genomic alterations that can be better assessed from RNA. RNA-Seq includes whole transcriptome sequencing as well as target specific sequencing.

[0039] The term "computer-based system," as used herein, refers to the hardware means, software means, and data storage means used to analyze information. The minimum hardware of a patient computer-based system comprises a central processing unit (CPU), input means, output means, and data storage means. A skilled artisan can readily appreciate that many of the currently available computer-based systems are suitable for use in the present invention and may be programmed to perform the specific measurement and/or calculation functions of the present invention.

[0040] To "record" data, programming or other information on a computer readable medium refers to a process for storing information, using any such methods as known in the art. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.

[0041] A "processor" or "computing means" references any hardware and/or software combination that will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of an electronic controller, mainframe, server or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic medium or optical disk may carry the programming, and can be read by a suitable reader communicating with each processor at its corresponding station.

[0042] The present invention provides RNA transcripts that are prognostic for breast cancer. These RNA transcripts are listed in Tables 2-7 and include coding and non-coding RNA transcripts. An RNA transcript, or an expression product thereof, is negatively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the RNA transcript is marked "+" in Tables 2-7, and is positively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the RNA transcript is marked "-" in Tables 2-7.
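The sign convention of paragraph [0042] is easy to invert by mistake, so the following sketch spells it out as a lookup. The function name is hypothetical; the mapping itself restates the text: a "+" mark means a negative correlation with long-term survival without recurrence, and a "-" mark means a positive correlation.

```python
def correlation_with_survival(direction):
    """Map a direction-of-association mark to the correlation with
    long-term survival without breast cancer recurrence."""
    mapping = {"+": "negative", "-": "positive"}
    try:
        return mapping[direction.strip()]
    except KeyError:
        raise ValueError(f"unexpected direction mark: {direction!r}")
```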

[0043] The present invention provides methods that utilize the RNA transcripts and associated information. For example, the present invention provides a method of predicting a likelihood that a breast cancer patient will exhibit long-term survival without breast cancer recurrence. The methods of the invention comprise determining the level of at least one RNA transcript, or an expression product thereof, in a tumor sample, and determining the likelihood of long-term survival without breast cancer recurrence based on the correlation between the level of the RNA transcript, or its expression product, and long-term survival without breast cancer recurrence.

[0044] For all aspects of the present invention, the methods may further include determining the level of at least two RNA transcripts, or their expression products. It is also contemplated that the methods of the present invention may further include determining the level of at least three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, or twenty-five of the RNA transcripts, or their expression products. The methods of the present invention may include measuring all of the RNA transcripts, or their expression products, provided in Table 2 and/or 3. It is further contemplated that the method of the invention may further include determining the level of at least 2 to at least 10, at least 10 to at least 20, at least 20 to at least 30, at least 30 to at least 40, at least 40 to at least 50, at least 50 to at least 60, at least 60 to at least 70, at least 70 to at least 80, at least 80 to at least 90, at least 90 to at least 100, at least 100 to 115 of the RNA transcripts, or their expression products. For example, the levels of at least two RNA transcripts, or their expression products, selected from Table 2 and/or 3 may be determined. Furthermore, for example, the levels of BAG1, Bcl2, CCNB1, CD68, CEGP1, CTSL2, EstR1, GRB7, GSTM1, HER2, Ki-67, MYBL2, PR, STK15, STMY3, SURV, or their expression products, and one or more RNA transcript, or its expression product, selected from Table 2 and/or Table 3 may be determined.

[0045] Similarly, the methods may include determining the level of one or more non-coding RNA transcripts. It is also contemplated that the methods of the present invention may further include determining the level of at least two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, or twenty-five of the non-coding RNA transcripts. The methods of the present invention may include measuring all of the non-coding RNA transcripts provided in Table 4, 5, 6, and/or 7. It is further contemplated that the method of the invention may further include determining the level of at least 2 to at least 10, at least 10 to at least 20, at least 20 to at least 30, at least 30 to at least 40, at least 40 to at least 50, at least 50 to at least 60, at least 60 to at least 70, at least 70 to at least 80, at least 80 to at least 90, at least 90 to 93 of the non-coding RNA transcripts. For example, the levels of at least two non-coding RNA transcripts selected from Table 4, 5, 6, and/or 7 may be determined. Furthermore, for example, the levels of BAG1, Bcl2, CCNB1, CD68, CEGP1, CTSL2, EstR1, GRB7, GSTM1, HER2, Ki-67, MYBL2, PR, STK15, STMY3, SURV, or their expression products, and one or more non-coding RNA transcript selected from Table 4, 5, 6, and/or 7 may be determined.

[0046] Coding and non-coding RNA transcripts may be combined in any of the methods described herein.

[0047] The RNA transcripts and associated information provided by the present invention also have utility in the development of therapies to treat cancers and screening patients for inclusion in clinical trials. The RNA transcripts and associated information may further be used to design or produce a reagent that modulates the level or activity of the RNA transcript or its expression product. Such reagents may include, but are not limited to, a drug, an antisense RNA, a small inhibitory RNA (siRNA), a ribozyme, a small molecule, a monoclonal antibody, and a polyclonal antibody.

[0048] In various embodiments of the methods of the present invention, various technological approaches are available for determining the levels of the RNA transcripts, including, without limitation, whole transcriptome sequencing, RT-PCR, microarrays, and serial analysis of gene expression (SAGE), which are described in more detail below.

CORRELATING LEVEL OF AN RNA TRANSCRIPT OR AN EXPRESSION PRODUCT TO A CLINICAL OUTCOME

[0049] One skilled in the art will recognize that there are many statistical methods that may be used to determine whether there is a correlation between an outcome of interest (e.g., likelihood of survival) and levels of RNA transcripts or their expression products as described here. This relationship can be presented as a continuous recurrence score (RS), or patients may be stratified into risk groups (e.g., low, intermediate, high). For example, a Cox proportional hazards regression model may be fit to a particular clinical endpoint (e.g., RFI, DFS, OS). One assumption of the Cox proportional hazards regression model is the proportional hazards assumption, i.e., the assumption that effect parameters multiply the underlying hazard. Assessments of model adequacy may be performed including, but not limited to, examination of the cumulative sum of martingale residuals. One skilled in the art would recognize that there are numerous statistical methods that may be used (e.g., Royston and Parmar (2002), smoothing spline, etc.) to fit a flexible parametric model using the hazard scale and the Weibull distribution with natural spline smoothing of the log cumulative hazards function, with effects for treatment (chemotherapy or observation) and RS allowed to be time-dependent. (See, e.g., P. Royston, M. Parmar, Statistics in Medicine 21(15):2175-2197 (2002).)
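
For reference, the proportional hazards assumption noted above may be written out as follows. The symbols h_0, β, and x are notation introduced here for illustration only and do not appear in the application as filed:

```latex
h(t \mid x) = h_0(t)\,\exp(\beta x),
\qquad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\bigl(\beta\,(x_1 - x_2)\bigr)
```

Here h_0(t) denotes the baseline hazard and x a covariate such as a normalized transcript level or a recurrence score. Under this form, the ratio of hazards for two patients does not depend on time t, which is the property that the model adequacy assessments are intended to check.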

[0050] In an embodiment, power calculations may be carried out for the Cox proportional hazards model with a single non-binary covariate using the method proposed by F. Hsieh and P. Lavori, Control Clin Trials 21:552-560 (2000) as implemented in PASS 2008.

[0051] The coding and non-coding RNA transcripts, and any expression products thereof, of the present invention are listed in Tables 2-7. In an embodiment of the invention, a level of one or more RNA transcripts, or an expression product thereof, listed in Tables 2 or 3, is negatively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the RNA transcript is marked "+" in Table 2 or 3, and is positively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the RNA transcript is marked "-" in Tables 2 or 3.

[0052] In a further embodiment of the invention, a level of an intronic RNA selected from Table 4 or 5 is negatively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the intronic RNA is marked "+" in Table 4 or 5, and is positively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the intronic RNA is marked "-" in Table 4 or 5.

[0053] In another embodiment, a level of one or more intergenic sequence listed in Table 6 or 7 is negatively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the intergenic sequence or intergenic region is marked "+" in Table 6 or 7, and is positively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the intergenic sequence or intergenic region is marked "-" in Table 6 or 7.

[0054] In yet another embodiment, a likelihood score is determined for assessing the likelihood of a certain clinical outcome in a breast cancer patient, such as likelihood of long-term survival without breast cancer recurrence. A likelihood score may be calculated by determining the level of one or more RNA transcripts, or their expression products, selected from Tables 2-3, and mathematically weighting the contribution of each to the clinical outcome.
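
By way of illustration only, one simple way to weight the contribution of each transcript is a weighted sum of normalized levels. The following is a minimal sketch under that assumption. The function name, transcript names, weights, and levels are hypothetical and are not values disclosed in Tables 2-3.

```python
def likelihood_score(levels, weights):
    """Weighted sum of normalized transcript levels (illustrative form only)."""
    missing = set(weights) - set(levels)
    if missing:
        raise KeyError(f"missing levels for: {sorted(missing)}")
    return sum(weights[name] * levels[name] for name in weights)


# Hypothetical weights: a positive weight raises the score as the level rises,
# a negative weight lowers it.
example_weights = {"transcript_A": 0.5, "transcript_B": -0.25}
# Hypothetical normalized levels for one patient.
example_levels = {"transcript_A": 8.0, "transcript_B": 4.0}

example_score = likelihood_score(example_levels, example_weights)
```

In practice the weights might be taken from a fitted regression model, such as the Cox model discussed above. That choice is an assumption of this sketch.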

METHODS TO PREDICT LIKELIHOOD OF LONG-TERM SURVIVAL WITHOUT BREAST CANCER RECURRENCE

[0055] As described above, a number of coding and non-coding RNA transcripts that correlate with breast cancer prognosis were identified. The levels of these RNA transcripts, or their expression products, can be determined in a tumor sample obtained from an individual patient who has breast cancer and for whom treatment is being contemplated. Depending on the outcome of the assessment, treatment with chemotherapy may be indicated, or an alternative treatment regimen may be indicated.

[0056] In carrying out the method of the present invention, a tumor sample is assayed or measured for a level of an RNA transcript, or its expression product. The tumor sample can be obtained from a solid tumor, e.g., via biopsy, or from a surgical procedure carried out to remove a tumor; or from a tissue or bodily fluid that contains cancer cells. In an embodiment of the invention, the tumor sample is obtained from a patient with breast cancer, such as ER-positive breast cancer. In another embodiment, the level of an RNA transcript, or its expression product, is normalized relative to the level of one or more reference RNA transcripts, or their expression products.

[0057] In an embodiment of the invention, the likelihood of long-term survival without breast cancer recurrence in an individual patient is predicted by comparing, directly or indirectly, the level or normalized level of the RNA transcript, or its expression product, in the tumor sample from the individual patient to the level or normalized level of the RNA transcript, or its expression product, in a clinically relevant subpopulation of patients. Thus, as explained above, when the RNA transcript, or its expression product, analyzed is an RNA transcript, or an expression product, that shows increased level in subjects that experienced long-term survival without breast cancer recurrence as compared to subjects that experienced breast cancer recurrence, then if the level of the RNA transcript, or its expression product, in an individual patient being assessed trends toward a level characteristic of a subject with long-term survival without breast cancer recurrence, then the RNA transcript or its expression product level supports a determination that the individual patient is more likely to experience long-term survival without breast cancer recurrence. Similarly, where the RNA transcript or its expression product analyzed is an RNA transcript or expression product that is increased in subjects who have experienced breast cancer recurrence as compared to subjects who have experienced long-term survival without breast cancer recurrence, then if the level of the RNA transcript, or its expression product, in an individual patient being assessed trends toward a level characteristic of a subject with breast cancer recurrence, then the RNA transcript or expression product level supports a determination that the individual patient will more likely experience breast cancer recurrence. Thus, the level of a given RNA transcript, or its expression product, can be described as being positively correlated with a likelihood of long-term survival without breast cancer recurrence, or as being negatively correlated with a likelihood of long-term survival without breast cancer recurrence.

[0058] It is understood that the level or normalized level of an RNA transcript, or its expression product, from an individual patient can be compared, directly or indirectly, to the level or normalized level of the RNA transcript, or its expression product, in a clinically relevant subpopulation of patients. For example, when compared indirectly, the level or normalized level of the RNA transcript, or its expression product, from the individual patient may be used to calculate a likelihood of long-term survival without breast cancer recurrence, such as a likelihood/recurrence score (RS) as described herein, and compared to a calculated score in the clinically relevant subpopulation of patients.
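
As one hedged illustration of an indirect comparison, a patient's calculated score may be located within the distribution of scores from a reference subpopulation. The sketch below reports the percentage of reference scores at or below the patient's score. The function name and the reference values are hypothetical, and a percentile is only one of many possible ways to make the comparison.

```python
from bisect import bisect_right


def percentile_in_reference(patient_score, reference_scores):
    """Percent of reference scores that are less than or equal to the patient's score."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    ordered = sorted(reference_scores)
    # bisect_right counts how many ordered values are <= patient_score
    return 100.0 * bisect_right(ordered, patient_score) / len(ordered)


# Hypothetical scores from a clinically relevant subpopulation.
reference = [5.0, 12.0, 18.0, 25.0, 40.0]
patient_percentile = percentile_in_reference(18.0, reference)
```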

METHODS OF ASSAYING LEVELS OF RNA TRANSCRIPTS OR THEIR EXPRESSION PRODUCTS

[0059] Methods of expression profiling include methods based on sequencing of polynucleotides, methods based on hybridization analysis of polynucleotides, and proteomics-based methods. Representative methods for sequencing-based analysis include Massively Parallel Sequencing (see e.g., Tucker et al., The American J. Human Genetics 85:142-154, 2009) and Serial Analysis of Gene Expression (SAGE). Exemplary methods known in the art for the quantification of mRNA expression in a sample include northern blotting and in situ hybridization (Parker & Barnes, Methods in Molecular Biology 106:247-283 (1999)); RNAse protection assays (Hod, Biotechniques 13:852-854 (1992)); and PCR-based methods, such as reverse transcription polymerase chain reaction (RT-PCR) (Weis et al., Trends in Genetics 8:263-264 (1992)). Antibodies may be employed that can recognize sequence-specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes.

Nucleic Acid Sequencing-Based Methods

[0060] Nucleic acid sequencing technologies are suitable methods for expression analysis. The principle underlying these methods is that the number of times a cDNA sequence is detected in a sample is directly related to the relative RNA levels corresponding to that sequence. These methods are sometimes referred to by the term Digital Gene Expression (DGE) to reflect the discrete numeric property of the resulting data. Early methods applying this principle were Serial Analysis of Gene Expression (SAGE) and Massively Parallel Signature Sequencing (MPSS). See, e.g., S. Brenner, et al., Nature Biotechnology 18(6):630-634 (2000).
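
The counting principle described above can be sketched as follows. This is a minimal illustration that assumes each sequenced read has already been assigned to a transcript identifier. The function name, the identifiers, and the scaling to counts per million are assumptions of this sketch rather than steps specified in the application.

```python
from collections import Counter


def counts_per_million(read_assignments):
    """Convert per-read transcript assignments into counts per million reads."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were provided")
    # Scale each discrete count by the total so samples of different depth are comparable
    return {transcript: n * 1_000_000 / total for transcript, n in counts.items()}


# Hypothetical assignments: three reads for tx1 and one read for tx2.
reads = ["tx1", "tx2", "tx1", "tx1"]
cpm = counts_per_million(reads)
```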

[0061] More recently, the advent of "next-generation" sequencing technologies has made DGE simpler, higher throughput, and more affordable. As a result, more laboratories are able to utilize DGE to screen the expression of more nucleic acids in more individual patient samples than previously possible. See, e.g., J. Marioni, Genome Research 18(9):1509-1517 (2008); R. Morin, Genome Research 18(4):610-621 (2008); A. Mortazavi, Nature Methods 5(7):621-628 (2008); N. Cloonan, Nature Methods 5(7):613-619 (2008). Massively parallel sequencing methods have also enabled whole genome or transcriptome sequencing, allowing the analysis of not only coding but also non-coding sequences. As reviewed in Tucker et al., The American J. Human Genetics 85:142-154 (2009), there are several commercially available massively parallel sequencing platforms, such as the Illumina Genome Analyzer (Illumina, Inc., San Diego, CA), Applied Biosystems SOLiD™ Sequencer (Life Technologies, Carlsbad, CA), Roche GS-FLX 454 Genome Sequencer (Roche Applied Science, Germany), and the Helicos® Genetic Analysis Platform (Helicos Biosciences Corp., Cambridge, MA). Other developing technologies may be used.

Reverse Transcription PCR (RT-PCR)

[0062] The starting material is typically total RNA isolated from a human tumor, usually from a primary tumor. Optionally, normal tissues from the same patient can be used as an internal control. RNA can be extracted from a tissue sample, e.g., from a sample that is fresh, frozen (e.g. fresh frozen), or paraffin-embedded and fixed (e.g. formalin-fixed).

[0063] General methods for RNA extraction are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., Current Protocols of Molecular Biology, John Wiley and Sons (1997). Methods for RNA extraction from paraffin embedded tissues are disclosed, for example, in Rupp and Locker, Lab Invest. 56:A67 (1987), and De Andres et al., BioTechniques 18:42-44 (1995). In particular, RNA isolation can be performed using a purification kit, buffer set and protease from commercial manufacturers, such as Qiagen, according to the manufacturer's instructions. For example, total RNA from cells in culture can be isolated using Qiagen RNeasy mini-columns. Other commercially available RNA isolation kits include MasterPure™ Complete DNA and RNA Purification Kit (EPICENTRE®, Madison, WI), and Paraffin Block RNA Isolation Kit (Ambion, Inc.). Total RNA from tissue samples can be isolated using RNA Stat-60 (Tel-Test). RNA prepared from a tumor sample can be isolated, for example, by cesium chloride density gradient centrifugation. The isolated RNA may then be depleted of ribosomal RNA as described in U.S. Pub. No. 2011/0111409.

[0064] The sample containing the RNA is then subjected to reverse transcription to produce cDNA from the RNA template, followed by exponential amplification in a PCR reaction. The two most commonly used reverse transcriptases are avian myeloblastosis virus reverse transcriptase (AMV-RT) and Moloney murine leukemia virus reverse transcriptase (MMLV-RT). The reverse transcription step is typically primed using specific primers, random hexamers, or oligo-dT primers, depending on the circumstances and the goal of expression profiling. For example, extracted RNA can be reverse-transcribed using a GeneAmp RNA PCR kit (Perkin Elmer, CA, USA), following the manufacturer's instructions. The derived cDNA can then be used as a template in the subsequent PCR reaction.

[0065] PCR-based methods use a thermostable DNA-dependent DNA polymerase, such as a Taq DNA polymerase. For example, TaqMan® PCR typically utilizes the 5'-nuclease activity of Taq or Tth polymerase to hydrolyze a hybridization probe bound to its target amplicon, but any enzyme with equivalent 5' nuclease activity can be used. Two oligonucleotide primers are used to generate an amplicon typical of a PCR reaction product. A third oligonucleotide, or probe, can be designed to facilitate detection of a nucleotide sequence of the amplicon located between the hybridization sites of the two PCR primers. The probe can be detectably labeled, e.g., with a reporter dye, and can further be provided with both a fluorescent dye and a quencher fluorescent dye, as in a TaqMan® probe configuration. Where a TaqMan® probe is used, during the amplification reaction, the Taq DNA polymerase enzyme cleaves the probe in a template-dependent manner. The resultant probe fragments disassociate in solution, and signal from the released reporter dye is free from the quenching effect of the second fluorophore. One molecule of reporter dye is liberated for each new molecule synthesized, and detection of the unquenched reporter dye provides the basis for quantitative interpretation of the data.

[0066] TaqMan® RT-PCR can be performed using commercially available equipment, such as, for example, ABI PRISM 7700™ Sequence Detection System™ (Perkin-Elmer-Applied Biosystems, Foster City, CA, USA), or Lightcycler (Roche Molecular Biochemicals, Mannheim, Germany). In a preferred embodiment, the 5' nuclease procedure is run on a real-time quantitative PCR device such as the ABI PRISM 7700™ Sequence Detection System™. The system consists of a thermocycler, laser, charge-coupled device (CCD) camera, and computer. The system amplifies samples in a 384-well format on a thermocycler. The RT-PCR may be performed in triplicate wells with an equivalent of 2 ng RNA input per 10 μL reaction volume. During amplification, laser-induced fluorescent signal is collected in real-time through fiber optics cables for all wells, and detected at the CCD. The system includes software for running the instrument and for analyzing the data.

[0067] 5'-Nuclease assay data are generally initially expressed as a threshold cycle ("C_t"). Fluorescence values are recorded during every cycle and represent the amount of product amplified to that point in the amplification reaction. The threshold cycle (C_t) is generally described as the point when the fluorescent signal is first recorded as statistically significant.

[0068] To minimize errors and the effect of sample-to-sample variation, RT-PCR is usually performed using an internal standard. The ideal internal standard gene (also referred to as a reference gene) is expressed at a constant level among cancerous and non-cancerous tissue of the same origin (i.e., a level that is not significantly different among normal and cancerous tissues), and is not significantly affected by the experimental treatment (i.e., does not exhibit a significant difference in expression level in the relevant tissue as a result of exposure to chemotherapy). RNAs most frequently used to normalize patterns of gene expression are mRNAs for the housekeeping genes glyceraldehyde-3-phosphate-dehydrogenase (GAPDH) and β-actin. Gene expression measurements can be normalized relative to the mean of one or more (e.g., 2, 3, 4, 5, or more) reference genes. Reference-normalized expression measurements can range from 0 to 15, where a one unit increase generally reflects a 2-fold increase in RNA quantity.
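
A minimal sketch of one way such reference normalization might be computed from threshold cycles is given below. It assumes that the normalized value is the mean reference C_t minus the target C_t, plus a constant offset chosen to keep values positive. The function name and the offset of 10 are hypothetical and are not taken from the application. Because a lower C_t indicates more starting template, a one-cycle decrease in target C_t yields a one-unit increase, corresponding to roughly a 2-fold increase in RNA if amplification efficiency is near 100%.

```python
def reference_normalized(target_ct, reference_cts, offset=10.0):
    """Normalize a target C_t against the mean C_t of reference genes.

    The offset is a hypothetical constant used only to shift values into a
    positive range; it is an assumption of this sketch.
    """
    if not reference_cts:
        raise ValueError("at least one reference C_t is required")
    mean_reference_ct = sum(reference_cts) / len(reference_cts)
    return mean_reference_ct - target_ct + offset


# Hypothetical C_t values for one target and two reference genes.
level_a = reference_normalized(28.0, [24.0, 26.0])
# One cycle earlier for the same references: one unit higher.
level_b = reference_normalized(27.0, [24.0, 26.0])
```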

[0069] Real time PCR is compatible both with quantitative competitive PCR, where an internal competitor for each target sequence is used for normalization, and with quantitative comparative PCR using a normalization gene contained within the sample, or a housekeeping gene for RT-PCR. For further details see, e.g., Heid et al., Genome Research 6:986-994 (1996).

Design of PCR Primers and Probes

[0070] PCR primers and probes can be designed based upon exon, intron, or intergenic sequences present in the RNA transcript of interest. Primer/probe design can be performed using publicly available software, such as the DNA BLAT software developed by Kent, W.J., Genome Res. 12(4):656-64 (2002), or by the BLAST software including its variations.

[0071] Where necessary or desired, repetitive sequences of the target sequence can be masked to mitigate non-specific signals. Exemplary tools to accomplish this include the Repeat Masker program available on-line through the Baylor College of Medicine, which screens DNA sequences against a library of repetitive elements and returns a query sequence in which the repetitive elements are masked. The masked sequences can then be used to design primer and probe sequences using any commercially or otherwise publicly available primer/probe design packages, such as Primer Express (Applied Biosystems); MGB assay-by-design (Applied Biosystems); and Primer3 (Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386).

[0072] Other factors that can influence PCR primer design include primer length, melting temperature (Tm), G/C content, specificity, complementary primer sequences, and 3'-end sequence. In general, optimal PCR primers are 17-30 bases in length, contain about 20-80%, such as, for example, about 50-60% G+C bases, and exhibit Tm's between 50 and 80 °C, e.g. about 50 to 70 °C.
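
The length, G+C content, and Tm guidelines above can be checked with a short script. The sketch below is illustrative only. It estimates Tm with the Wallace rule (2 °C per A or T, 4 °C per G or C), a rough approximation intended for short oligonucleotides that is swapped in here for simplicity and is not the Tm method of any package named above. The function names and the example primer sequence are hypothetical.

```python
def primer_summary(sequence):
    """Return length, G+C percentage, and a Wallace-rule Tm estimate for a primer."""
    seq = sequence.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    gc = seq.count("G") + seq.count("C")
    at = len(seq) - gc
    return {
        "length": len(seq),
        "gc_percent": 100.0 * gc / len(seq),
        # Wallace rule: rough estimate, assumption of this sketch
        "tm_wallace_c": 2 * at + 4 * gc,
    }


def within_guidelines(summary):
    """Check the broad ranges stated above: 17-30 bases, 20-80% G+C, Tm 50-80 °C."""
    return (
        17 <= summary["length"] <= 30
        and 20.0 <= summary["gc_percent"] <= 80.0
        and 50 <= summary["tm_wallace_c"] <= 80
    )


# Arbitrary 20-base sequence used only to exercise the functions.
example_primer = primer_summary("AGCGTACCTGATCGGATACC")
example_ok = within_guidelines(example_primer)
```

Dedicated design packages additionally weigh specificity, primer-primer complementarity, and 3'-end sequence, which this sketch does not attempt.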

[0073] For further guidelines for PCR primer and probe design see, e.g., Dieffenbach, C.W. et al., "General Concepts for PCR Primer Design" in: PCR Primer, A Laboratory Manual, Cold Spring Harbor Laboratory Press, New York, 1995, pp. 133-155; Innis and Gelfand, "Optimization of PCRs" in: PCR Protocols, A Guide to Methods and Applications, CRC Press, London, 1994, pp. 5-11; and Plasterer, T.N. Primerselect: Primer and probe design. Methods Mol. Biol. 70:520-527 (1997), the entire disclosures of which are hereby expressly incorporated by reference.

MassARRAY® System

[0074] In MassARRAY-based methods, such as the exemplary method developed by Sequenom, Inc. (San Diego, CA), following the isolation of RNA and reverse transcription, the obtained cDNA is spiked with a synthetic DNA molecule (competitor), which matches the targeted cDNA region in all positions, except a single base, and serves as an internal standard. The cDNA/competitor mixture is PCR amplified and is subjected to a post-PCR shrimp alkaline phosphatase (SAP) enzyme treatment, which results in the dephosphorylation of the remaining nucleotides. After inactivation of the alkaline phosphatase, the PCR products from the competitor and cDNA are subjected to primer extension, which generates distinct mass signals for the competitor- and cDNA-derived PCR products. After purification, these products are dispensed on a chip array, which is pre-loaded with components needed for analysis by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). The cDNA present in the reaction is then quantified by analyzing the ratios of the peak areas in the mass spectrum generated. For further details see, e.g., Ding and Cantor, Proc. Natl. Acad. Sci. USA 100:3059-3064 (2003).
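
The peak-area ratio calculation can be sketched as follows. This minimal example assumes that the competitor and the cDNA amplify and extend with equal efficiency, so that the ratio of peak areas equals the ratio of starting amounts. The function name, the units, and the numeric values are hypothetical.

```python
def cdna_amount(peak_area_cdna, peak_area_competitor, competitor_amount):
    """Estimate starting cDNA amount from mass-spectrum peak areas.

    Assumes equal amplification and extension efficiency for the cDNA and the
    spiked competitor, so amounts scale with the peak-area ratio.
    """
    if peak_area_competitor <= 0:
        raise ValueError("competitor peak area must be positive")
    return competitor_amount * peak_area_cdna / peak_area_competitor


# Hypothetical values: 100 units of competitor spiked in; cDNA peak is twice as large.
estimated_cdna = cdna_amount(3000.0, 1500.0, 100.0)
```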

Other PCR-based Methods

[0075] Further PCR-based techniques that can find use in the methods disclosed herein include, for example, BeadArray® technology (Illumina, San Diego, CA; Oliphant et al., Discovery of Markers for Disease (Supplement to Biotechniques), June 2002; Ferguson et al., Analytical Chemistry 72:5618 (2000)); BeadsArray for Detection of Gene Expression® (BADGE), using the commercially available Luminex100 LabMAP® system and multiple color-coded microspheres (Luminex Corp., Austin, TX) in a rapid assay for gene expression (Yang et al., Genome Res. 11:1888-1898 (2001)); and high coverage expression profiling (HiCEP) analysis (Fukumura et al., Nucl. Acids. Res. 31(16) e94 (2003)).

Microarrays

[0076] In this method, polynucleotide sequences of interest (including cDNAs and oligonucleotides) are arrayed on a substrate. The arrayed sequences are then contacted under conditions suitable for specific hybridization with detectably labeled cDNA generated from RNA of a sample. The source of RNA typically is total RNA isolated from a tumor sample, and optionally from normal tissue of the same patient as an internal control or cell lines. RNA can be extracted, for example, from frozen or archived paraffin-embedded and fixed (e.g. formalin-fixed) tissue samples.

[0077] For example, PCR amplified inserts of cDNA clones of a gene to be assayed are applied to a substrate in a dense array. Usually at least 10,000 nucleotide sequences are applied to the substrate. For example, the microarrayed genes, immobilized on the microchip at 10,000 elements each, are suitable for hybridization under stringent conditions. Fluorescently labeled cDNA probes may be generated through incorporation of fluorescent nucleotides by reverse transcription of RNA extracted from tissues of interest. Labeled cDNA probes applied to the chip hybridize with specificity to each spot of DNA on the array. After washing under stringent conditions to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed element allows for assessment of corresponding mRNA abundance.

[0078] With dual color fluorescence, separately labeled cDNA probes generated from two sources of RNA are hybridized pairwise to the array. The relative abundance of the transcripts from the two sources corresponding to each specified gene is thus determined simultaneously. The miniaturized scale of the hybridization affords a convenient and rapid evaluation of the expression pattern for large numbers of genes. Such methods have been shown to have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels (Schena et al., Proc. Natl. Acad. Sci. USA 93(2):106-149 (1996)). Microarray analysis can be performed on commercially available equipment, following the manufacturer's protocols, such as by using the Affymetrix GeneChip® technology, or Incyte's microarray technology.
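
For dual color data, relative abundance is commonly summarized as a log2 ratio of the two channel intensities, so that a two-fold difference corresponds to one unit. The sketch below illustrates this under the assumption that background-corrected intensities are already available for each gene. The function name, gene names, and intensity values are hypothetical.

```python
import math


def log2_ratios(channel_a, channel_b):
    """Per-gene log2(channel_a / channel_b) for genes with positive signal in both channels."""
    ratios = {}
    for gene, intensity_a in channel_a.items():
        intensity_b = channel_b.get(gene)
        # Skip genes missing from either channel or with non-positive intensity
        if intensity_b is None or intensity_a <= 0 or intensity_b <= 0:
            continue
        ratios[gene] = math.log2(intensity_a / intensity_b)
    return ratios


# Hypothetical intensities for two RNA sources hybridized to the same array.
tumor_channel = {"gene1": 800.0, "gene2": 100.0}
reference_channel = {"gene1": 200.0, "gene2": 400.0}
example_ratios = log2_ratios(tumor_channel, reference_channel)
```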

Isolating RNA from Body Fluids

[0079] Methods of isolating RNA for expression analysis from blood, plasma and serum (see, for example, Tsui NB et al. (2002) Clin. Chem. 48, 1647-53 and references cited therein) and from urine (see, for example, Boom R et al. (1990) J Clin Microbiol. 28, 495-503 and references cited therein) have been described.

Immunohistochemistry

[0080] Immunohistochemistry methods are also suitable for detecting the expression levels of genes and can be applied to the method disclosed herein. Antibodies (e.g., monoclonal antibodies) that specifically bind a gene product of a gene of interest can be used in such methods. The antibodies can be detected by direct labeling of the antibodies themselves, for example, with radioactive labels, fluorescent labels, hapten labels such as biotin, or an enzyme such as horseradish peroxidase or alkaline phosphatase. Alternatively, unlabeled primary antibody can be used in conjunction with a labeled secondary antibody specific for the primary antibody. Immunohistochemistry protocols and kits are well known in the art and are commercially available.

Proteomics

[0081] The term "proteome" is defined as the totality of the proteins present in a sample (e.g. tissue, organism, or cell culture) at a certain point of time. Proteomics includes, among other things, study of the global changes of protein expression in a sample (also referred to as "expression proteomics"). Proteomics typically includes the following steps: (1) separation of individual proteins in a sample by 2-D gel electrophoresis (2-D PAGE); (2) identification of the individual proteins recovered from the gel, e.g. by mass spectrometry or N-terminal sequencing; and (3) analysis of the data using bioinformatics.

GENERAL DESCRIPTION OF THE RNA ISOLATION AND PREPARATION FROM FIXED, PARAFFIN-EMBEDDED SAMPLES FOR WHOLE TRANSCRIPTOME SEQUENCING

[0082] The steps of a representative protocol for profiling gene expression levels using fixed, paraffin-embedded tissues as the RNA source are provided in various published journal articles. (See, e.g., T.E. Godfrey et al., J. Molec. Diagnostics 2: 84-91 (2000); K. Specht et al., Am. J. Pathol. 158: 419-29 (2001); M. Cronin, et al., Am J Pathol 164:35-42 (2004)). Modified methods can be used for whole transcriptome sequencing as described in the Examples section. Briefly, a representative process starts with cutting a tissue sample section (e.g. about 10 μm thick sections of a paraffin-embedded tumor tissue sample). The RNA is then extracted, and ribosomal RNA may be depleted as described in U.S. Pub. No. 2011/0111409. cDNA sequencing libraries may be prepared that are directional and single or paired-end using commercially available kits such as the ScriptSeq™ mRNA-Seq Library Preparation Kit (Epicenter Biotechnologies, Madison, WI). The libraries may also be barcoded for multiplex sequencing using commercially available barcode primers such as the RNA-Seq Barcode Primers from Epicenter Biotechnologies (Madison, WI). PCR is then carried out to generate the second strand of cDNA, to incorporate the barcodes, and to amplify the libraries. After the libraries are quantified, the sequencing libraries may be sequenced as described herein.

COEXPRESSION ANALYSIS

[0083] To perform particular biological processes, genes often work together in a concerted way, i.e. they are co-expressed. Co-expressed gene networks identified for a disease process like cancer can also serve as prognostic biomarkers. Such co-expressed genes can be assayed in lieu of, or in addition to, assaying the biomarker with which they co-express.

[0084] One skilled in the art will recognize that many co-expression analysis methods now known or later developed will fall within the scope and spirit of the present invention. These methods may incorporate, for example, correlation coefficients, co-expression network analysis, clique analysis, etc., and may be based on expression data from RT-PCR, microarrays, sequencing, and other similar technologies. For example, gene expression clusters can be identified using pair-wise analysis of correlation based on Pearson or Spearman correlation coefficients. (See, e.g., Pearson K. and Lee A., Biometrika 2:357 (1902); C. Spearman, Amer. J. Psychol. 15:72-101 (1904); J. Myers, A. Well, Research Design and Statistical Analysis, p. 508 (2nd Ed., 2003).) In general, a correlation coefficient of equal to or greater than 0.3 is considered to be statistically significant in a sample size of at least 20. (See, e.g., G. Norman, D. Streiner, Biostatistics: The Bare Essentials, 137-138 (3rd Ed. 2007).)
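
A pair-wise correlation analysis of the kind described above can be sketched as follows. The example computes Pearson coefficients directly and Spearman coefficients as the Pearson coefficient of average ranks, then lists gene pairs whose Pearson coefficient meets the 0.3 threshold. All function names and expression values are hypothetical. The toy data use only four samples for brevity, whereas the threshold above is stated for a sample size of at least 20.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpressed_pairs(expression, threshold=0.3):
    """Gene pairs whose Pearson coefficient is at or above the threshold."""
    genes = sorted(expression)
    pairs = []
    for index, gene_a in enumerate(genes):
        for gene_b in genes[index + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if r >= threshold:
                pairs.append((gene_a, gene_b, r))
    return pairs


# Hypothetical normalized levels for three genes across four samples.
example_expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
example_pairs = coexpressed_pairs(example_expression)
```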

REFERENCE NORMALIZATION

[0085] In order to minimize expression measurement variations due to non-biological variations in samples, e.g., the amount and quality of product to be measured, the level of an RNA transcript or its expression product may be normalized relative to the mean levels obtained for one or more reference RNA transcripts or their expression products. Examples of reference RNA transcripts or expression products include housekeeping genes, such as GAPDH. Alternatively, all of the assayed RNA transcripts or expression products, or a subset thereof, may also serve as reference. On a transcript-by-transcript (or protein-by-protein) basis, the measured normalized amount of a patient tumor RNA or protein may be compared to the amount found in a cancer tissue reference set. See e.g., Cronin, M. et al., Am. Soc. Investigative Pathology 164:35-42 (2004). The normalization may be carried out such that a one unit increase in normalized level of an RNA transcript or expression product generally reflects a 2-fold increase in quantity present in the sample.

KITS OF THE INVENTION

[0086] The materials for use in the methods of the present invention are suited for preparation of kits produced in accordance with well known procedures. The present invention thus provides kits comprising agents, which may include primers and/or probes, for quantitating the level of the disclosed RNA transcripts or their expression products via methods such as whole transcriptome sequencing or RT-PCR for predicting prognostic outcome. Such kits may optionally contain reagents for the extraction of RNA from tumor samples, in particular, fixed paraffin-embedded tissue samples and/or reagents for whole transcriptome sequencing. In addition, the kits may optionally comprise the reagent(s) with an identifying description or label or instructions relating to their use in the methods of the present invention. The kits may comprise containers (including microtiter plates suitable for use in an automated implementation of the method), each with one or more of the various reagents (typically in concentrated form) utilized in the methods, including, for example, pre-fabricated microarrays, buffers, the appropriate nucleotide triphosphates (e.g., dATP, dCTP, dGTP and dTTP; or rATP, rCTP, rGTP and UTP), reverse transcriptase, DNA polymerase, RNA polymerase, and one or more probes and primers of the present invention (e.g., appropriate length poly(T) or random primers linked to a promoter reactive with the RNA polymerase). Mathematical algorithms used to estimate or quantify prognostic information are also potential components of kits.

REPORTS

[0087] The methods of this invention are suited for the preparation of reports summarizing the predictions resulting from the methods of the present invention. A "report" as described herein, is an electronic or tangible document that includes elements that provide information of interest relating to a likelihood assessment and its results. A subject report includes at least a likelihood assessment, e.g., an indication as to the likelihood that a cancer patient will exhibit long-term survival without breast cancer recurrence. A subject report can be completely or partially electronically generated, e.g., presented on an electronic display (e.g., computer monitor). A report can further include one or more of: 1) information regarding the testing facility; 2) service provider information; 3) patient data; 4) sample data; 5) an interpretive report, which can include various information including: a) indication; b) test data, where test data can include a normalized level of one or more RNA transcripts of interest, and 6) other features.

[0088] The present invention therefore provides methods of creating reports and the reports resulting therefrom. The report may include a summary of the levels of the RNA transcripts, or the expression products of such RNA transcripts, in the cells obtained from the patient's tumor sample. The report may include a prediction that the patient has an increased likelihood of long-term survival without breast cancer recurrence or the report may include a prediction that the subject has a decreased likelihood of long-term survival without breast cancer recurrence. The report may include a recommendation for a treatment modality such as surgery alone or surgery in combination with chemotherapy. The report may be presented in electronic format or on paper.

[0089] Thus, in some embodiments, the methods of the present invention further include generating a report that includes information regarding the patient's likelihood of long-term survival without breast cancer recurrence. For example, the methods of the present invention can further include a step of generating or outputting a report providing the results of a patient response likelihood assessment, which can be provided in the form of an electronic medium (e.g., an electronic display on a computer monitor), or in the form of a tangible medium (e.g., a report printed on paper or other tangible medium).

[0090] A report that includes information regarding the likelihood that a patient will exhibit long-term survival without breast cancer recurrence is provided to a user. An assessment as to the likelihood that a cancer patient will exhibit long-term survival without breast cancer recurrence is referred to as a "likelihood assessment." A person or entity who prepares a report ("report generator") may also perform the likelihood assessment. The report generator may also perform one or more of sample gathering, sample processing, and data generation, e.g., the report generator may also perform one or more of: a) sample gathering; b) sample processing; c) measuring a level of an RNA transcript or its expression product; d) measuring a level of a reference RNA transcript or its expression product; and e) determining a normalized level of an RNA transcript or its expression product. Alternatively, an entity other than the report generator can perform one or more of sample gathering, sample processing, and data generation.

[0091] The term "user" or "client" refers to a person or entity to whom a report is transmitted, and may be the same person or entity who does one or more of the following: a) collects a sample; b) processes a sample; c) provides a sample or a processed sample; and d) generates data for use in the likelihood assessment. In some cases, the person or entity who provides sample collection and/or sample processing and/or data generation, and the person who receives the results and/or report may be different persons, but are both referred to as "users" or "clients." In certain embodiments, e.g., where the methods are completely executed on a single computer, the user or client provides for data input and review of data output. A "user" can be a health professional (e.g., a clinician, a laboratory technician, a physician (e.g., an oncologist, surgeon, pathologist), etc.).

[0092] In embodiments where the user only executes a portion of the method, the individual who, after computerized data processing according to the methods of the invention, reviews data output (e.g., reviews results prior to release to provide a complete report, or reviews an "incomplete" report and provides for manual intervention and completion of an interpretive report) is referred to herein as a "reviewer." The reviewer may be located at a location remote to the user (e.g., at a service provided separate from a healthcare facility where a user may be located).

[0093] Where government regulations or other restrictions apply (e.g., requirements by health, malpractice, or liability insurance), all results, whether generated wholly or partially electronically, are subjected to a quality control routine prior to release to the user.

COMPUTER-BASED SYSTEMS AND METHODS

[0094] The methods and systems described herein can be implemented in numerous ways. In one embodiment of the invention, the methods involve use of a communications infrastructure, for example, the internet. Several embodiments of the invention are discussed below. The present invention may also be implemented in various forms of hardware, software, firmware, processors, or a combination thereof. The methods and systems described herein can be implemented as a combination of hardware and software. The software can be implemented as an application program tangibly embodied on a program storage device, or different portions of the software implemented in the user's computing environment (e.g., as an applet) and on the reviewer's computing environment, where the reviewer may be located at a remote site (e.g., at a service provider's facility).

[0095] In an embodiment of the invention, during or after data input by the user, portions of the data processing can be performed in the user-side computing environment. For example, the user-side computing environment can be programmed to provide for defined test codes to denote a likelihood "score," where the score is transmitted as processed or partially processed responses to the reviewer's computing environment in the form of test code for subsequent execution of one or more algorithms to provide a result and/or generate a report in the reviewer's computing environment. The score can be a numerical score (representative of a numerical value) or a non-numerical score representative of a numerical value or range of numerical values (e.g., "A": representative of a 90-95% likelihood of a positive response; "High": representative of a greater than 50% chance of a positive response (or some other selected threshold of likelihood); "Low": representative of a less than 50% chance of a positive response (or some other selected threshold of likelihood), and the like).

[0096] As a computer system, the system generally includes a processor unit. The processor unit operates to receive information, which can include test data (e.g., level of an RNA transcript or its expression product; level of a reference RNA transcript or its expression product; normalized level of an RNA transcript or its expression product) and may also include other data such as patient data. This information received can be stored at least temporarily in a database, and data analyzed to generate a report as described above.

[0097] Part or all of the input and output data can also be sent electronically. Certain output data (e.g., reports) can be sent electronically or telephonically (e.g., by facsimile, using devices such as fax back). Exemplary output receiving devices can include a display element, a printer, a facsimile device and the like. Electronic forms of transmission and/or display can include email, interactive television, and the like. In an embodiment of the invention, all or a portion of the input data and/or output data (e.g., usually at least the final report) are maintained on a web server for access, preferably confidential access, with typical browsers. The data may be accessed or sent to health professionals as desired. The input and output data, including all or a portion of the final report, can be used to populate a patient's medical record that may exist in a confidential database at the healthcare facility.

[0098] The present invention also contemplates a computer-readable storage medium (e.g., CD-ROM, memory key, flash memory card, diskette, etc.) having stored thereon a program which, when executed in a computing environment, provides for implementation of algorithms to carry out all or a portion of the results of a likelihood assessment as described herein. Where the computer-readable medium contains a complete program for carrying out the methods described herein, the program includes program instructions for collecting, analyzing and generating output, and generally includes computer readable code devices for interacting with a user as described herein, processing that data in conjunction with analytical information, and generating unique printed or electronic media for that user.

[0099] Where the storage medium includes a program that provides for implementation of a portion of the methods described herein (e.g., the user-side aspect of the methods (e.g., data input, report receipt capabilities, etc.)), the program provides for transmission of data input by the user (e.g., via the internet, via an intranet, etc.) to a computing environment at a remote site. Processing or completion of processing of the data is carried out at the remote site to generate a report. After review of the report, and completion of any needed manual intervention, to provide a complete report, the complete report is then transmitted back to the user as an electronic document or printed document (e.g., fax or mailed paper report). The storage medium containing a program according to the invention can be packaged with instructions (e.g., for program installation, use, etc.) recorded on a suitable substrate or a web address where such instructions may be obtained. The computer-readable storage medium can also be provided in combination with one or more reagents for carrying out a likelihood assessment (e.g., primers, probes, arrays, or such other kit components).

[00100] Having described the invention, the same will be more readily understood through reference to the following Examples, which are provided by way of illustration, and are not intended to limit the invention in any way. All citations throughout the disclosure are hereby expressly incorporated by reference.

EXAMPLE 1

Materials and Methods

[00101] Patients

[00102] One hundred and thirty-six primary breast cancer FFPE tumor specimens with clinical outcomes were provided by Providence St. Joseph Medical Center (Burbank, CA), with institutional review board approval. The time to first recurrence of breast cancer or death due to breast cancer (including death due to unknown cause) was determined from these records. Patients who were still alive without breast cancer recurrence or who died due to known other causes were considered censored at the time of last follow-up or death. These tumor specimens were used for biomarker discovery in the development of the Oncotype DX^® assay. See e.g., U.S. Patent No. 7,081,340; S. Paik et al., The New England Journal of Medicine 351, 2817 (2004). For the present study, 136 specimens had adequate RNA remaining. Among the 136 patients, 26 experienced breast cancer recurrence or death due to breast cancer.

[00103] RNA-Seq Sample Preparation and Sequencing

[00104] Total RNA was prepared from three 10-μm-thick sections of FFPE tumor tissue as previously described using the MasterPure™ Purification Kit (Epicentre^® Biotechnologies, Madison, WI). M. Cronin et al., The American Journal of Pathology 164, 35 (Jan, 2004). One hundred nanograms of the isolated RNA were depleted of ribosomal RNA as described. See U.S. Pub. No. 2011/0111409. Sequencing libraries for whole transcriptome analysis were prepared using ScriptSeq™ mRNA-Seq Library Preparation Kits (Epicentre^® Biotechnologies, Madison, WI). During the cDNA synthesis step, additional incubation for 90 minutes at 37 °C was implemented in the reverse transcription step to increase library yield. After 3'-terminal tagging, the di-tagged cDNA was purified using MinElute^® PCR Purification Kits (Qiagen, Valencia, CA). Two 6 base index sequences were used to prepare barcoded libraries for duplex sequencing (RNA-Seq Barcode Primers; Epicentre® Biotechnologies, Madison, WI). PCR was carried out through 16 cycles to generate the second strand of cDNA, incorporate barcodes, and amplify libraries. The amplified libraries were size-selected by a solid phase reversible immobilization, paramagnetic bead-based process (Agencourt® AMPure® XP System; Beckman Coulter Genomics, Danvers, MA). Libraries were quantified by PicoGreen® assay (Life Technologies, Carlsbad, CA) and visualized with an Agilent Bioanalyzer using a DNA 1000 kit (Agilent Technologies, Waldbronn, Germany).

[00105] TruSeq™ SR Cluster Kits v2 (Illumina Inc.; San Diego, CA) were used for cluster generation in an Illumina cBOT™ instrument following the manufacturer's protocol. Two indexed libraries were loaded into each lane of flow cells. Sequencing was performed on an Illumina HiSeq®2000 instrument (Illumina, Inc.) by the manufacturer's protocol. Multiplexed single-read runs were carried out with a total of 57 cycles per run (including 7 cycles for the index sequences).

[00106] Data Quality Assessment

[00107] Each sequencing lane was duplexed with two patient sample libraries using a 6 base barcode to differentiate between them. The mean read ratio ± SD between the two samples in each lane was 1.05 ± 0.38 and the mean ± SD percentage of un-discerned barcodes was 2.08 ± 1.63. Using principal components analysis and other exploratory data analysis methods, no systematic differences were found among samples associated with flow cell or barcode.

[00108] In a run-in phase of the study, duplicate libraries were prepared for 8 samples selected at random from the study set of 136. RefSeq RNA coverage for these libraries ranged between 3.1M and 6.7M uniquely mapped reads. Log count Pearson correlations among duplicate libraries ranged between 0.947 and 0.985. Single libraries were prepared for the remaining 128 samples and distributed in duplex mode among the lanes of 8 flow-cells. Sequencing in 3 lanes failed. Two libraries had low yield, resulting in low coverage. Three lanes were flagged by various Illumina process monitoring indices: low Q30 (coverage = 2.8M and 4.2M), high cluster density (coverage = 1.6M and 1.8M), or inadequate imaging (coverage = 3.3M and 3.1M). For the remaining lanes, sample coverage ranged between 2.5M and 7.3M reads. New libraries for the samples that had low yield were prepared and sequenced. Libraries in the failed and flagged lanes, as well as some of the low coverage samples, were re-sequenced. Replicate correlations among all sequenced samples were very high, 0.985 for the samples with the high cluster density in the original run, and over 0.990 for all others. For the analysis data set, data for one of each of the duplicate libraries from the run-in experiment were kept. For the samples for which new libraries were prepared and for the samples in the failed and flagged lanes, the reads from the subsequent run were used. For the samples with low coverage for which the library was reprocessed, reads from the two runs were pooled. For the rest of the samples, the reads from the single lane were used. Results differed little when other data analysis procedures were used, for example, using only the second run when libraries were reprocessed.

[00109] Calculation of Approximate Recurrence Score

[00110] The 21-gene Oncotype DX^® Recurrence Score^® assay is a proven prognosticator for the risk of distant recurrence in early stage ER-positive breast cancer. An approximation of the Recurrence Score assay for each patient in the Providence cohort was calculated using data from the original 192-gene panel and, because not all of the 21 Recurrence Score genes were included in this panel, by substituting the expression of the same gene with a different assay design, or in one case (CTSL2), a closely related gene, from a subsequent rescreen of the same patient tumor samples that assessed different genes (Table 1A). Table 1B provides the accession no. of the genes used for calculating the Recurrence Score^® and the Approximate Recurrence Score. The Recurrence Score^® standard aggregation and reference normalization were then performed and the approximate Recurrence Score calculated using the algorithm described in Paik et al. (New England Journal of Medicine 351, 2817 (2004)), which is herein incorporated by reference. Nine samples were missing one or more of the five reference genes, and therefore an approximate Recurrence Score could not be computed for those samples. Thus, an approximate Recurrence Score was calculated for 127 patients.

[00111] For example, as described in Paik et al. (New England Journal of Medicine 351, 2817 (2004)), the Recurrence Score on a scale from 0 to 100 is derived from the reference-normalized expression measurements in four steps. First, expression for each gene is normalized relative to the expression of the five reference genes (ACTB [the gene encoding β-actin], GAPDH, GUS, RPLPO, and TFRC). Reference-normalized expression measurements range from 0 to 15, with a 1-unit increase reflecting approximately a doubling of RNA. Genes are grouped on the basis of function, correlated expression, or both. Second, the GRB7, ER, proliferation, and invasion group scores are calculated from individual gene-expression measurements, as follows: GRB7 group score = 0.9 × GRB7 + 0.1 × HER2 (if the result is less than 8, then the GRB7 group score is considered 8); ER group score = (0.8 × ER + 1.2 × PGR + BCL2 + SCUBE2) ÷ 4; proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, then the proliferation group score is considered 6.5); and invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2. Third, the unscaled recurrence score (RSU) is calculated with the use of coefficients that are predefined on the basis of regression analysis of gene expression and recurrence in the three training studies: RSU = +0.47 × GRB7 group score − 0.34 × ER group score + 1.04 × proliferation group score + 0.10 × invasion group score + 0.05 × CD68 − 0.08 × GSTM1 − 0.07 × BAG1. A plus sign indicates that increased expression is associated with an increased risk of recurrence, and a minus sign indicates that increased expression is associated with a decreased risk of recurrence. Fourth, the recurrence score (RS) is rescaled from the unscaled recurrence score, as follows: RS = 0 if RSU < 0; RS = 20 × (RSU − 6.7) if 0 ≤ RSU ≤ 100; and RS = 100 if RSU > 100.
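The group-score, unscaled-score, and rescaling steps described above can be sketched as follows. This is a minimal illustration, not the assay implementation: the function names and the dictionary of reference-normalized expression values are hypothetical, and the rescaling conditions are applied to RSU exactly as they are printed in the text.

```python
def group_scores(e):
    """Compute the four group scores from reference-normalized expression values."""
    # GRB7 group score, floored at 8 as stated in the text.
    grb7 = max(8.0, 0.9 * e["GRB7"] + 0.1 * e["HER2"])
    er = (0.8 * e["ER"] + 1.2 * e["PGR"] + e["BCL2"] + e["SCUBE2"]) / 4
    # Proliferation group score, floored at 6.5 as stated in the text.
    prolif = max(6.5, (e["Survivin"] + e["KI67"] + e["MYBL2"]
                       + e["CCNB1"] + e["STK15"]) / 5)
    invasion = (e["CTSL2"] + e["MMP11"]) / 2
    return grb7, er, prolif, invasion


def unscaled_score(e):
    """Unscaled recurrence score (RSU) using the coefficients quoted in the text."""
    grb7, er, prolif, invasion = group_scores(e)
    return (0.47 * grb7 - 0.34 * er + 1.04 * prolif + 0.10 * invasion
            + 0.05 * e["CD68"] - 0.08 * e["GSTM1"] - 0.07 * e["BAG1"])


def rescale(rsu):
    """Rescale RSU to RS, with the conditions transcribed literally from the text."""
    if rsu < 0:
        return 0.0
    if rsu > 100:
        return 100.0
    return 20.0 * (rsu - 6.7)


# Hypothetical patient: every gene has a reference-normalized value of 8.0.
genes = ["GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2", "Survivin", "KI67",
         "MYBL2", "CCNB1", "STK15", "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1"]
patient = {g: 8.0 for g in genes}

rsu = unscaled_score(patient)
print(round(rsu, 2))           # 9.36
print(round(rescale(rsu), 1))  # 53.2
```

With all inputs equal to 8.0 every group score is 8.0, so RSU reduces to 8 times the signed sum of the coefficients (1.17), which makes the example easy to check by hand.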

Gene       Accession No.
GSTM1      NM_000561
HER2       NM_004448
Ki-67      NM_002417
MYBL2      NM_002466
PR         NM_000926
STK15      NM_003600
STMY3      NM_005940
SURV       NM_001168
B-actin    NM_001101
GAPDH      NM_002046
GUS        NM_000181
RPLPO      NM_001002
TFRC       NM_003234

[00112] Bioinformatic Methods

[00113] With the exception noted below, all primary analysis of sequence data was performed in CASAVA 1.7, the standard data processing package from Illumina. Demultiplexing of sample indices was set with 1 mismatch tolerance to separate the two samples within each lane. Raw FASTQ sequences were trimmed from both ends before mapping to the human genome (UCSC release, version 19), to address 3' end adapter contamination and random RT primer artifacts, and 5' end terminal-tagging oligonucleotide artifacts. The libraries as prepared contain strand-of-origin (directional) sequence information. Annotated RNA counts (defined by refFlat.txt from UCSC) were calculated by CASAVA 1.7 both with and without consideration of strand-of-origin information. Although retained in the mapping process, CASAVA does not provide directional counts by default. These counts were obtained by splitting the mapped (export.txt) file into two parts, one with sense strand counts, the other with antisense strand counts, and processing them independently. Raw FASTQ sequence was mapped with Bowtie (B. Langmead et al., Genome Biology 10, R25 (2009)) in parallel with CASAVA to count ribosomal RNA transcripts.

[00114] Statistical Methods

[00115] Data were analyzed in 3 categories: first, RefSeq RNAs, about 80% of which are exon sequences, consolidated for each gene; second, intronic RNA sequences, consolidated for each gene; third, intergenic sequences. RNAs with maximum counts less than 5 among the 136 patients were excluded from analysis. Of 21,283 total RefSeq transcripts counted by CASAVA, 821 had a maximum count less than 5, leaving 20,462 RefSeq transcripts for analysis. Similar to a recently published procedure described by Bullard et al. (BMC Bioinformatics 11, 94 (2010)), log₂ raw RNA counts (setting the log₂ for a 0 count to 0) were normalized by subtracting the 3rd quartile of the log₂ RefSeq RNA counts and adding the cohort mean 3rd quartile ("Q3 normalization"). For normalization in the analysis of RefSeq and intergenic RNAs, RefSeq RNA data were used. For normalization in the analysis of intronic RNAs, intronic RNA data were used.
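The Q3 normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the input counts are hypothetical, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is chosen for the sketch because the text does not specify one.

```python
import math
from statistics import mean, quantiles


def log2_count(count):
    """log2 of a raw count, with the log2 of a 0 count set to 0 as in the text."""
    return math.log2(count) if count > 0 else 0.0


def third_quartile(values):
    """3rd quartile; the interpolation convention is an assumption of this sketch."""
    return quantiles(values, n=4, method="inclusive")[2]


def q3_normalize(counts_by_sample):
    """Subtract each sample's Q3 of log2 counts and add the cohort mean Q3."""
    logs = {s: [log2_count(c) for c in counts]
            for s, counts in counts_by_sample.items()}
    sample_q3 = {s: third_quartile(v) for s, v in logs.items()}
    cohort_mean_q3 = mean(sample_q3.values())
    return {s: [x - sample_q3[s] + cohort_mean_q3 for x in v]
            for s, v in logs.items()}


# Hypothetical counts: sample B was sequenced exactly twice as deeply as sample A.
counts = {"A": [2, 4, 8, 16, 32], "B": [4, 8, 16, 32, 64]}
normalized = q3_normalize(counts)
print(normalized["A"])
print(normalized["B"])
```

In this toy case the two samples differ only by sequencing depth, so their normalized profiles coincide, which is the effect the normalization is meant to achieve.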

[00116] A discovery analysis was run using the 127 patient samples with an Approximate Recurrence Score to identify RNA sequences that provided prognostic information over and above the prognostic information for breast cancer recurrence provided by the approximate Recurrence Score. For each of the exons, introns, or intergenic sequences, a multivariate Cox proportional hazards regression model (Journal of the Royal Statistical Society: Series B (Methodological) 34, 187 (1972)) was fit with effects for (1) approximate Recurrence Score and (2) the normalized gene expression of the individual RNA sequence. The robust standard error estimate of Lin and Wei (Journal of the American Statistical Association 84, 1074 (1989)) was used to accommodate possible departures from the assumptions of Cox regression, including nonlinearity of the relationship of RNA sequence expression with log hazard and nonproportional hazards. A Wald test p-value was computed for the additional contribution of each individual RNA sequence over and above the approximate Recurrence Score. False discovery rates (FDR, q-values) were assessed using the method of Storey (Journal of the Royal Statistical Society, Series B 64, 479 (2002)) with a "tuning parameter" of λ=0.5. RNA sequences with q-values (FDR) less than 20% were identified as adding prognostic value over and above the Recurrence Score assay.
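The q-value step can be illustrated with a short sketch of a Storey-type estimate at a fixed tuning parameter λ=0.5. The function name and the p-values are hypothetical, and this is a simplified version of the published method rather than the analysis code used in the study.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Return (pi0, q-values) for a list of p-values at a fixed tuning parameter."""
    m = len(pvalues)
    # Estimated proportion of true nulls: share of p-values above lam, rescaled.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return pi0, qvalues


# Hypothetical Wald test p-values for eight RNA sequences.
pvals = [0.001, 0.01, 0.02, 0.6, 0.9, 0.3, 0.7, 0.04]
pi0, qvals = storey_qvalues(pvals)

# Sequences with q-value below 20% would be flagged as adding prognostic value.
selected = [i for i, q in enumerate(qvals) if q < 0.20]
print(pi0)       # 0.75
print(selected)  # [0, 1, 2, 7]
```

The factor pi0 is what distinguishes this estimate from a plain Benjamini-Hochberg adjustment: when many p-values fall below λ, pi0 drops below 1 and the q-values shrink accordingly.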

EXAMPLE 2

Identification of RNA Sequences that Provide Prognostic Information Over and Above the Approximate Recurrence Score

[00117] The goal of performing the present analysis was to identify RefSeqs and non-coding RNAs that contribute to recurrence risk over and above the contribution of Recurrence Score (that is, provide orthogonal risk information). Data regarding the identified RefSeqs and non-coding RNAs that contribute to recurrence risk over and above the contribution of Recurrence Score are shown in Tables 2-7, as described below.

[00118] This analysis identified 115 RefSeqs as adding prognostic information over and above the approximate Recurrence Score (Tables 2 and 3). As shown in Table 3, 91 of these 115 RefSeqs were previously identified in the list of 1307 RefSeqs identified in univariate analyses (without adjustment for Recurrence Score) as being associated with recurrence risk at FDR 10% (PCT/US2012/063313). The 24 newly identified RefSeq RNAs are provided in Table 2.

[00119] Discovery analysis to identify intronic RNAs that add prognostic value over and above the approximate Recurrence Score was also performed. Working at an FDR of 20%, the analysis identified 88 intronic RNAs as adding prognostic value (Tables 4 and 5). Previously, the main analysis (not adjusted for approximate Recurrence Score) had identified 1698 intronic RNAs as being associated with recurrence risk at an FDR of 10% (PCT/US2012/063313). 74 of the 88 intronic RNAs identified in the analysis conditional on approximate Recurrence Score are on this list of 1689 (Table 5). The 14 newly identified introns are provided in Table 4.

[00120] Discovery analysis for 2101 intergenic sequences conditional on the approximate Recurrence Score was also performed. The analysis identified 5 intergenic sequences as adding prognostic information over and above Recurrence Score at an FDR of 20% (Tables 6 and 7). Table 6 provides two (2) of these 5 intergenic sequences that were not in the list of 194 intergenic sequences previously identified using univariate analyses (PCT/US2012/063313). Table 7 provides the three (3) intergenic sequences that were previously identified in PCT/US2012/063313.

[00121] All references cited throughout the disclosure, including the examples, are hereby expressly incorporated by reference for their entire disclosure.

[00122] While the present invention has been described with reference to what is considered to be specific embodiments, it is to be understood that the invention is not so limited. To the contrary, the invention is intended to cover various modifications and equivalents included within the spirit and scope of the appended claims.

Table 2: 24 Newly Identified RefSeq RNAs that Add Prognostic Information Over and Above the Approximate Recurrence Score

Table 3: 91 Previously Identified RefSeq RNAs that Add Prognostic Information Over and Above the Approximate Recurrence Score

Table 4: 14 Newly Identified Intronic RNAs that Add Prognostic Information Over and Above the Approximate Recurrence Score

Table 5: 74 Previously Identified Intronic RNAs that Add Prognostic Information Over and Above the Approximate Recurrence Score

10332297, 10332365-10335484, 10335562-10336377, 10336458-10338042, 10338187-10342456, 10342592-10351138, 10351220-10352103, 10352181-10355142, 10355224-10355716, 10355825-10356638, 10356724-10356953, 10357136-10357230, 10357305-10363219

26b KIF1B NM_015074 chr1 10270937-10292306, 10292493-10316303, 10316382-10318549, 10318731-10321961, 10322029-10327436, 10327617-10328208, 10328322-10331558, 10331638-10332297, 10332365-10335484, 10335562-10336377, 10336458-10338042, 10338187-10342456, 10342592-10351138, 10351220-10352103, 10352181-10355142, 10355224-10355716, 10355825-10356638, 10356724-10356953, 10357136-10357230, 10357305-10380099, 10380195-10381765, 10381916-10383940, 10384121-10384814, 10384954-10386167, 10386418-10394576, 10394697-10396713, 10396801-10397130, 10397262-10397427, 10397592-10399825, 10399918-10402106, 10402227-10403288, 10403346-10405901, 10406012-10407817, 10407886-10408705, 10408792-10412687, 10412795-10420985, 10421102-10421748, 10421884-10423339, 10423403-10425156, 10425304-10425465, 10425707-10428523, 10428597-10431197, 10431321-10434372, 10434524-10434910, 10435105-10435311, 10435432-10436601 (+) 0.000202 2.25 (1.47, 3.46) 7.80%

27 KRT27 NM_181537 chr17 38933391-38933503, 38933555-38933765, 38933985-38935752, 38935880-38935950, 38936114-38936550, 38936709-38937438, 38937523-38938300 (+) 0.000315 1.45 (1.18, 1.77) 9.69%

28 LALBA NM_002289 chr12 48961801-48962298, 48962376-48962863, 48963024-48963669 (+) 6.60E-06 1.68 (1.34, 2.10) 0.67%

29 LCE1C NM_178351 chr1 152777975-152779075 (+) 1.63E-11 1.59 (1.39, 1.82) <0.001%

30 LCE3E NM_178435 chr1 152538706-152539212 (+) 3.55E-07 1.43 (1.25, 1.65) 0.11%

31 LINGO1 NM_032808 chr15 77908243-77924650 (+) 0.000747 1.96 (1.33, 2.90) 16.74%

Table 6: Two Newly Identified Intergenic Sequences that Add Prognostic Information Over and Above the Approximate Recurrence Score at an FDR of 20%

Table 7: Three Previously Identified Intergenic Sequences that Add Prognostic Information Over and Above the Approximate Recurrence Score at an FDR of 20%

Claims

WHAT IS CLAIMED IS:

1. A method of predicting a likelihood of long-term survival without recurrence of breast cancer in a breast cancer patient, comprising:

determining a level of one or more breast cancer prognostic biomarkers in a breast cancer tumor sample obtained from the patient, wherein the one or more breast cancer prognostic biomarkers is selected from:

(a) one or more RNA transcripts, or expression products thereof, selected from Table 2,

(b) one or more intronic RNAs selected from Table 4,

(c) one or more intergenic sequences selected from Table 6,

normalizing the level of the one or more breast cancer prognostic biomarkers to obtain a normalized level of the one or more breast cancer prognostic biomarkers; and

predicting a likelihood of long-term survival without recurrence of breast cancer of said patient, wherein an increased normalized level of the one or more breast cancer prognostic biomarkers is negatively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the one or more breast cancer prognostic biomarkers is marked "+" in Tables 2, 4, or 6, and wherein an increased normalized level of the one or more breast cancer prognostic biomarkers is positively correlated with an increased likelihood of long-term survival without recurrence of breast cancer if the direction of association of the one or more breast cancer prognostic biomarkers is marked "-" in Tables 2, 4, or 6.

2. The method of claim 1, wherein the breast cancer patient is an estrogen receptor (ER)-positive breast cancer patient.

3. The method of claim 1 or claim 2, wherein the level is determined by whole transcriptome sequencing.

4. The method of claim 1 or claim 2, wherein the level is determined by reverse transcription polymerase chain reaction (RT-PCR).

5. The method of claim 1 or claim 2, wherein the level is determined by microarray.

6. The method of any one of claims 1-5, wherein the breast cancer tumor sample is a fixed, wax-embedded tissue sample.

7. The method of any one of claims 1-5, wherein the breast cancer tumor sample is a fine needle biopsy sample.

8. The method of any one of claims 1-5, wherein the breast cancer tumor sample is a bodily fluid sample that contains cancer cells.

9. The method of any one of claims 1-8, further comprising creating a report based on the level of the one or more RNA transcripts, or an expression product thereof.