WO2013142939A1

WO2013142939A1 - Methods for predicting and classifying event outcomes

Info

Publication number: WO2013142939A1
Application number: PCT/BR2013/000102
Authority: WO
Inventors: Ricardo Renzo Brentani; Renato David PUGA
Original assignee: Fundação Antônio Prudente; Supremum Assessoria E Consultoria Ltda.
Priority date: 2012-03-30
Filing date: 2013-04-01
Publication date: 2013-10-03
Also published as: BR102012007246A2

Abstract

The present invention provides predictive methods of classifying a latent biological sample according to a more or less probable phenotypic outcome, predictive methods of classifying an alternative disease state according to a more or less favorable prognosis, and predictive methods of classifying Gleason 7 stage prostate cancer according to a more or less probable recidivism. The predictive methods include steps of: obtaining a primary data collection; statistically generating a small, high-resolution classifier from the primary data; and using the classifier in a clinical setting to classify according to a more or less probable phenotypic outcome or a more or less favorable prognosis.

Description

METHODS FOR PREDICTING AND CLASSIFYING EVENT OUTCOMES

FIELD OF THE INVENTION

[0001] The present invention relates to mathematical methods that may be applied to latent biological and molecular information to generate a predictive model for classifying event outcomes. Compositions of logistic-normal densities are applied to transcriptome, genome, and/or proteome information obtained from biologically distinct groups of patients to identify differentially expressed genes in digital gene expression profiles. The most differentially expressed genes are selected into a small, high-quality classifier, and resolution of the classifier is increased by integrating the group probability densities of the most differentially expressed genes into a pair of marginal, multivariate probability densities and corresponding log-odds vectors according to two outcomes. The marginal, multivariate probability densities are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof. The small, high-quality classifier may be used in a clinical setting to provide a high-resolution, differential diagnostic of disease.

BACKGROUND OF THE INVENTION

[0002] Prostate cancer (PCa) is the most common non-dermatological cancer in males worldwide. The two most widely accepted prognostic factors for prostate cancer are preoperative serum prostate specific antigen (PSA) levels and the cancerous cell differentiation Gleason score evaluated at biopsy. PSA levels above 10 ng/ml and a Gleason score of 8 or higher are thought to indicate a poor prognosis of disease outcome. The widespread adoption of screening based upon PSA levels, for example, has led to the earlier detection and diagnosis of prostate cancer, with most cases appearing confined to the prostate gland at presentation.

[0003] However, while such early diagnosis based upon PSA and Gleason score parameters provides an opportunity to cure men with organ-confined disease, up to 30% of men undergoing radical prostatectomy as primary therapy for such tumors will ultimately relapse, presumably as a result of latent, micro-metastatic disease present at the time of surgery. Further, these parameters are not prognostic in a significant fraction of the patients, especially when the Gleason score is an intermediate grade (i.e., a score of 7), evidencing the need for further markers for improving the efficacy of therapeutical interventions.

[0004] The Gleason scoring system is based upon microscopic tumor patterns that are assessed by the pathologist in a prostate biopsy. The pathologist examines the specimen and attempts to give two scores: a primary grade representing the visible majority of the tumor specimen, and a secondary grade relating to the minority of the visible tumor pattern. These scores are then added to obtain the final Gleason score. Thus, this classification system based upon biological specimens is not an exact science, is clearly subjective by nature, and the outcomes for a single biopsy specimen read by two different pathologists may differ.

[0005] Therefore a critical issue in the care of men with prostate cancer is to improve the risk stratification of patients with intermediate risk disease. While serum PSA levels and Gleason scores remain the most important variables with which to predict disease behavior and may successfully distinguish between men at low, intermediate, and high risk for tumor recurrence following local therapy, these measures are less successful in helping guide therapy for the majority of men falling into the intermediate risk group.

[0006] The advent of cDNA and oligonucleotide array methodology led to the generation of differential gene expression profiles, which allowed better discrimination between a good and a bad prognosis. A recent example was provided by a study showing that the androgen response, which depends not only on the expression of genes that contain an androgen responsive element (ARE) in their promoter region (REF) but also on that of ARE- genes regulated by the former, as represented by a set of 142 genes, can separate PCa tumor from normal tissues. The prediction of lethal PCa disease has also been recently improved by a 157-gene signature.

[0007] Alternatively, the Human Cancer Genome project, which provides approximately 80% coverage of the transcriptome, has revealed that a significant fraction of transcripts are not translated into protein. Since each DNA locus gives rise to an average of six transcripts, only one of which is translated, by using microarrays of intronic material and by selecting, among others, prostate cancer candidate markers, a differential expression pattern between several distinct tumors and their normal tissue counterparts can be generated.

[0008] However, these kinds of analyses, though of great biological significance, are too cumbersome to be introduced in the urological clinical practice and therefore other approaches are required in order to refine prostate cancer management. Thus, a need remains for high-resolution, accurate, and efficient methods of classifying prostate cancer and predicting prostate cancer outcomes in a clinical setting.

SUMMARY OF THE INVENTION

[0010] To solve the limitations described above, and in order to provide a different and more sensitive prognostic approach, the present invention is based upon a completely different strategy utilizing mathematical and statistical methods of classification. In many instances, traditional classification of a sample from an individual into particular disease classes has often proven to be difficult, incorrect, or equivocal because the classification is dependent upon an individual and variable ability to visibly discern biological distinctions among cell or tissue samples. Further, in traditional methods such as histochemical analyses, immunophenotyping, and cytogenetic analyses, only one or two characteristics of the sample are analyzed to determine the sample's classification. The present invention, however, is a predictive method of classifying a biological sample according to a more or less probable phenotypic outcome based upon latent gene expression patterns resolved by synergistic, objective mathematical modeling.

[0011] Specific embodiments of the present invention are disclosed herein. However, it is to be understood that the disclosed embodiments are merely illustrative of the invention that may be embodied in various forms.

[0012] In one embodiment, a predictive method of classifying a latent biological sample according to a more or less probable phenotypic outcome includes: (a) selecting a phenotypic outcome; obtaining a primary molecular data collection from a subject population existing as latent classes A and B having a set of distinct molecular features, and a subject of latent class A exhibits the selected phenotypic outcome and a subject of latent class B does not;

(b) approximating a posterior distribution of molecular data frequencies for each subject in the population;

(c) constructing two independent compositional data frequencies for each molecule in the data collection according to latent classes A or B;

(d) comparing the compositional data frequencies of latent classes A and B for each molecule and selecting the most differentially resolving molecules into a small, high-quality classifier;

(e) increasing resolution of the small, high-quality classifier by integrating the compositional data frequencies of the most differentially resolving molecules into a pair of marginal, multivariate probability densities according to latent classes A or B;

(f) calculating a multivariate probability density for a latent biological sample according to the small, high-quality classifier; and

(g) classifying the biological sample according to a more or less probable phenotypic outcome by comparing the sample multivariate density to the pair of marginal densities.

[0013] In one embodiment, a predictive method of classifying an alternative disease state according to a more or less favorable prognosis includes: obtaining a primary data collection; statistically generating a small, high-resolution classifier from the primary data; and using the classifier in a clinical setting to provide a differential diagnostic of disease.

[0014] In one embodiment, a predictive method of classifying Gleason 7 stage prostate cancer according to a more or less probable recidivism includes:

(a) extracting mRNA from a Gleason 7 stage subject population existing as latent recidivist and non-recidivist classes A and B having a set of distinct molecular features, and generating cDNA libraries tagged by probes specific to individual patients;

(b) qualitatively sequencing the cDNA libraries as a high-throughput batch, and deconvolving the sequencing results based on the patient-specific tags;

(c) approximating a posterior distribution of molecular data frequencies for each subject, under a Jeffreys non-informative prior, by a logistic normal distribution;

(d) constructing two independent group probability densities for each gene in the data collection according to latent classes A or B;

(e) comparing the group probability densities of latent classes A and B for each gene, distributing each gene differentially or identically according to these probability densities, and selecting the most differentially resolving genes into a small, high-quality classifier;

(f) increasing resolution of the small, high-quality classifier by integrating the group probability densities of the most differentially resolving genes into a pair of marginal, multivariate probability densities and corresponding log-odds ratio vectors according to latent classes A or B;

(g) calculating a multivariate probability density for a next Gleason 7 stage patient according to the genes being members of the small, high-quality classifier;

(h) classifying in a clinical setting the next Gleason 7 stage patient according to a more or less probable recidivism by referring to the pairs of marginal group probability densities.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIGS. 1a-1d illustrate the discriminative capacity of singular and concerted expression levels.

[0016] FIG. 2 illustrates a 99% predictive value plateau beginning with a (6) six-gene set.

DETAILED DESCRIPTION

[0017] The present invention is described in detail below in connection with some embodiments for purposes of illustration only. Modifications to particular embodiments within the spirit and scope of the present invention, as set forth in the appended claims, will be readily apparent to those of ordinary skill in the art.

[0018] The present invention allows the use of its concepts with wide variations of event outcomes, molecular data, mathematical and statistical methods, and classifiers. There may also be variations in the methods of the industrial production of the classifiers, and their uses in clinical settings.

[0019] In general, the present invention relates to methods for classifying a biological sample with respect to a predicted phenotypic outcome according to the latent molecular data profile of the sample.

[0020] Phenotypic outcomes

[0021] As used herein, a "phenotypic outcome" according to the present invention refers to any observable biological characteristic or trait resulting from the expression of an organism's molecular data, the influence of environmental factors, and/or interactions between molecular data and the environment.

[0022] In one embodiment, a phenotypic outcome according to the present invention is a morphological, developmental, biochemical or physiological property; a behavior or a product of behavior; or combinations thereof. In one embodiment, the phenotypic outcome is a more or less favorable prognosis with respect to an alternative disease state. In one embodiment, the alternative disease state is intermediate stage cancer. In one embodiment, the alternative disease state is Gleason 7 stage prostate cancer. In one embodiment, the phenotypic outcome is a more or less probable recidivism.

[0023] Molecular data

[0024] As used herein, "molecular data" according to the present invention refers to any latent quantifiable characteristic expressed with respect to formation, structure, and/or function, of nucleic acids, peptides, and other macromolecules essential to life.

[0025] In one embodiment, molecular data according to the present invention is collected from a transcriptome, genome, proteome, or combinations thereof. In one embodiment, the molecular data is collected from an interactome. In one embodiment, the molecular data is mRNA. In one embodiment, the molecular data generates cDNA libraries. In one embodiment, the cDNA libraries are qualitatively sequenced as a high-throughput batch. In one embodiment, the molecular data is collected after a de-convolution of the high-throughput batch sequencing results.

[0026] Mathematical and statistical methods

[0027] As used herein, "mathematical" and "statistical" methods according to the present invention refer to any descriptive or inferential method of assessing data with respect to its measurement, properties, patterns, and/or relationships of quantities and sets, using numbers and symbols.

[0028] In one embodiment, mathematical and statistical methods according to the present invention include numeric and/or symbolic assessment of data to resolve probabilities of alternative outcomes. In one embodiment, the mathematical and statistical methods use data to update the uncertainties of competing probability models. In one embodiment, the mathematical and statistical methods determine model parameters, predict unknown variables, and/or perform model selection.

[0029] In one embodiment, mathematical and statistical methods according to the present invention require the formulation of a set of prior or posterior probability distributions for any unknown parameters. In one embodiment, the prior or posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence is taken into account. In one embodiment, the prior or posterior probability distribution is the distribution of an unknown quantity, treated as a random variable, conditional on the data collected.

[0030] In one embodiment, mathematical and statistical methods according to the present invention include assessment of molecular data collected from a population of subjects. In one embodiment, the methods include approximating a posterior distribution of molecular data frequencies for each subject. In one embodiment, the posterior distribution of molecular frequencies is approximated by a logistic normal distribution, logit, logits, probit, logistic function, logistic regression, log-odds, a logit-normal distribution, or equivalents thereof. In one embodiment, approximating a posterior distribution of molecular frequencies includes a Jeffreys non-informative prior. In one embodiment, the posterior distribution of molecular frequencies is approximated by a logistic normal distribution. In one embodiment, a log-odds ratio vector corresponding to the logistic normal distribution has an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions. In one embodiment, the posterior distribution of molecular data frequencies for each subject is further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof. In one embodiment, the posterior distribution of molecular data frequencies for each subject is a multivariate normal density.
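The role of the digamma and trigamma functions can be illustrated with a minimal sketch. It assumes, as one reading of the paragraph above and not as the only possible one, that a subject's read counts are multinomial, so that a Jeffreys prior Dirichlet(1/2, ..., 1/2) yields a Dirichlet posterior with parameters equal to the counts plus 1/2. Under that assumption the log-odds of each frequency against a reference category have means given by digamma differences and covariances given by trigamma values. The function names, the choice of the last category as reference, and the series approximations are illustrative assumptions.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def trigamma(x):
    """Trigamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def log_odds_moments(counts, reference=-1):
    """Posterior mean vector and covariance matrix of log(p_i / p_ref).

    Assumes a Dirichlet posterior with parameters counts + 1/2
    (multinomial counts under a Jeffreys prior); an illustrative assumption.
    """
    alpha = [c + 0.5 for c in counts]
    ref = reference % len(alpha)
    others = [i for i in range(len(alpha)) if i != ref]
    mean = [digamma(alpha[i]) - digamma(alpha[ref]) for i in others]
    shared = trigamma(alpha[ref])  # every pair shares the reference term
    cov = [[shared + (trigamma(alpha[i]) if i == j else 0.0) for j in others]
           for i in others]
    return mean, cov

# Toy counts for three categories; the last one is the reference.
mean, cov = log_odds_moments([10, 20, 70])
```

Because the reference category appears in every log-odds component, all off-diagonal covariances equal its trigamma value; larger counts shrink the variances, which is how deeper sequencing would translate into a sharper subject density under these assumptions.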

[0031] In one embodiment, following the calculation of each subject multivariate density, the mathematical and statistical methods of the present invention include construction of compositional data frequencies. In one embodiment, the compositional data frequencies are group probability densities, joint probability densities, multivariate densities, or equivalents thereof. In one embodiment, the compositional data frequency is a group probability density. In one embodiment, the log-odds ratio vector corresponding to a group probability density has an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions. In one embodiment, the compositional data frequencies are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

[0032] In one embodiment, the molecular data of the present invention includes subsets. In one embodiment, two independent compositional data frequencies are constructed for each subset in the molecular data. In one embodiment, each subset is distributed differentially or identically, according to the independent compositional data frequencies, determining its predictive value. In one embodiment, the predictive value is a probability near zero or one. In one embodiment, a probability near one means that the molecular data is more expressed in one subset relative to the other, and a probability near zero means that the molecular data is less expressed in one subset relative to the other.

[0033] In one embodiment, mathematical and statistical methods according to the present invention include a principal component analysis (PCA), a mathematical procedure that uses an orthogonal transformation to convert a data set of possibly correlated variables into a set of values of uncorrelated variables called principal components (A). The number of k principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has as high a variance as possible (i.e., accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to or uncorrelated with the preceding components. Principal components are guaranteed to be independent if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
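As an illustration of the principal component analysis described above, the following minimal sketch computes a sample covariance matrix and extracts the direction of highest variance by power iteration. Power iteration is a substitute technique chosen here so that the example needs only the standard library; the function names and the toy data are assumptions and are not taken from the present disclosure.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of equally long observation rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Power iteration: unit direction of highest variance and that variance."""
    d = len(cov)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance: no direction is preferred
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

# Toy data: the second variable is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variance = first_principal_component(covariance_matrix(rows))
```

In this toy case all of the variability lies along a single direction, so the first component carries the whole variance. Rescaling one variable (for instance, expressing it in different units) would change both the direction and the variance, which is the scaling sensitivity noted above.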

[0034] In one embodiment, the individual multivariate normal density for each subject is considered for a log-odds vector with respect to a principal component analysis. In one embodiment, the sum of the k last components subtracted from the sum of the k first components yields a score for each subject. In one embodiment, this score is a linear combination of normally distributed variables. In one embodiment, the score has a univariate normal distribution with known mean and variance.
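The statement that the score has a univariate normal distribution with known mean and variance follows from the general rule for linear combinations of jointly normal variables: for a weight vector w, the score w·y has mean w·μ and variance wᵀΣw. A minimal sketch follows, assuming the components are already ordered so that the first k and last k entries are the ones entering the score; the function name and the numbers are illustrative only.

```python
def score_distribution(mean, cov, k):
    """Mean and variance of (sum of first k) - (sum of last k) components.

    mean: mean vector of the subject's multivariate normal density.
    cov:  matching covariance matrix (list of lists).
    """
    d = len(mean)
    if k < 1 or 2 * k > d:
        raise ValueError("need 1 <= k and 2*k <= number of components")
    # +1 for the first k components, -1 for the last k, 0 in between.
    weights = [1.0] * k + [0.0] * (d - 2 * k) + [-1.0] * k
    score_mean = sum(w * m for w, m in zip(weights, mean))
    score_var = sum(weights[i] * cov[i][j] * weights[j]
                    for i in range(d) for j in range(d))
    return score_mean, score_var

# Toy example with two positively correlated components and k = 1.
m, v = score_distribution([0.3, -0.4], [[1.0, 0.2], [0.2, 1.0]], k=1)
```

Positive covariance between a component entering with weight +1 and one entering with weight -1 reduces the score variance, so the covariance terms cannot be dropped when the components are correlated.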

[0035] In one embodiment, following the construction of group probability densities for each molecular data subset, the same weighted average of score densities is taken within each group to obtain a pair of group score densities to which the score of a next subject is to be referred. In one embodiment, the mathematical and statistical method of obtaining a group score density follows from the definition of the weighting system that averages the multivariate densities of the subjects in the group. In one embodiment, the weight for each subject is the size of the subject's collected molecular data. In one embodiment, wherein the primary data collected is mRNA, the weight for each subject is the weight of the subject's cDNA library obtained in a gene-sequencing process. In one embodiment, the mathematical and statistical method of obtaining group score densities has connections to a meta-analysis that composes information provided by each data-generating subject. In one embodiment, the data-generating subject generates a weighted multivariate density used in the construction of weighted group probability densities and weighted group score densities.
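A weighted average of subject score densities can be sketched as a finite mixture in which each subject contributes a normal density weighted by the size of that subject's library. The tuple layout, the function names, and the numbers below are illustrative assumptions and are not data from the present disclosure.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def group_score_density(x, subjects):
    """Weighted average of subject score densities, evaluated at x.

    subjects: list of (library_size, score_mean, score_var) tuples;
    library_size acts as the averaging weight.
    """
    total = sum(size for size, _, _ in subjects)
    return sum(size * normal_pdf(x, mean, var)
               for size, mean, var in subjects) / total

# Toy group of three subjects with different library sizes.
group = [(1000, -0.5, 0.20), (3000, 0.1, 0.10), (500, 0.4, 0.30)]
density_at_zero = group_score_density(0.0, group)
```

Because the weights are normalized by their sum, the result is itself a probability density, and a subject with a larger library pulls the group density more strongly toward that subject's own score distribution.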

[0036] In one embodiment, the molecular data is genomic or nucleic acid data and the two independent group-densities allow one to compute the probability that a gene is differentially expressed between two alternative outcome probability subsets. In one embodiment, outcome probabilities are calculated for all genes considered and the genes are ordered according to these probability values. In one embodiment, genes having distinct expression between subsets possess probabilities near zero or one. In one embodiment, these genes will be the focus of attention in the classification or diagnosis of a next subject. In one embodiment, it is most desirable to have a score for the classification or diagnosis of a next subject that takes into account genes having extreme probabilities near zero or one.
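The ordering of genes by their probability of differential expression can be sketched as follows. As a simplification, each group density for a gene is summarized here by a single normal distribution with a mean and a variance, and the two groups are treated as independent, so the probability that the value in group A exceeds that in group B is Φ((μ_A − μ_B)/√(σ_A² + σ_B²)). The normal summary, the function names, and the gene labels are assumptions made for illustration.

```python
import math

def prob_higher_in_a(mean_a, var_a, mean_b, var_b):
    """P(value in group A > value in group B) for independent normals."""
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rank_genes(summaries, top):
    """Return the `top` genes whose probability is closest to zero or one.

    summaries: dict mapping gene label -> (mean_a, var_a, mean_b, var_b).
    """
    scored = {gene: prob_higher_in_a(*s) for gene, s in summaries.items()}
    ordered = sorted(scored.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
    return ordered[:top]

# Hypothetical gene labels and made-up summaries.
summaries = {
    "gene_1": (0.0, 1.0, 0.0, 1.0),   # no difference: probability 0.5
    "gene_2": (3.0, 0.5, 0.0, 0.5),   # higher in A: probability near one
    "gene_3": (-3.0, 0.5, 0.0, 0.5),  # lower in A: probability near zero
    "gene_4": (0.2, 1.0, 0.0, 1.0),   # weak difference
}
selected = rank_genes(summaries, top=2)
```

Ranking by distance from 0.5 treats probabilities near zero and near one symmetrically, which matches the idea that both over- and under-expressed genes are informative for the classification of a next subject.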

[0037] In one embodiment, the outcome probability subsets are characterized by a more or less probable prognosis. In one embodiment, the outcome probability subsets are characterized by recidivism or non-recidivism. In one embodiment, a gene is more or less expressed in a recidivist or non-recidivist subset relative to the other.

[0038] Classifiers

[0039] As used herein, a "classifier" refers to any small set of molecular data comprising the most differentially resolving subsets according to the present invention.

[0040] In one embodiment, the compositional data frequencies of latent molecular data subsets are compared, and the most differentially resolving subsets are selected into a small, high-quality classifier. In one embodiment, the resolution of the small, high-quality classifier is increased by integrating the compositional data frequencies of the most differentially resolving molecules into a pair of marginal, multivariate probability densities and corresponding log-odds vectors. In one embodiment, a multivariate probability density for a latent biological sample is calculated according to the small, high-quality classifier. In one embodiment, the biological sample is classified according to a more or less probable phenotypic outcome by comparing the sample multivariate density to the pair of marginal densities. In one embodiment, the marginal, multivariate probability densities are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof. In one embodiment, for a next subject, the latent molecular subsets being members of the classifier are compared with the pairs of marginal group probability densities and scores, and the subject is classified corresponding to a more or less probable phenotypic outcome.
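One way to refer a next subject's score to a pair of group densities is Bayes' rule: the posterior probability of class A is proportional to the prior probability of A times the class A density at the observed score. The sketch below assumes each group density is a weighted mixture of normal densities and uses equal prior probabilities by default; both choices, together with all names and numbers, are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def group_density(x, subjects):
    """Weighted mixture density; subjects = [(weight, mean, var), ...]."""
    total = sum(w for w, _, _ in subjects)
    return sum(w * normal_pdf(x, m, v) for w, m, v in subjects) / total

def classify(score, group_a, group_b, prior_a=0.5):
    """Return (label, posterior probability of class A) for a new score."""
    fa = group_density(score, group_a)
    fb = group_density(score, group_b)
    posterior_a = prior_a * fa / (prior_a * fa + (1.0 - prior_a) * fb)
    return ("A" if posterior_a >= 0.5 else "B"), posterior_a

# Toy groups: class A scores centre on +2, class B scores on -2.
group_a = [(1.0, 2.0, 1.0)]
group_b = [(1.0, -2.0, 1.0)]
label, posterior = classify(1.5, group_a, group_b)
```

Reporting the posterior probability alongside the label keeps the "more or less probable" character of the classification visible, rather than reducing it to a bare yes or no.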

[0041] In one embodiment, a small, high-resolution classifier is statistically generated from the primary data collected. In one embodiment, the classifier is used in a clinical setting to provide a differential diagnostic of disease. In one embodiment, for a next subject, the molecular data expression values for the genes being members of a classifier are compared with the pairs of marginal group probability densities and scores, and the subject is classified corresponding to a more or less favorable outcome or prognosis. In one embodiment, the small, high-resolution classifier is used in a clinical setting. In one embodiment, the classifier is used in a clinical setting to classify a next Gleason 7 stage patient according to a more or less probable recidivism by referring to the pairs of marginal group probability densities. In one embodiment, the small, high-quality classifier includes a single gene. In one embodiment, the classifier includes from about one to about twelve genes. In one embodiment, the predictive value of the classifier is about 99%.

[0042] Predictive methods

[0043] In one embodiment, the present invention relates to a predictive method of classifying a latent biological sample according to a more or less probable phenotypic outcome, including: (a) selecting a phenotypic outcome; (b) obtaining a primary molecular data collection from a subject population existing as latent classes A and B having a set of distinct molecular features, wherein a subject of latent class A exhibits the selected phenotypic outcome and a subject of latent class B does not; (c) approximating a posterior distribution of molecular data frequencies for each subject in the population; (d) constructing two independent compositional data frequencies for each molecule in the data collection according to latent classes A or B; (e) comparing the compositional data frequencies of latent classes A and B for each molecule and selecting the most differentially resolving molecules into a small, high-quality classifier; (f) increasing resolution of the small, high-quality classifier by integrating the compositional data frequencies of the most differentially resolving molecules into a pair of marginal, multivariate probability densities according to latent classes A or B; (g) calculating a multivariate probability density for a latent biological sample according to the small, high-quality classifier; and (h) classifying the biological sample according to a more or less probable phenotypic outcome by comparing the sample multivariate density to the pair of marginal densities.

[0044] In one embodiment, the present invention relates to a predictive method of classifying an alternative disease state according to a more or less favorable prognosis, including: (a) obtaining a primary data collection; (b) statistically generating a small, high-resolution classifier from the primary data; and (c) using the classifier in a clinical setting to provide a differential diagnostic of disease. In one embodiment, the alternative disease state exists as latent classes A and B corresponding to a more or less favorable prognosis and the latent classes exhibit a distinct set of molecular features. In one embodiment, the primary data collection includes an extraction of mRNA from two biologically distinct groups of patients corresponding to latent classes A and B, and a generation of cDNA libraries tagged by probes specific to individual patients. In one embodiment, the primary data collection further includes a qualitative sequencing of the cDNA libraries as a high-throughput batch, and a de-convolution of sequencing results based on the tags. In one embodiment, the high-throughput batch sequencing is a massively parallel signature sequencing, polony sequencing, parallelized pyrosequencing, reversible dye-terminator sequencing, ligation sequencing, ion semiconductor sequencing, DNA nanoball sequencing, single molecule sequencing, nanopore DNA sequencing, hybridization sequencing, microfluidic Sanger sequencing, or equivalents thereof. In one embodiment, group probability densities of latent classes A and B are compared for each gene, and the most differentially resolving genes are selected into a small, high-quality classifier. In one embodiment, the resolution of the small, high-quality classifier is increased by integrating the group probability densities of the most differentially resolving molecules into a pair of marginal, multivariate probability densities and corresponding log-odds vectors according to latent classes A or B. In one embodiment, the marginal, multivariate probability densities are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof. In one embodiment, for an individual patient, the cDNA expression values for the genes being members of the classifier are compared with the pairs of marginal group probability densities corresponding to a more or less favorable prognosis.

[0045] As described herein, gene expression patterns in two groups of patients - Gleason 7 recidivist, and Gleason 7 non-recidivist prostate cancer patients - were evaluated in order to determine if: (1) the calculation of independent group-densities allowed one to compute the probability that a gene is differentially expressed between the groups; and (2) a differential expression is associated with a predicted clinical outcome (either recidivism or non-recidivism). Genes having distinct expression between groups possess probabilities near zero or one, and are integrated into a small, high-quality classifier that will be used in the clinical diagnosis of the next Gleason 7 patient.

[0046] Referring particularly to the drawings, which show some embodiments of the present invention, there is illustrated in FIGS. 1a-1d the discriminative capacity of a single gene's expression level (FIG. 1a), or that of the concerted expression levels of four (FIG. 1b), six (FIG. 1c), and twelve genes (FIG. 1d). More specifically, FIG. 1a illustrates the discriminative power of a single gene (RPL35) expression level; FIG. 1b shows the combined expression levels of RPL35, RPS28, SRSF5, and LOC100293090; FIG. 1c shows RPL35, RPS28, C12orf57, PODXL, SRSF5, and LOC100293090; and FIG. 1d illustrates the concerted expression of RPL35, RPS28, C12orf57, NFKBIZ, RPS15, UBA52, PNN, MTRNR2L10, SLC25A4, PODXL, SRSF5, and LOC100293090.

[0047] The cut-off in FIGS. 1a-1d is defined as the value on the x-axis at which the two densities intersect. Taking the score and cut-off together, the sensitivity and specificity reached with a (6) six-gene set were 100% and 82%, respectively. This is a significant result even for a small sample (i.e., 21 patients).
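The cut-off and the resulting sensitivity and specificity can be illustrated with a minimal sketch. It locates the crossing point of two densities by bisection and then counts how many labeled scores fall on the expected side. The sketch assumes a single crossing between the two bracketing values and assumes that subjects positive for the outcome tend to have the lower scores; the densities, scores, and function names are made-up illustrations, not the data behind the figures.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def find_cutoff(density_low, density_high, lo, hi, tol=1e-9):
    """Bisection for the point in [lo, hi] where the two densities cross.

    density_low must dominate at lo and density_high must dominate at hi.
    """
    def diff(x):
        return density_low(x) - density_high(x)
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("densities do not bracket a crossing on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sensitivity_specificity(scores_pos, scores_neg, cutoff):
    """Positives are called when score < cutoff (an assumed convention)."""
    true_pos = sum(1 for s in scores_pos if s < cutoff)
    true_neg = sum(1 for s in scores_neg if s >= cutoff)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)

# Toy densities centred on -1 (positive class) and +1 (negative class).
cutoff = find_cutoff(lambda x: normal_pdf(x, -1.0, 1.0),
                     lambda x: normal_pdf(x, 1.0, 1.0), -1.0, 1.0)
sens, spec = sensitivity_specificity([-2.0, -1.0, -0.5, 0.3], [1.0, 2.0, -0.2], cutoff)
```

With equal variances the crossing lies midway between the two means; with unequal variances or mixture densities the densities may cross more than once, in which case the bracketing interval determines which crossing is returned.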

[0048] FIG. 2 illustrates the predictive values of the concerted expression levels of the best gene sets with 1, 2, 4, 6, 8, 10 and 12 genes. A (6) six-gene set indicates a 99% probability that the value of the compositional set in non-recurring (NR) patients is higher than in recurring (R) patients. Thus, a 99% predictive value of the concerted expression levels of several genes reaches a plateau with six genes (k = 3), and this group of six genes comprises a "compositional set" that may be used as or within a small, high-resolution classifier.

[0049] In one embodiment, the present invention relates to a predictive method of classifying Gleason 7 stage prostate cancer according to a more or less probable recidivism, including: (a) extracting mRNA from a Gleason 7 stage subject population existing as latent recidivist and non-recidivist classes A and B having a set of distinct molecular features, and generating cDNA libraries tagged by probes specific to individual patients; (b) qualitatively sequencing the cDNA libraries as a high-throughput batch, and de-convoluting the sequencing results based on the patient-specific tags; (c) approximating a posterior distribution of molecular data frequencies for each subject, under a Jeffreys non-informative prior, by a logistic normal distribution; (d) constructing two independent group probability densities for each gene in the data collection according to latent classes A or B; (e) comparing the group probability densities of latent classes A and B for each gene, distributing each gene differentially or identically according to these probability densities, and selecting the most differentially resolving genes into a small, high-quality classifier; (f) increasing resolution of the small, high-quality classifier by integrating the group probability densities of the most differentially resolving genes into a pair of marginal, multivariate probability densities and corresponding log-odds ratio vectors according to latent classes A or B; (g) calculating a multivariate probability density for a next Gleason 7 stage patient according to the genes being members of the small, high-quality classifier; and (h) classifying in a clinical setting the next Gleason 7 stage patient according to a more or less probable recidivism by referring to the pairs of marginal group probability densities.

[0050] In one embodiment, the genes having distinct expression between latent classes A and B possess probabilities near zero or one. In one embodiment, a probability near one means that the gene is more expressed in one latent class relative to the other and a probability near zero means that the gene is less expressed in one latent class relative to the other. In one embodiment, the small, high-quality classifier comprises a single gene. In one embodiment, the single gene is RPL35. In one embodiment, the single gene is SRSF5. In one embodiment, the classifier comprises from about one to about twelve genes. In one embodiment, the classifier comprises: RPL35, RPS28, SRSF5, and LOC100293090. In one embodiment, the classifier comprises: RPL35, RPS28, C12orf57, PODXL, SRSF5, and LOC100293090. In one embodiment, the classifier comprises: RPL35, RPS28, C12orf57, NFKBIZ, RPS15, UBA52, PNN, MTRNR2L10, SLC25A4, PODXL, SRSF5, and LOC100293090. In one embodiment, the predictive value of the classifier is about 99%. In one embodiment, the classifier is suitable for diagnostic use in a clinical setting.

EXAMPLES

[0051] Gleason 7 Prostate Cancer

[0052] Gleason 7 tumors display great morphological heterogeneity with different regions, or foci, presenting either a Gleason 3 or a higher grade Gleason pattern. The disease outcome depends on the proportion of the different Gleason patterns found in the patient's tumor.

[0053] Working under the assumption that the number of gene sequences is proportional to the concentration of gene transcripts in the total mRNA pool of a given sample, a method suitable for structural, quantitative, and qualitative assessment of complex transcriptomes was previously developed using a mixture of tagged cDNA libraries generated from immortalized and normal breast cell lines. This method, including high-throughput sequencing, allowed the identity of the specimen from which each sequence is originated to be obtained digitally, while a differential gene expression profile is also obtained for each cell line. High-throughput sequencing may provide a more robust quantification of gene expression, and thus a better differential expression profiling, than the array methodology alone.

[0054] To then search for prognosis markers that could distinguish between recurrent and non-recurrent Gleason score 7 prostate cancer patients, the above method was combined with a novel sophisticated statistical analysis.

[0055] Tagged cDNA libraries were prepared from laser-dissected samples obtained from twenty-one (21) Gleason 7 prostate cancer patients during surgery and submitted to high-throughput sequencing. A total of 868,554 sequences were obtained, complete with both 5' and 3' end primers and the six-nucleotide-long tags. After identifying the tags for each patient, two sets of patients were established: eleven (11) Gleason 7 patients with biochemical recurrence, and ten (10) Gleason 7 patients without biochemical recurrence, each patient having a set of gene frequencies. The gene frequency expression was registered for each patient, resulting in a total of 659,353 sequences representing 11,955 genes deposited in the RefSeq database.

[0056] Subsequently, for each patient under a Jeffreys non-informative prior, the posterior distribution of gene frequencies is approximated by a logistic normal distribution. Thus, the corresponding log-odds ratio vector has an asymptotic multivariate normal distribution with means and covariance matrix totally determined by using digamma and trigamma functions. Following the calculation of each patient's multivariate density, a pair of independent group densities for each of the genes was constructed. Taking the same average weight of individual score-densities within each group (the size of a library corresponds to its weight), the pair of group score densities to which the score of the next patient was to be referred was obtained. An operational definition of the score was, therefore, the point in the x-axis in which the densities intercepted. The choice of the k value is justified by the fact that as the value of k increases, the power of the score of the differential expression is also expected to increase. The mathematical and statistical method to obtain a group density thus follows from the definition of the weighing system that averages the densities of the patients in the group: the weight for each patient is the size of the patient's library obtained in the gene-sequencing process. Thus, this procedure has connections to a meta-analysis which composes information provided by each patient of the group.
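The library-size-weighted pooling described in this paragraph can be sketched as follows. This is a minimal illustration, assuming each patient's score density is summarized by a univariate normal; the function names and all numeric values are hypothetical and are not taken from the example data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def pooled_pdf(x, means, sds, library_sizes):
    """Linear pool: library-size-weighted average of per-patient normal densities."""
    total = float(sum(library_sizes))
    return sum(
        (size / total) * normal_pdf(x, m, s)
        for m, s, size in zip(means, sds, library_sizes)
    )

# Hypothetical values for three patients of one group (illustrative only).
means = [0.8, 1.1, 0.9]
sds = [0.30, 0.25, 0.40]
library_sizes = [30000, 45000, 25000]
density_at_1 = pooled_pdf(1.0, means, sds, library_sizes)
```

Because the weights sum to one, the pooled function remains a proper density, so a next patient's score can be referred to it directly.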

[0057] The two independent group-densities allow one to compute the probability that a gene is more expressed in the non-recurring group. Such probabilities for all genes considered were computed and the genes were ordered according to these probability values. Genes having distinct expression between the recidivist and non-recidivist groups possess probabilities near zero or one: a probability near one in the case of Gleason 7 prostate cancer means that the gene is more expressed in this group in relation to the other; and a probability near zero means that the gene is differentially less expressed. Therefore, instead of a one-by-one gene procedure to classify a patient, the present invention provides a small, high-resolution gene composition of most differentially expressed genes. These genes will be the focus of attention in the prognostic evaluation of the next Gleason 7 patient. Further, by taking a composition of the multivariate log-odds distribution, the present invention also takes into consideration the possible dependence of expression among the genes.
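The ordering step described above can be sketched as follows, assuming the per-gene probabilities have already been computed. The gene labels and probability values are placeholders invented for illustration, and the helper names are hypothetical.

```python
def rank_by_differential_expression(prob_by_gene):
    """Order genes by how far P(more expressed in one group) lies from 0.5."""
    return sorted(
        prob_by_gene,
        key=lambda gene: abs(prob_by_gene[gene] - 0.5),
        reverse=True,
    )

def split_top(prob_by_gene, k):
    """Return the k genes nearest probability 1 and the k genes nearest probability 0."""
    ordered = sorted(prob_by_gene, key=prob_by_gene.get, reverse=True)
    return ordered[:k], ordered[-k:]

# Placeholder labels and probabilities (not from the source data).
probs = {"gene_a": 0.97, "gene_b": 0.51, "gene_c": 0.02, "gene_d": 0.93, "gene_e": 0.08}
ranked = rank_by_differential_expression(probs)
over, under = split_top(probs, 2)
```

Genes with probabilities near 0.5, such as the placeholder `gene_b`, fall to the end of the ranking and would not enter the classifier.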

[0058] Considering again each individual patient in the sample and his multivariate normal density for the log-odds vector of the differentially expressed genes, the sum of the last k components (under-expressed) of the above ordered list of genes subtracted from the sum of the first k components (over-expressed) yielded a score for each individual patient. This score is a linear combination of normally distributed variables; thus, it has a univariate normal distribution with known mean and variance. The same weighted average of these score-densities within each group was calculated to obtain a pair of group score densities to which the score of the next patient is to be referred. The choice of the value of k remains to be justified: the differential expression power of the score is expected to increase as k increases. In the present example, with twelve genes (k = 6), the probability of the score in the non-recurring group being greater than its value in the recurring group is 0.9939. With six genes (k = 3), the probability is 0.9873. Note that, by taking a composition in the multivariate log-odds distribution, possible expression dependence among genes is also taken into account. Thus, the predictive value of the concerted expression levels of several genes reaches a plateau around 99% at six genes.
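The score of a single patient can be sketched as a contrast of the log-odds vector. The sketch below assumes the mean vector and covariance matrix of the ordered genes are already available; the function name and the numbers are hypothetical.

```python
def score_moments(mean_vec, cov, k):
    """Mean and variance of (sum of first k) - (sum of last k) log-odds components.

    mean_vec and cov describe the ordered genes: over-expressed first,
    under-expressed last. Components in between get coefficient zero.
    """
    n = len(mean_vec)
    coef = [1.0] * k + [0.0] * (n - 2 * k) + [-1.0] * k
    mean = sum(c * m for c, m in zip(coef, mean_vec))
    var = sum(coef[i] * cov[i][j] * coef[j] for i in range(n) for j in range(n))
    return mean, var

# Hypothetical four-gene example (k = 2); values are illustrative only.
mean_vec = [1.0, 0.5, -0.2, -0.8]
cov = [
    [0.20, 0.05, 0.05, 0.05],
    [0.05, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.25, 0.05],
    [0.05, 0.05, 0.05, 0.40],
]
mean, var = score_moments(mean_vec, cov, k=2)
```

Using the full covariance matrix rather than only its diagonal is what lets the score reflect dependence among genes.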

[0059] Therefore, a small, high-resolving set of six genes was discovered in the present invention that is suitable for classifying samples with a predictive value of about 99% and, consequently, may be easily implemented in clinical practice. These include: RPL35, RPL35a, RPS28, PODXL, PODXL1, and SRSF5.

[0060] RPL35, and its splice variant RPL35a, have been shown to have depressed expression levels in colorectal cancer (Kasai et al., J. Histochem & Cytochem, 51, 567-573, 2003). RPL35 has also recently been included in an eleven-gene (11) signature able to predict lymph node metastasis in early cervical carcinoma (Huang et al., Cancer, 117, 3363-3373, 2011). RPS28 has been shown as an outcome predictor in breast cancer (Yau et al., Breast Cancer Res, 12, R85, 2010). Podocalyxin (PODXL), a bona fide target of p53, was found to be positively regulated by WT1, and its inappropriate expression contributed to Wilms tumorigenesis (Stanhope-Baker et al., J. Biol. Chem, 279, 33575-85, 2004). Regarding prostate cancer, PODXL was considered an aggressiveness marker (Casey et al., Hum. Mol. Genetics, 15, 735-41, 2006). PODXL expression was also demonstrated in intratubular germ cell neoplasia unclassified (IGCNU), seminomas, and embryonal carcinomas (Biermann et al., Anticancer Res., 27, 3091-100, 2007). Furthermore, PODXL was found to be regulated by miR-199a-5p and over-expressed in malignant testicular tumor (Cheung et al., Oncogene, 2011). Podocalyxin-like protein 1 (PODXL1) expression was found lacking in adenocarcinomas of the lung and prostate, as well as liver metastases of colorectal carcinomas ( ey et al., Hum. Pathol., 38, 359-64, 2007). Further, herein, for the first time, is discovered a role for the differential expression of serine/arginine splicing factor 5 (SRSF5) in prostate cancer.

[0061] Using this set of six genes as "seed" markers, interactions of these genes in the human interactome were searched using only the minimal pathways to establish the connections. To validate these findings, a set of microarray data from 89 patients with prostate cancer, with or without recurrence, was utilized, under the hypothesis that the small, high-quality classifier of the present invention for Gleason 7 tumors could also classify or increase resolution of other Gleason scores because the disruption provoked by this classifier is related to prostate tumorigenesis. The interactions predicted by the model of the present invention were therefore constructed based on the interactome, using as connection values the Pearson correlations between expressed genes in the recurrence and no-recurrence data sets. Upon comparison of the Pearson correlations, it was observed that, within the classifier gene set, interactions with other genes also change between the groups of patients with and without biochemical recurrence. Thus, in addition to the quantitative changes induced by the classifier gene set, an examination of the dynamic structure of the human protein interaction network (interactome) evidenced substantial changes in its organization that may explain, in a dynamic way, differences between tumors and controls by taking into account genes that are not differentially expressed.

[0062] In summary, the present invention demonstrated that tagging cDNA libraries followed by high-throughput sequencing, coupled to a novel analytical tool, leads to the identification of genes whose concerted expression levels can separate recurrent from non-recurrent Gleason 7 prostate cancer patients with 99% certainty, out-performing any other method described in the literature so far. Notably, these genes were found even with a relatively small number of sequences, far below average high-throughput sequencing levels. Interactome cluster analysis further points out several interacting pairs of genes/proteins that change their interaction levels between both groups of patients. These findings could lead to the identification of novel markers or druggable targets.

[0063] Linear Pool from Frequency Composition Data: Application to Gene Sequencing

[0064] In diagnosing a patient according to his digital gene expression profile - i.e., classifying the patient into one of r > 1 possible health conditions - the patient is classified according to his observed vector $z = (z_1, z_2, \ldots, z_k)$ of frequencies associated with the k most differentially expressed genes. The choice of which k counts are to be considered for a diagnostic or a classifier must be made beforehand and based on vectors of frequencies of patients who had their condition previously diagnosed. These data vectors, however, have a number of principal components, g, which is much larger than k. Thus, the goal is to identify the k most valuable "tags" or gene sequences for a diagnostic or a classifier.

[0065] In one example, two alternative conditions exist: r = 2. Data, d, then consist of m and n frequency vectors from patients found to have, respectively, the first and second condition. All m+n vectors have g components corresponding to each of the g considered tags. The g tags are the same for all m+n vectors. The likelihood L generated by d is described by the following function:

for which $p_{ij}$ ($q_{ij}$) is the theoretical gene expression of the j-th tag of the i-th individual under the first (second) health condition; and $x_{ij}$ ($y_{ij}$) is the observed count produced by the read/sample of the j-th tag of the i-th individual under the first (second) health condition.
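One form of the likelihood consistent with these definitions, and with the Dirichlet posteriors used in the next paragraph, is a product of multinomial kernels, one per patient. This is a sketch under that assumption, writing $p_{ij}$, $q_{ij}$ for the theoretical expressions and $x_{ij}$, $y_{ij}$ for the observed counts; it is not a reproduction of the filed expression.

```latex
L(p, q \mid d) \;\propto\;
\prod_{i=1}^{m} \prod_{j=1}^{g} p_{ij}^{\,x_{ij}}
\;\times\;
\prod_{i=1}^{n} \prod_{j=1}^{g} q_{ij}^{\,y_{ij}},
\qquad
\sum_{j=1}^{g} p_{ij} = 1, \quad \sum_{j=1}^{g} q_{ij} = 1 .
```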

[0066] Due to the typically very large number of tags, a Jeffreys' non-informative prior for each individual vector of parameters is used. In one example, a proper prior, the Dirichlet prior distribution with all hyper-parameters equal to ½, is used. Consequently, each individual posterior is Dirichlet with hyper-parameters x + ½ (and y + ½). Useful properties of a Dirichlet prior distribution include: moments, marginal distributions, and transformation.
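The posterior update in this paragraph amounts to adding ½ to every observed count. A minimal sketch, with hypothetical function names and invented read counts:

```python
def jeffreys_posterior(counts):
    """Dirichlet posterior hyper-parameters under a Dirichlet(1/2, ..., 1/2) prior."""
    return [c + 0.5 for c in counts]

def posterior_mean(alpha):
    """Posterior mean of each tag frequency: alpha_j / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical read counts for one patient over four tags (illustrative only).
counts = [12, 0, 7, 1]
alpha = jeffreys_posterior(counts)
freq = posterior_mean(alpha)
```

A tag with zero reads still receives a strictly positive posterior mean, which keeps the later log-ratio transformation well defined.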

[0067] Consider a random vector $W = (w_0, w_1, \ldots, w_k)$ having a Dirichlet distribution with parameter-vector $(a_0, a_1, \ldots, a_k)$. W has the following density in the simplex set:

$S = \{(s_0, \ldots, s_k) : s_j > 0,\ s_0 + \ldots + s_k = 1\}$:
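For reference, the standard Dirichlet density on this simplex, written with parameters $a_0, \ldots, a_k$ and their sum $A$ (the symbol names follow the surrounding text as an assumption), is:

```latex
f(w_0, \ldots, w_k) \;=\;
\frac{\Gamma(A)}{\prod_{j=0}^{k} \Gamma(a_j)}
\prod_{j=0}^{k} w_j^{\,a_j - 1},
\qquad (w_0, \ldots, w_k) \in S, \quad A = \sum_{j=0}^{k} a_j .
```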

[0068] Taking $a_0 + \ldots + a_k = A$, the following properties hold:

[0069] 1. Moments:
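The standard first and second moments of a Dirichlet vector with parameters $a_0, \ldots, a_k$ summing to $A$ are given below as a reference sketch; the selection of moments shown is an assumption and may differ from the one in the filed text.

```latex
E(w_j) = \frac{a_j}{A}, \qquad
\mathrm{Var}(w_j) = \frac{a_j \,(A - a_j)}{A^{2}\,(A + 1)}, \qquad
\mathrm{Cov}(w_i, w_j) = -\,\frac{a_i \, a_j}{A^{2}\,(A + 1)}, \quad i \neq j .
```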

[0070] 2. Marginal distributions:

Consider a subset of l (< k) components of W: $(w_1, w_2, \ldots, w_l)$ and its complement ($w_0 = w_{l+1} + \ldots + w_k$). The vector $(w_1; \ldots; w_l; w_0)$ is distributed as a Dirichlet distribution with parameters $a_1, a_2, \ldots, a_l, a_0$ for $a_0 = a_{l+1} + \ldots + a_k$. In particular, the variable $(w_j; 1 - w_j)$ or, for short, $w_j$ has a beta distribution with parameters $a_j$ and $A - a_j$, the sum of the remaining parameters.
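The aggregation property can be sketched in a few lines: keeping a subset of components and pooling the rest only requires summing the pooled parameters. The helper names and parameter values below are hypothetical.

```python
def marginal_dirichlet(alpha, keep):
    """Parameters of the Dirichlet for the kept components plus one pooled remainder."""
    keep_set = set(keep)
    kept = [alpha[i] for i in keep]
    rest = sum(a for i, a in enumerate(alpha) if i not in keep_set)
    return kept + [rest]

def beta_marginal(alpha, j):
    """Beta parameters (a_j, A - a_j) for a single component."""
    return alpha[j], sum(alpha) - alpha[j]

# Illustrative parameter vector (not from the source data).
alpha = [2.0, 3.0, 5.0, 1.5, 0.5]
reduced = marginal_dirichlet(alpha, [1, 2])
beta_params = beta_marginal(alpha, 2)
```

This is why a classifier restricted to k of the g tags can be handled without ever evaluating the full g-dimensional posterior.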

[0071] 3. Transformation: The distribution of the vector T is approximately multivariate normal,

with means, variances, and covariances defined as:

$m_j = E(t_j) = \psi(a_j) - \psi(a_0)$; $v_j = \mathrm{Var}(t_j) = \psi'(a_j) + \psi'(a_0)$;

$c_{ij} = \mathrm{Cov}(t_i, t_j) = \psi'(a_0)$, for $i, j = 1, \ldots, k$; $i \neq j$.

The functions ψ and ψ' are respectively the digamma (derivative of the logarithm of the gamma function) and the trigamma (derivative of the digamma).
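These moments can be computed numerically as sketched below, assuming $t_j = \ln(w_j / w_0)$. The Python standard library has no digamma or trigamma, so the sketch substitutes a common approximation (upward recurrence followed by an asymptotic series) valid for positive arguments; the function names and parameter values are hypothetical.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def trigamma(x):
    """Approximate trigamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + inv + 0.5 * inv2 + series

def log_ratio_moments(alpha, alpha0):
    """Means, variances and the common covariance of t_j = ln(w_j / w_0)."""
    means = [digamma(a) - digamma(alpha0) for a in alpha]
    variances = [trigamma(a) + trigamma(alpha0) for a in alpha]
    covariance = trigamma(alpha0)
    return means, variances, covariance

# Illustrative posterior parameters: three tags plus a pooled remainder.
means, variances, covariance = log_ratio_moments([12.5, 7.5, 1.5], 200.5)
```

When the pooled remainder parameter is large, the common covariance is small, so the selected tags are only weakly coupled through the reference component.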

[0072] In one example, to preserve the gene expression variability among individuals, patients are taken as primary sampling units, and each individual provides a likelihood which is combined with Jeffreys' prior to yield his gene expression profile Dirichlet posterior. To avoid the computational troubles brought by Dirichlet distributions that have large parameter values and high dimension, the parameters are transformed so that normal distributions can be used. Thus, for individuals i and j from groups one and two, respectively, and taking $p_{i0} = p_{i,k+1} + \ldots + p_{ig}$ and $q_{j0} = q_{j,k+1} + \ldots + q_{jg}$, the vectors $(p_{i0}; p_{i1}; \ldots; p_{ik})$ and $(q_{j0}; q_{j1}; \ldots; q_{jk})$ are reparameterized as:

$\left(\ln\frac{p_{i1}}{p_{i0}}; \ldots; \ln\frac{p_{ik}}{p_{i0}}\right)$ and $\left(\ln\frac{q_{j1}}{q_{j0}}; \ldots; \ln\frac{q_{jk}}{q_{j0}}\right)$

The posterior density of such transformed parameters is approximately Multivariate Normal with moments:

} } ^} _A J J ^J _{A + l} i - J ^J _{A + l}

Thus, for any chosen tag t used for individual i of the first group and any individual j of the second group: $p_{i0} = 1 - p_{it}$ and $q_{j0} = 1 - q_{jt}$.

[0073] In one example, to build a group-posterior in which every patient contributes while maintaining his variability, a weighted mean of individual densities for the group-density is calculated. This pooling method is pertinent to meta-analysis or synthesis-analysis contexts. The weights to be used in the linear pool of densities are the sizes of the individual sequencing libraries in the group. The procedure is performed equally for both groups, leaving one with two posterior densities, one for each group. For every chosen tag t (= 1, ..., g), $x_t$ and $y_t$ are considered to be independent random variables distributed according to the two group-marginal posterior densities relative to tag t, and the probability that $x_t > y_t$ is consequently computed. These probabilities are then ordered according to the most differentially expressed tags - i.e., those for which the probability of $x_t > y_t$ is closest to 0 or to 1.
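The probability that $x_t > y_t$ has a closed form when each group density is a weighted pool of normals on the log-odds scale, which is the approximation assumed in this sketch. Because the log-odds transform is increasing, the probability is the same on the frequency scale. The function names and all values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_greater(group_x, group_y):
    """P(X > Y) for independent X and Y, each a weighted mixture of normals.

    Each group is a list of (weight, mean, sd); weights are library sizes
    and are normalized inside the function.
    """
    wx = float(sum(w for w, _, _ in group_x))
    wy = float(sum(w for w, _, _ in group_y))
    total = 0.0
    for w1, m1, s1 in group_x:
        for w2, m2, s2 in group_y:
            z = (m1 - m2) / math.sqrt(s1 * s1 + s2 * s2)
            total += (w1 / wx) * (w2 / wy) * normal_cdf(z)
    return total

# Illustrative per-patient (library size, mean, sd) for one tag in each group.
group_x = [(30000, 1.2, 0.3), (20000, 0.9, 0.4)]
group_y = [(25000, -0.5, 0.3), (25000, -0.2, 0.5)]
p_tag = prob_greater(group_x, group_y)
```

Repeating this computation for every tag gives the list of probabilities that is then ordered by closeness to 0 or 1.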

[0074] The previous examples are provided solely to enable a person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein can be applied to other embodiments without departing from the spirit or scope of the invention.

Claims

CLAIMS What is claimed is:

1. A predictive method of classifying a latent biological sample according to a more or less probable phenotypic outcome, comprising:

a. selecting a phenotypic outcome;

b. obtaining a primary molecular data collection from a subject population existing as latent classes A and B having a set of distinct molecular features, wherein a subject of latent class A exhibits the selected phenotypic outcome and a subject of latent class B does not;

c. approximating a posterior distribution of molecular data frequencies for each subject in the population;

d. constructing two independent compositional data frequencies for each molecule in the data collection according to latent classes A or B;

e. comparing the compositional data frequencies of latent classes A and B for each molecule and selecting the most differentially resolving molecules into a small, high-quality classifier;

f. increasing resolution of the small, high-quality classifier by integrating the compositional data frequencies of the most differentially resolving molecules into a pair of marginal, multivariate probability densities according to latent classes A or B;

g. calculating a multivariate probability density for a latent biological sample according to the small, high-quality classifier; and

h. classifying the biological sample according to a more or less probable phenotypic outcome by comparing the sample multivariate density to the pair of marginal densities.

2. The predictive method of claim 1, wherein the phenotypic outcome comprises: a morphological, developmental, biochemical or physiological property; a behavior or a product of behavior; or combinations thereof.

3. The predictive method of claim 1, wherein the probability I of a phenotypic outcome is generated by a function:

4. The predictive method of claim 1, wherein the molecular data collected comprises: transcriptomic, genomic, or proteomic data, or combinations thereof.

5. The predictive method of claim 1, wherein the posterior distribution of molecular frequencies is approximated by a logistic normal distribution, logit, logits, probit, logistic function, logistic regression, log-odds, a logit-normal distribution, or equivalents thereof.

6. The predictive method of claim 5, wherein approximating a posterior distribution of molecular frequencies includes a Jeffreys' non-informative prior.

7. The predictive method of claim 5, wherein the posterior distribution of molecular frequencies is approximated by a logistic normal distribution.

8. The predictive method of claim 5, wherein a corresponding log-odds ratio vector has an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions.

9. The predictive method of claim 8, wherein the posterior distribution of molecular data frequencies for each subject is further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

10. The predictive method of claim 1, wherein the compositional data frequencies are group probability densities, joint probability densities, multivariate densities, or equivalents thereof.

11. The predictive method of claim 10, wherein a corresponding log-odds ratio vector has an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions.

12. The predictive method of claim 11, wherein the compositional data frequencies are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

13. A predictive method of classifying an alternative disease state according to a more or less favorable prognosis, comprising:

a. obtaining a primary data collection;

b. statistically generating a small, high-resolution classifier from the primary data; and

c. using the classifier in a clinical setting to provide a differential diagnostic of disease.

14. The predictive method of claim 13, wherein the alternative disease state exists as latent classes A and B corresponding to a more or less favorable prognosis, and wherein the latent classes exhibit a distinct set of molecular features.

15. The predictive method of claim 13, wherein the primary data comprises: transcriptomic, genomic, or proteomic data, or combinations thereof.

16. The predictive method of claim 15, wherein the primary data collection includes an extraction of mRNA from two biologically distinct groups of patients corresponding to latent classes A and B, and a generation of cDNA libraries tagged by probes specific to individual patients.

17. The predictive method of claim 16, wherein the primary data collection further includes a qualitative sequencing of the cDNA libraries as a high-throughput batch, and a de-convolution of sequencing results based on the tags.

18. The predictive method of claim 17, wherein the high-throughput batch sequencing is a massively parallel signature sequencing, polony sequencing, parallelized pyrosequencing, reversible dye-terminator sequencing, ligation sequencing, ion semiconductor sequencing, DNA nanoball sequencing, single molecule sequencing, nanopore DNA sequencing, hybridization sequencing, microfluidic Sanger sequencing, or equivalents thereof.

19. The predictive method of claim 17, wherein each gene in the two biologically distinct groups of patients is distributed differentially or identically, determining its predictive value.

20. The predictive method of claim 17, wherein a posterior distribution of molecular frequencies is approximated for each individual patient.

21. The predictive method of claim 20, wherein the posterior distribution of molecular frequencies approximated for each individual patient is calculated under a Jeffreys' non-informative prior.

22. The predictive method of claim 21, wherein the posterior distribution of molecular frequencies is approximated by a logistic normal distribution, logit, logits, probit, logistic function, logistic regression, log-odds, a logit-normal distribution, or equivalents thereof.

23. The predictive method of claim 22, wherein the posterior distribution of molecular frequencies is approximated by a logistic normal distribution.

24. The predictive method of claim 23, wherein a corresponding log-odds ratio vector has an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions.

25. The predictive method of claim 20, wherein the posterior distribution of molecular data frequencies for each patient is further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

26. The predictive method of claim 14, wherein two independent compositional data frequencies corresponding to latent classes A and B are constructed for each gene in the data collection.

27. The predictive method of claim 26, wherein the compositional data frequencies are group probability densities, joint probability densities, multivariate densities, or equivalents thereof.

28. The predictive method of claim 27, wherein the compositional data frequencies are group probability densities.

29. The predictive method of claim 27, wherein the corresponding log-odds ratio vectors have an asymptotic multivariate normal distribution with means and covariance matrix wholly determined by using digamma and trigamma functions.

30. The predictive method of claim 29, wherein the compositional data frequencies are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

31. The predictive method of claim 26, wherein the group probability densities of latent classes A and B are compared for each gene, and wherein the most differentially resolving genes are selected into a small, high-quality classifier.

32. The predictive method of claim 31, wherein resolution of the small, high-quality classifier is increased by integrating the group probability densities of the most differentially resolving molecules into a pair of marginal, multivariate probability densities and corresponding log-odds vectors according to latent classes A or B.

33. The predictive method of claim 32, wherein the marginal, multivariate probability densities are further reassessed by a synergistic Bayesian probability, conditional probability, posterior probability, or equivalents thereof.

34. The predictive method of claim 31, wherein for an individual patient the cDNA expression values for the genes being members of the classifier are compared with the pairs of marginal group probability densities corresponding to a more or less favorable prognosis.

35. The predictive method of claim 13, wherein the likelihood L of a prognosis is generated by the function:

36. A predictive method of classifying Gleason 7 stage prostate cancer according to a more or less probable recidivism, comprising:

a. extracting mRNA from a Gleason 7 stage subject population existing as latent recidivist and non-recidivist classes A and B having a set of distinct molecular features, and generating cDNA libraries tagged by probes specific to individual patients;

b. qualitatively sequencing the cDNA libraries as a high-throughput batch, and de-convoluting the sequencing results based on the patient-specific tags;

c. approximating a posterior distribution of molecular data frequencies for each subject, under a Jeffreys non-informative prior, by a logistic normal distribution;

d. constructing two independent group probability densities for each gene in the data collection according to latent classes A or B;

e. comparing the group probability densities of latent classes A and B for each gene, distributing each gene differentially or identically according to these probability densities, and selecting the most differentially resolving genes into a small, high-quality classifier;

f. increasing resolution of the small, high-quality classifier by integrating the group probability densities of the most differentially resolving genes into a pair of marginal, multivariate probability densities and corresponding log-odds ratio vectors according to latent classes A or B;

g. calculating a multivariate probability density for a next Gleason 7 stage patient according to the genes being members of the small, high-quality classifier; and

h. classifying in a clinical setting the next Gleason 7 stage patient according to a more or less probable recidivism by referring to the pairs of marginal group probability densities.

37. The predictive method of claim 36, wherein the genes having distinct expression between latent classes A and B possess probabilities near zero or one.

38. The predictive method of claim 37, wherein a probability near one means that the gene is more expressed in one latent class relative to the other, and wherein a probability near zero means that the gene is less expressed in one latent class relative to the other.

39. The predictive method of claim 36, wherein the small, high-quality classifier comprises a single gene.

40. The predictive method of claim 39, wherein the single gene is RPL35.

41. The predictive method of claim 39, wherein the single gene is SRSF5.

42. The predictive method of claim 36, wherein the small, high-quality classifier comprises from about one to about twelve genes.

43. The predictive method of claim 42, wherein the small, high-quality classifier comprises genes selected from the group consisting of: RPL35, RPS28, C12orf57, NFKBIZ, RPS15, UBA52, PNN, MTRNR2L10, SLC25A4, PODXL, SRSF5, and LOC100293090.

44. The predictive method of claim 42, wherein the predictive value of the classifier is about 99%.

45. The predictive method of claim 42, wherein the classifier is suitable for diagnostic use in a clinical setting.

46. The predictive method of claim 36, wherein the probability L of recidivism is generated by the function: