WO2009067655A2 - Methods of feature selection through local learning; breast and prostate cancer prognostic markers - Google Patents

Methods of feature selection through local learning; breast and prostate cancer prognostic markers

Info

Publication number
WO2009067655A2
WO2009067655A2 (PCT/US2008/084325)
Authority
WO
WIPO (PCT)
Prior art keywords
rbm34
rpl23
tgfb3
pak3
margin
Prior art date
Application number
PCT/US2008/084325
Other languages
French (fr)
Other versions
WO2009067655A3 (en)
Inventor
Yijun Sun
Steve Goodison
Li Liu
William George Farmerie
Original Assignee
University Of Florida Research Foundation, Inc.
Priority date
Filing date
Publication date
Application filed by University Of Florida Research Foundation, Inc. filed Critical University Of Florida Research Foundation, Inc.
Publication of WO2009067655A2 publication Critical patent/WO2009067655A2/en
Publication of WO2009067655A3 publication Critical patent/WO2009067655A3/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • Machine learning involves the design and development of algorithms and techniques that allow computers to "learn." The algorithms are designed to improve automatically through experience. Machine learning research has focused on the ability to extract information from data automatically by computational and statistical methods. Machine learning applications include, but are not limited to, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, cheminformatics, credit card fraud detection, stock market analysis, DNA sequence classification, speech and handwriting recognition, object recognition in computer vision, game playing, and robot locomotion.
  • Feature selection is a fundamental problem in machine learning. With the advent of high throughput technologies, feature selection has become increasingly important in a wide range of scientific disciplines. The goal of feature selection is to extract the most relevant information about each observed datum from a potentially overwhelming quantity of its features. Here, relevant information means those features whose discriminative properties facilitate the underlying data analysis.
  • An oligonucleotide microarray can be used for the identification of cancer-associated gene expression profiles of diagnostic or prognostic value, as discussed in "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring" by Golub et al., Science, 286, 531-537 (1999), "Gene expression profiling predicts clinical outcome of breast cancer" by van't Veer et al., Nature, 415, 530-536 (2002), and "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer" by Wang et al., Lancet, 365, 671-679 (2005).
  • In such applications, the number of features (genes) associated with the raw data is on the order of thousands or even tens of thousands.
  • Of the genes associated with the raw data, only a small fraction is likely to be relevant for cancerous tumor growth and/or spread.
  • the abundance of irrelevant features poses serious problems for existing machine learning algorithms, and represents one of the most recalcitrant problems for their application in oncology and other scientific disciplines dealing with copious features.
  • Feature selection algorithms are typically categorized as wrapper or filter methods with respect to the criteria used for searching relevant features.
  • In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset.
  • In filter methods, criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., t-test), rather than by directly optimizing the performance of any specified learning algorithm.
  • There are a number of problems with wrapper methods.
  • One major issue with wrapper methods is their high computational complexity.
  • Many heuristic algorithms have been proposed to alleviate this computational issue.
  • a hybrid approach is usually adopted, wherein the number of features is first reduced by using a filter method, and then a wrapper method is used on the reduced feature set. Nevertheless, it still may take several hours to perform the search, depending on the classifier used in the wrapper method.
  • For computational reasons, a simple classifier (e.g., a linear classifier) is often used during the search, and the selected features are then fed into a more complicated classifier in the subsequent data analysis. This gives rise to the issue of feature exportability: in some cases, a feature subset that is optimal for one classifier may not work well for others.
  • Another issue associated with a wrapper method is the capability to perform feature selection for multiclass problems. To a large extent, this property depends on the capability of the classifier used in the wrapper method to handle multiclass problems. Some classifiers originally designed for binary problems can be naturally extended to multiclass settings, while for others the extension is not straightforward.
  • In some cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method, such as described in "Solving multiclass learning problems via error-correcting output codes," by Dietterich et al., J. Artif. Intell. Res., 2, 263-286 (1995), and "Unifying error-correcting and output-code AdaBoost through the margin concept," by Sun et al., Proc. 22nd Int. Conf. Mach. Learn., 872-879 (2005). Then, feature selection is performed for each binary problem. This strategy further increases the computational burden of a wrapper method.
  • Embedded methods incorporate feature selection into the learning process of a classifier.
  • a feature weighting strategy is usually adopted that uses real-valued numbers, instead of binary ones, to indicate the relevance of features in a learning process.
  • the present invention provides solutions for large-scale feature selection problems for scientific and industrial applications.
  • the systems and methods of the subject invention address and/or substantially obviate one or more problems, limitations, and/or disadvantages of the prior art.
  • The present invention provides a method for feature selection incorporating decomposing a complex nonlinear problem into a set of locally linear problems using local learning, and estimating feature relevance globally within a large margin framework.
  • The local learning allows one to capture the local structure of the data, while the parameter estimation is performed globally to avoid possible overfitting.
  • a method capable of handling arbitrarily complex, nonlinear problems.
  • a method of feature selection that addresses problems of computational complexity, solution accuracy, and esoteric implementation.
  • the methods for feature selection can be used to predict metastatic behavior in medical applications such as the behavior of tumors in oncology.
  • The methods for feature selection can be used for pattern classification and computer vision in computer science, and for target recognition and signal classification in electrical engineering.
  • One aspect of the invention concerns a method of providing a prognosis, comprising obtaining (e.g., generating or supplying) data wherein the data comprises gene expression levels for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME.
  • The prognosis of the patient is obtained by obtaining or generating expression profiles for LOC58509, CEGP1, AL080059, ATP5E and PRAME.
  • The invention concerns a method of assigning a prognosis class to a patient, comprising: (a) receiving gene expression data or generating gene expression data, wherein the data comprises a gene expression profile for a plurality of genes selected from LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and (b) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME, and wherein the prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence.
  • Another aspect of the invention concerns a method of assigning treatment to a breast cancer patient, comprising: (a) obtaining a biological sample of the breast cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes selected from LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; (c) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and (d) assigning, recommending or providing treatment to the breast cancer patient based wholly or in part on the patient's particular prognosis class.
  • Another aspect of the invention is an article comprising a plurality of means for detecting the expression of a gene, wherein the individual means for detecting are each directed to detection of the product(s) of a particular gene, and the plurality of means for detecting detects expression of a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or subcombinations thereof.
  • each said individual means for detecting is a polynucleotide probe, an antibody, a set of one or more components capable of adapting the article to implementation of a demonstrated specific method of analysis, a set of one or more components capable of adapting the article to implementation of a validated specific method of analysis, or a set of one or more components capable of adapting the article to implementation of an approved specific method of analysis.
  • the invention further relates in part to prognostic markers for prostate cancer.
  • One aspect of the invention includes prognostic signatures shown to perform comparably to and/or to outperform a common clinically-used postoperative nomogram (such as those provided on the world wide web at www.mskcc.org/mskcc/html/10088.cfm).
  • a prognostic signature based purely on gene expression information and a hybrid prognostic signature based on both clinical data and gene expression information are shown to be useful for determining accurate prostate cancer prognoses.
  • Genes and/or sequences shown to be useful prognostic markers include, without limitation, PCOLN3, TGFB3, PAK3, and RBM34.
  • One aspect of the invention concerns a method of assisting in developing a prognosis for a patient, comprising obtaining (e.g., generating or receiving) or supplying data for a plurality of genes selected from one or more of the following genes: PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and/or ZNF324B, or various combinations thereof, and assigning a prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) to the patient on the basis of the gene expression.
  • The prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence. For patients predicted to have a high risk of cancer recurrence (e.g., a "bad prognosis"), increased frequency of postoperative surveillance for the recurrence of cancer is performed. For patients with a low risk of cancer recurrence, standard surveillance for the recurrence of cancer can be performed.
  • Another aspect of the invention concerns a method of assigning a prognosis class to a patient comprising: (a) obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises both a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, and the result of a postoperative nomogram evaluation; and (b) assigning a prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) to the patient based upon both.
  • Another aspect of the invention concerns a method of assigning treatment to a prostate cancer patient comprising: (a) obtaining a biological sample from the prostate cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B; (c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile, wherein the subset comprises gene expression levels for said plurality of genes; and (d) assigning treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
  • Yet another aspect of the invention concerns a method of assigning treatment to a prostate cancer patient comprising: (a) obtaining a biological sample from the prostate cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B; (c) performing upon the patient a postoperative nomogram evaluation; (d) classifying the patient as belonging to a prognosis class based upon: (1) a subset of the gene expression profile, wherein the subset comprises gene expression levels for the plurality of genes, and (2) the result of the postoperative nomogram evaluation; and (e) assigning treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
  • Figures 1A and 1B show Fermat's spiral problem.
  • Figure 1A shows a two-dimensional spiral-shaped distribution of samples belonging to two classes;
  • Figure 1B shows a local learning transformation of the spiral problem into a set of locally linear problems according to an embodiment of the present invention.
  • Figure 2 shows iteratively refined estimates of weight vector and probabilities until convergence according to an embodiment of the present invention.
  • Figures 3A-3H show feature weights learned on the spiral dataset with different numbers of irrelevant features, ranging from 50 to 30,000, according to an embodiment of the present invention.
  • Figure 4 shows run times of a feature selection process used on the spiral dataset according to embodiments of the present invention.
  • Figures 5A-5E show feature weights learned on the spiral dataset with different regularization parameters according to embodiments of the present invention.
  • Figures 6A-6F show feature weights learned on the spiral dataset with different kernel widths according to embodiments of the present invention.
  • Figure 7 shows a plot for convergence analysis of embodiments of the present invention.
  • Figures 8A-8G show feature weights learned using an embodiment of the present invention on seven UCI datasets.
  • Figure 9 illustrates the experimental protocol.
  • the experimental protocol consists of inner and outer loops.
  • LOOCV is performed to estimate the optimal classifier parameters based on the training data provided by the outer loop, and in the outer loop, a held-out sample is classified using the best parameters from the inner loop. The experiment is repeated until each sample has been tested.
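  • A minimal sketch of this nested leave-one-out protocol, written in Python with scikit-learn, is given below. The classifier and candidate parameter grid (a linear SVM with a short list of C values) are placeholder assumptions; the protocol itself is classifier-agnostic.

        # Nested LOOCV: the inner loop picks parameters on the outer training
        # fold only; the outer loop scores one held-out sample per iteration.
        import numpy as np
        from sklearn.model_selection import LeaveOneOut, GridSearchCV
        from sklearn.svm import SVC

        def nested_loocv_error(X, y):
            errors = []
            for train_idx, test_idx in LeaveOneOut().split(X):
                # inner loop: LOOCV over the outer training data to estimate
                # the best classifier parameters
                inner = GridSearchCV(SVC(kernel="linear"),
                                     param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                                     cv=LeaveOneOut())
                inner.fit(X[train_idx], y[train_idx])
                # outer loop: classify the held-out sample with those parameters
                errors.append(inner.predict(X[test_idx])[0] != y[test_idx][0])
            return float(np.mean(errors))  # repeated until every sample is tested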
  • Figure 10 provides receiver operating characteristic (ROC) curves of four breast cancer prognostic signatures.
  • the legends New-LDA and New-Corr refer to the prediction results of the gene signature identified according to an embodiment of the present invention, as obtained by using LDA and a correlation-based classifier, respectively.
  • Figures 11A-11D show Kaplan-Meier estimations of the probabilities of remaining distant metastasis free in patients with a good or bad breast cancer prognosis, determined by the new signature according to an embodiment of the invention (11A), the St. Gallen criterion (11B), the 70-gene signature (11C), and the hybrid signature (11D).
  • The p-value is computed by use of the log-rank test.
  • Figure 12 shows feature weights of genes identified according to embodiments of the present invention.
  • Figure 13 shows run time of a feature selection process used to identify a gene signature for a breast cancer dataset according to embodiments of the present invention.
  • Figure 14 presents receiver operating characteristic (ROC) curves comparing the prediction performance of the nomogram, genetic, and hybrid models for prostate cancer prognosis.
  • Figures 15A-15C show Kaplan-Meier estimations of the probabilities of remaining biochemical recurrence free for patients with a good or bad prostate cancer prognosis, determined by using the hybrid (15A), nomogram (15B), and microarray (15C) models. The p-value is computed in each case by use of the log-rank test.
  • Figures 16A-K show scatter plots of prostate cancer gene markers demonstrating clear up- or down-regulation between patients with and without biochemical recurrence.
  • the present invention provides methods for feature selection in problems across many disciplines dealing with copious features, including but not limited to bioinformatics, economics, and computer vision.
  • a feature selection algorithm is provided that can be used in embodiments of the present invention.
  • the feature selection algorithm can be used to address issues with prior work, including the problems with computational complexity, solution accuracy, esoteric implementation, capability of handling an extremely large number of features, exportability of selected features, and extension to multiclass problems.
  • an arbitrarily complex, nonlinear problem can be decomposed into a set of locally linear problems through local learning, and the relevance of features can be estimated globally within a large margin framework.
  • Local learning allows one to capture local structure of the data, while the global parameter estimation allows one to avoid, or reduce, possible overfitting.
  • the local learning relates to computational systems.
  • a model learns by slowly changing a set of parameters, called weights, during presentations of input data.
  • the success of learning is measured by an error function. Learning is completed when the weights are sufficiently close to the values that minimize the error function. Under the best circumstances, the weights become successively closer to the optimal weights with additional training. This behavior can also be considered as convergence.
  • the local learning equations slowly change the weights until the error function is minimized.
  • the ability to perform local learning allows for use of local knowledge that can then be affected by global information.
  • embodiments of the present invention have many advantages.
  • the algorithm of the present invention is capable of processing many thousands of features within a few minutes on a personal computer, while maintaining a close-to-optimum accuracy that is nearly insensitive to a growing number of irrelevant features. Due to local learning, wherein no assumption is made about data distributions, selected feature subsets exhibit excellent exportability to different classifiers. Unlike most prior work whose implementation demands an expertise in machine learning, embodiments of the present invention are very easy to implement.
  • The subject algorithm can be coded in Matlab™ in fewer than one hundred lines.
  • The subject algorithm has two user-defined parameters that can alternatively be estimated through cross validation. According to an exemplary embodiment, the two user-defined parameters are the kernel width and the regularization parameter. The performance of embodiments of the subject algorithm is largely insensitive to a wide range of values for these two parameters, which makes parameter tuning, and hence the implementation of the algorithm, easy in real applications.
  • The margin is defined for the training dataset. Given a particular distance function, the two nearest neighbors of each sample x_n, one from the same class (called the nearest hit or NH) and the other from a different class (called the nearest miss or NM), can be found (see "A practical approach to feature selection" by Kira et al., Proc. 9th Int. Conf. Mach. Learn., 249-256 (1992)).
  • each feature is scaled, and thus a weighted feature space can be obtained, parameterized by a nonnegative vector w, so that a margin-based criterion function in the induced feature space is maximized.
  • the parameterization can be a linking of the scaled features to a feature weight vector.
  • the magnitude of each element of w in the above margin definition reflects the relevance of the corresponding feature in a learning process.
  • The margin thus defined requires only information about the neighborhood of x_n, while no assumption is made about the underlying data distribution.
  • To illustrate, the well-known Fermat's spiral problem is considered, as shown in Fig. 1A. Referring to Fig. 1A, samples belonging to two classes are distributed in a two-dimensional space, forming a spiral shape. Local learning transforms the nonlinear spiral problem into a set of locally linear problems. By projecting the transformed data z_n onto the feature weight vector w, as depicted in Fig. 1B, it can be seen that most samples have positive margins.
  • The margin can be estimated by taking the expectation of \rho_n(\mathbf{w}) with the latent variables averaged out: \bar{\rho}_n(\mathbf{w}) = E_{i \sim \mathcal{M}_n}\left[\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}}\right] - E_{i \sim \mathcal{H}_n}\left[\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}}\right], where \mathcal{M}_n and \mathcal{H}_n index the candidate nearest misses and nearest hits of \mathbf{x}_n, and \|\cdot\|_{\mathbf{w}} denotes the weighted distance.
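  • As a concrete illustration of this margin, the short Python sketch below computes, for each training sample, the hypothesis margin rho_n = d(x_n, NM(x_n)) - d(x_n, NH(x_n)). The L1 distance is an illustrative assumption; the definition above only requires "a particular distance function".

        import numpy as np

        def hypothesis_margins(X, y):
            """X: (N, J) samples; y: (N,) class labels. Returns one margin per sample."""
            N = X.shape[0]
            rho = np.empty(N)
            for n in range(N):
                d = np.abs(X - X[n]).sum(axis=1)   # L1 distance to every sample
                d[n] = np.inf                      # exclude the sample itself
                nh = d[y == y[n]].min()            # distance to the nearest hit
                nm = d[y != y[n]].min()            # distance to the nearest miss
                rho[n] = nm - nh                   # positive => locally well separated
            return rho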
  • the problem of learning feature weights can be directly solved within a large margin framework.
  • The two most popular margin formulations are SVM (see "Statistical Learning Theory" by Vapnik (Wiley, New York, 1998)) and logistic regression (see "Pattern Recognition and Machine Learning" by Bishop (Springer, 2006)). Due to the nonnegativity constraint on w, the SVM formulation represents a large-scale optimization problem. For computational convenience, the estimation is performed in the logistic regression formulation, which leads to the following optimization problem (equation (7)): \min_{\mathbf{w} \ge 0} \sum_{n=1}^{N} \log\left(1 + \exp(-\mathbf{w}^T \bar{\mathbf{z}}_n)\right) + \lambda \|\mathbf{w}\|_1, where \bar{\mathbf{z}}_n is the expected margin vector of the n-th sample and \lambda is a regularization parameter.
  • a recursion method can be used to solve for w.
  • z n is first computed by using the previous estimate of w, which is then updated by solving the optimization problem of equation (7). The iterations are carried out until convergence.
  • equation (7) is a constrained convex optimization problem.
  • The nonnegativity constraint can be removed by reparameterizing the weights as w_j = v_j^2, 1 ≤ j ≤ J.
  • the main complexity of the algorithm comes from computing pairwise distances between data samples.
  • The computational complexity of the algorithm is O(TN^2 J), where T is the number of iterations, J is the feature dimensionality, and N is the number of data samples.
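  • The following is a minimal Python sketch of the iterative scheme described above. It assumes a weighted L1 distance, an exponential kernel exp(-d/sigma) for the nearest-miss/hit probabilities, and a fixed-step projected gradient update for the logistic-loss objective; these specifics (and all names) are illustrative assumptions rather than the patented implementation. The pairwise-distance step makes the per-iteration cost O(N^2 J), matching the complexity stated above.

        import numpy as np

        def local_learning_weights(X, y, sigma=2.0, lam=1.0, n_iter=20, lr=0.1):
            """X: (N, J) data; y: (N,) binary labels. Returns feature weights w >= 0."""
            N, J = X.shape
            w = np.ones(J)                               # uniform initial point
            diff = np.abs(X[:, None, :] - X[None, :, :]) # (N, N, J) coordinate gaps
            for _ in range(n_iter):                      # fixed count for simplicity;
                dist = diff @ w                          # the text iterates to convergence
                np.fill_diagonal(dist, np.inf)           # exclude self-distances
                zbar = np.zeros((N, J))
                for n in range(N):
                    hit = (y == y[n]); hit[n] = False
                    miss = (y != y[n])
                    # kernel-smoothed probabilities of being the NH / NM of x_n
                    p_hit = np.exp(-dist[n, hit] / sigma);   p_hit /= p_hit.sum()
                    p_miss = np.exp(-dist[n, miss] / sigma); p_miss /= p_miss.sum()
                    # expected margin vector: E[|x_n - x_miss|] - E[|x_n - x_hit|]
                    zbar[n] = p_miss @ diff[n, miss] - p_hit @ diff[n, hit]
                # gradient of sum_n log(1 + exp(-w^T zbar_n)) + lam * ||w||_1
                m = np.clip(zbar @ w, -500, 500)         # avoid overflow in exp
                grad = -(zbar * (1.0 / (1.0 + np.exp(m)))[:, None]).sum(axis=0) + lam
                w = np.maximum(w - lr * grad, 0.0)       # projected gradient step
            return w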
  • Embodiments of the subject algorithm can be utilized in methods including, but not limited to, detecting or prognosticating a disease or condition, pattern recognition and identification, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, cheminformatics, credit card fraud detection, stock market analysis, DNA sequence classification, speech and handwriting recognition, object recognition in computer vision, game playing, and robot locomotion.
  • the dataset can include data obtained from an image or sound recording of a physical object.
  • a simulation study is performed on an artificially generated dataset, carefully designed to verify various important properties of the algorithm.
  • This example, also called Fermat's spiral problem, is a binary classification problem.
  • Each class has 230 samples distributed in a two-dimensional space, forming a spiral shape, as illustrated in Fig. 1A.
  • In addition, each sample is represented by a varying number of irrelevant features, where this number is set to {50, 100, 500, 1000, 5000, 10000, 20000, 30000}.
  • The number 30,000 exceeds the number of features encountered in most scientific fields. For example, human beings have about 25,000 genes, and hence nearly all gene expression microarray platforms have fewer than 25,000 probes.
  • the added irrelevant features are independently sampled from the zero-mean and unit-variance Gaussian distribution.
  • the task for this simulation is to identify the first two relevant features. Note that only if these two features are used simultaneously can the two classes of samples be well separated. Most filter and wrapper approaches perform poorly on this example, since in the former the goodness of each feature is evaluated individually, while in the latter the search for relevant features is performed heuristically.
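  • The sketch below shows one way such a dataset can be generated in Python: 230 samples per class on opposite arms of a Fermat spiral (r = sqrt(theta)), plus irrelevant features drawn independently from N(0, 1). The spiral constants are illustrative assumptions, since the exact parameterization used in the experiments is not reproduced here.

        import numpy as np

        def make_spiral(n_per_class=230, n_irrelevant=10000, seed=0):
            rng = np.random.default_rng(seed)
            theta = np.linspace(0.5, 4 * np.pi, n_per_class)
            r = np.sqrt(theta)                 # Fermat's spiral: r = sqrt(theta)
            arm0 = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
            arm1 = -arm0                       # second class: the opposite arm
            X_rel = np.vstack([arm0, arm1])    # the two relevant features
            X_irr = rng.standard_normal((2 * n_per_class, n_irrelevant))
            X = np.hstack([X_rel, X_irr])      # relevant features come first
            y = np.array([0] * n_per_class + [1] * n_per_class)
            return X, y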
  • Fig. 2 illustrates the dynamics of the algorithm performed on the spiral data with 10,000 irrelevant features.
  • Each sample is colored according to its probability of being the nearest miss or hit of a given sample indicated by the black cross in Fig. 2. With a uniform initial point, the nearest neighbors defined in the original feature space can be completely different from the true ones.
  • the plot also shows that the algorithm converges to a perfect solution in just three iterations.
  • Figs. 3A-3H present the feature weights learned by the algorithm on the spiral data for a varying number of irrelevant features.
  • The results are obtained for parameters σ and λ set to 2 and 1, respectively, while the same solution holds for a wide range of other values of the kernel width and regularization parameter (insensitivity to the specific choice of these parameters is discussed shortly).
  • Figs. 3A-3H show that the subject algorithm performs remarkably well over a wide range of feature-dimensionality values that are of practical interest, while using the same parameters. Also note that the feature weights learned across all feature-dimensionality values are almost identical. The algorithm is computationally very efficient.
  • Fig. 4 shows the CPU time it takes the algorithm to perform feature selection on the spiral dataset with different numbers of irrelevant features.
  • the computer setting is Pentium4 2.80 GHz with 2.00 GB RAM.
  • the algorithm runs for only 3.5 seconds for the problem with 100 features, 37 seconds for 1000 features, and 372 seconds for 20,000 features.
  • the computational complexity is linear with respect to the feature dimensionality.
  • the existing wrapper methods do not have computational complexity even close to the subject algorithm. Depending on the classifier used in search for relevant features, it may take several hours for a wrapper method to analyze the same dataset with only 1000 features, and yet there is no guarantee that the optimal solution will be reached, due to heuristic search.
  • The kernel width σ and the regularization parameter λ are two input parameters of the algorithm. Alternatively, they can be estimated through cross validation on training data. It is well known that cross validation may produce an estimate with a large variance. Fortunately, this does not pose a serious concern for the subject algorithm.
  • In Figs. 5A-5E and 6A-6F, the feature weights learned with different kernel widths and regularization parameters are plotted. As shown by these plots, the algorithm performs well over a wide range of parameter values, yielding the largest weights for the first two relevant features, while the other weights are significantly smaller. This suggests that the algorithm's performance is largely insensitive to the specific choice of parameters σ and λ, which makes parameter tuning, and hence the implementation of the algorithm, easy, even for researchers outside of the machine learning community.
  • An operator T : U → Z is called a contraction operator if there exists a constant q ∈ [0,1) such that ||T(u_1) - T(u_2)|| ≤ q ||u_1 - u_2|| for all u_1, u_2 ∈ U.
  • Theorem 2 (Fixed Point Theorem). Let T be a contraction operator mapping a complete subset U of a normed space Z into itself. Then the sequence generated as u^{(t+1)} = T(u^{(t)}), t = 0, 1, 2, ..., converges to the unique fixed point u* of T.
  • the fixed point theorem is used to prove that the subject algorithm converges to a unique fixed point.
  • the gist of the proof is to identify a contraction operator for the algorithm, and make sure that the conditions of Theorem 2 are met.
  • the theorem ensures the convergence of the algorithm if the kernel width is properly selected. This is a very loose condition, as the empirical results show that the algorithm always converges for a sufficiently large kernel width (see Fig. 7). Also, the error bound in equation (14) indicates that the smaller the contraction number, the tighter the error bound and hence the faster the convergence rate. The experiments suggest that a larger kernel width yields a faster convergence. Unlike many other machine learning algorithms (e.g., neural networks), the convergence and the solution of the subject algorithm are not affected by the initial value if the kernel width is fixed.
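  • For reference, the a priori error bound supplied by the fixed point theorem, of which equation (14) appears to be an instance, can be written in LaTeX as

        \| \mathbf{w}^{(t)} - \mathbf{w}^{*} \| \le \frac{q^{t}}{1 - q} \, \| \mathbf{w}^{(1)} - \mathbf{w}^{(0)} \|, \qquad q \in [0, 1),

    so a smaller contraction number q gives a tighter bound and a faster convergence rate, consistent with the empirical observations above.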
  • Embodiments of the present invention can be used to perform feature selection for multiclass problems.
  • Some existing feature selection algorithms, originally designed for binary problems, can be naturally extended to multiclass settings, while for others the extension is not straightforward.
  • the extension largely depends on the capability of a classifier to handle multiclass problems.
  • In some approaches, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method, and then feature selection is performed for each binary problem.
  • This strategy further increases the computational burden of embedded and wrapper methods.
  • the algorithm according to the present invention is based on local learning, and hence does not suffer from this problem.
  • This section presents feature selection results using the subject algorithm on seven benchmark UCI datasets obtained from the University of California at Irvine Repository of Machine Learning Databases: banana, waveform, twonorm, thyroid, heart, diabetes, and splice. The information about each dataset is summarized in Table 1.
  • the set of original features is augmented by a total of 5000 artificially generated irrelevant features, which is indicated in the parentheses in Table 1.
  • the irrelevant features are independently sampled from a Gaussian distribution with zero-mean and unit-variance. It should be noted that some features in the original feature sets may be irrelevant or weakly relevant, and hence may receive zero weights in an embodiment of the subject algorithm. Since the true relevance of the original features is unknown, to verify that an embodiment of the subject algorithm does indeed discover all relevant features, the classification performance of SVM (with the Radial Basis Function (RBF) kernel) is compared in two cases: (1) when only the original features are used, and (2) when the features selected by the subject algorithm are used.
  • the structural parameters of SVM are estimated through ten-fold cross validation on training data.
  • The stopping criterion is set to 0.01.
  • the algorithm is run 10 times for each dataset. In each run, a dataset is randomly partitioned into training and test sets. The averaged classification errors and standard deviations (%) of SVM are reported in Table 2.
  • the false discovery rate represents the ratio between the number of artificially added, irrelevant features with non-zero weights and the total number of irrelevant features (i.e., 5000).
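  • In code, this false discovery rate reduces to a simple ratio; the sketch below assumes the learned weight vector stores the original features first and the 5000 artificial features last.

        import numpy as np

        def false_discovery_rate(w, n_orig, n_irrelevant=5000):
            # irrelevant features that survived with non-zero weight, over 5000
            return np.count_nonzero(w[n_orig:]) / n_irrelevant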
  • Feature weights, learned in one sample trial, for each of seven datasets, are shown in Fig. 8.
  • the dashed lines indicate the number of original features.
  • the weights plotted on the left side of the dashed line are associated with the original features, while those on the right are associated with the additional 5000 irrelevant features. From these experimental results, the following can be observed:
  • the subject algorithm is based on local learning, wherein no assumption is made about the underlying data distribution. Consequently, the features selected by the subject algorithm exhibit very good exportability.
  • KNN (K-Nearest Neighbor) and C4.5 are two of the most popular classifiers used in real applications.
  • the experimental protocol is exactly the same as before.
  • the averaged classification errors and standard deviations (%) of the two classifiers are reported in Tables 3 and 4, respectively.
  • For comparison, the classification results of KNN and C4.5 using all features (i.e., the original and 5000 irrelevant features) are also reported.
  • the classification errors obtained by using the selected features are similar, or even slightly better than those obtained by using the original feature sets. This validates not only the high accuracy of the subject algorithm in estimating the relevance of features, but also that this estimation is not based on criteria suited for any particular classifier.
  • Breast cancer is the second most common cause of death from cancer among women in the United States. In 2007, it is estimated that about 178,400 new cases of breast cancer will be diagnosed, and 40,400 women are expected to die from this disease (data from American Cancer Society, 2007).
  • a major clinical problem of breast cancer is the recurrence of therapeutically resistant disseminated disease.
  • Adjuvant therapy includes chemotherapy and hormonal therapy.
  • Being able to predict disease outcomes more accurately will help physicians make more informed decisions regarding the necessity of adjuvant treatment, and will lead to the development of individually tailored treatments with enhanced efficacy. Consequently, this would ultimately contribute to a decrease in overall breast cancer mortality, a reduction in overall health care costs, and an improvement in patients' quality of life by avoiding unnecessary treatments and their related toxic side effects.
  • Microarray technology, by monitoring the expression profiles of thousands of genes in a tissue simultaneously, has been shown to provide higher levels of accuracy than the current clinical systems.
  • 70-gene and 76-gene breast cancer prognostic signatures have been derived by van't Veer et al. (Nature 2002) and Wang et al. (Lancet 2005), respectively, achieving a much higher specificity (about 50%) than the current clinical systems at the same sensitivity level. These results are considered technological breakthroughs in breast cancer prognosis.
  • a large-scale clinical validation study involving thousands of breast cancer patients is currently being conducted in Europe to evaluate the prognostic value of the proposed signatures.
  • both gene signatures are derived based on some simple computational algorithms, which leave much room for improvement by using advanced machine learning algorithms, as demonstrated in this section.
  • The term "detector" refers to a component or a set of components that participates in the act of detecting an analyte.
  • a detector that is specific for a particular analyte target is a component or set of components that is specialized in its capability for detecting; in the environment where employed, the main or principal capability of such a detector is detecting the particular analyte target.
  • a calcium-specific hollow cathode lamp in an atomic absorption spectrometer is a detector specific for calcium; such a lamp produces specialized wavelengths of light that are principally useful in determining how much calcium is in a sample.
  • For example, a computerized analytical instrument system may contain a set of hardware and/or software components that: (i) displays a choice of analytes on a screen, (ii) registers a user choice of particular analyte "X", and (iii) in response, produces a signal that adapts the computerized analytical instrument system such that its main or principal capability is detecting analyte "X".
  • RNA and DNA oligonucleotides may be employed as detectors that are specific for their corresponding complementary sequences.
  • An antibody may be used to specifically detect protein(s) displaying a complementary epitope.
  • "Expression level" means an amount or concentration of gene product resulting from expression of a gene.
  • An abnormally high or abnormally low expression level of a gene product may be indicative in some cases of a state of disease or disorder, in this case breast cancer.
  • "Expression level" encompasses qualitative characterizations such as "presence" or "absence" of a gene product in a sample.
  • "Expression profile" means a set of one or more expression levels.
  • RNA comprising the entire transcriptome of a cell sample may then be labeled for detection and hybridized against the unique DNA features on the chip.
  • the amount of labeled RNA bound to a particular DNA feature on the chip then corresponds to the expression level of mRNA that is complementary to that DNA probe.
  • Affymetrix typically uses DNA 25-mers as immobilized probes.
  • Alternative expression-measurement technologies include cDNA microarrays, reverse transcription-polymerase chain reaction (RT-PCR), serial analysis of gene expression (SAGE), and branched nucleic acid methods such as Bayer's QuantiGene.
  • As used herein, "gene product" (or, equivalently, the "product of a gene") is intended to be understood as commonly used by skilled practitioners to refer to a biochemical product expressed from a gene.
  • For example, a gene may be transcribed to yield mRNA, and the mRNA may then be translated to give a polypeptide. Both the mRNA and the polypeptide are gene products of the gene.
  • a cDNA molecule arising from the reverse transcription of mRNA can also be considered a gene product within the context of this invention.
  • Some genes, for example rRNA genes and tRNA genes, may have an RNA gene product but not a polypeptide gene product.
  • The terms "isolated" or "biologically pure" refer to material that is substantially or essentially free from components which normally accompany the material as it is found in its native state.
  • a "good prognosis” is defined as a low likelihood/probability of cancer recurrence in a patient or as a high probability of having disease-free survival for a period of time post-treatment, typically 5 or 7 years.
  • a "bad prognosis” is defined as a high likelihood/probability of cancer recurrence in a patient or as a low probability of having disease-free survival for a period of time post-treatment, typically 5 or 7 years.
  • A set of four to five genes is identified, namely LOC58509, CEGP1, AL080059, ATP5E, and PRAME, that enables highly accurate predictions of cancer recurrence in breast cancer patients.
  • PRAME can be omitted as a prognostic indicator of breast cancer recurrence.
  • Any combination of two or more of the five genes may be used, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes.
  • LOC58509, CEGP1, AL080059, ATP5E, and PRAME may be used, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes.
  • EXEMPLARY EMBODIMENTS RELATING IN PART TO BREAST CANCER. The invention includes, but is not limited to, the following embodiments:
  • 1. A method of assigning a prognosis class to a patient comprising: a) obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises or consists of a gene expression profile for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and b) classifying the patient as belonging to a particular prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME, and wherein the prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence.
  • A method of assigning treatment to a breast cancer patient comprising: a) obtaining a biological sample from the breast cancer patient; b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and d) assigning treatment to the breast cancer patient based wholly or in part on the patient's prognosis class.
  • An article of manufacture comprising: a) individual means for detecting expression products of the CEGP1, AL080059, ATP5E and/or PRAME genes, wherein said means comprises, consists essentially of, or consists of polynucleotides that hybridize to said gene products; and b) a plurality of individual means for detecting the product(s) of a set of genes selected from LOC58509, CEGP1, AL080059, ATP5E and/or PRAME.
  • In a specific embodiment, each individual means for detecting is a polynucleotide probe or an antibody.
  • the probe may be immobilized on a solid support.
  • The solid support can comprise, consist essentially of, or consist of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63.
  • "Fragment" refers to a polynucleotide that is a consecutive span of nucleotides within the gene (e.g., LOC58509, CEGP1, AL080059, ATP5E and/or PRAME) that is smaller than the total length of the nucleotides encoding the gene (as exemplified by any of the attached sequences).
  • Any commercially available solid supports to which polynucleotides encoding LOC58509, CEGP1, AL080059, ATP5E and/or PRAME are immobilized can be used.
  • A kit comprising from one to five containers, each container comprising a buffer and at least one polynucleotide encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME, or fragments of said polynucleotide that hybridize with gene products of LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME.
  • the polynucleotide can comprise, consist essentially of or consist of 8.
  • The kit according to embodiment 5, wherein said kit contains fewer than five containers and each container comprises a buffer and two, three, four, or five polynucleotides encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME; polynucleotides that hybridize with LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME; or fragments of said polynucleotides that hybridize with gene products of LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME.
  • Another aspect of the invention provides a method for determining a prognosis in a subject, comprising detecting the presence of a combination of gene products (e.g., mRNA transcripts or cDNA molecules) originating from LOC58509, CEGP1, AL080059, ATP5E, and PRAME (within this application, the term "gene product(s)" may be substituted with "biomarker(s)"). These gene products are measured in a biological sample from the subject, wherein the presence of the gene product, or a level (e.g., concentration) of the gene product above or below a pre-determined threshold, is indicative of a particular prognosis. In some embodiments, all four biomarkers (LOC58509, CEGP1, AL080059 and ATP5E) are examined. In other embodiments, all five biomarkers (LOC58509, CEGP1, AL080059, ATP5E, and PRAME) are examined.
  • Detection can be quantitative, qualitative, or semi-quantitative.
  • the invention includes a method for prognostic evaluation of a subject having, or suspected of having, cancer, comprising: a) determining the level(s) of cancer biomarkers in a sample obtained from the subject; b) comparing the level(s) determined in step (a) to level(s) of the cancer biomarker(s) known to be present in samples obtained from previous cancer patients for whom outcomes are known; and c) determining the prognosis of the subject based on the comparison of step (b).
  • the methods, devices, and kits of the invention can be used for the analysis of cancer prognosis, disease progression and mortality.
  • increased levels or decreased levels of detected biomarker in a sample compared to a standard may be indicative of advanced disease stage, residual tumor, and/or increased risk of disease progression and mortality.
  • In specific embodiments, the methods, devices and kits detect gene products of LOC58509, CEGP1, AL080059, ATP5E and/or PRAME, or fragments thereof.
  • In certain embodiments, the methods utilize an expression profile based upon a four-gene or five-gene signature (e.g., (a) LOC58509, CEGP1, AL080059 and ATP5E, or (b) LOC58509, CEGP1, AL080059, ATP5E, and PRAME) as set forth in the following Table 5:
  • Table 5.
        Gene (ID)         Description                                     Expression
        CEGP1 (10643)     Homo sapiens CEGP1 protein (CEGP1), mRNA        Underexpressed
        PRAME (8776)      Preferentially expressed antigen in melanoma    Overexpressed
  • Nucleic acids, including naturally occurring nucleic acids, oligonucleotides, antisense oligonucleotides, and synthetic oligonucleotides, that hybridize to the nucleic acid biomarkers of the invention (e.g., that hybridize to polynucleotide gene products corresponding to LOC58509, CEGP1, AL080059, ATP5E, and PRAME) are useful as agents to detect the presence of biomarkers of the invention in the biological samples of cancer patients or those at risk of cancer.
  • Oligonucleotides comprising 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, or 80 consecutive nucleotides may be used.
  • Preferred oligonucleotides for use in detecting LOC58509, CEGP1, AL080059, ATP5E and PRAME in samples are those that range from 9 to 50 nucleotides.
  • a gene chip comprising or consisting of the various combinations of nucleic acids disclosed above (or herein) or nucleic acid sequences that hybridize to the disclosed nucleic acids.
  • Certain embodiments provide an array of 2000 or fewer nucleic acid sequences which contain the 4 gene combination or 5 gene combination (sets of genes) disclosed above.
  • the nucleic acids attached to the substrates of the gene chips can range from at least 8 consecutive nucleotides to the full length of any sequence disclosed herein.
  • Oligonucleotides of the prognostic signature genes described herein can also be attached to a substrate for the production of a gene chip.
  • In certain embodiments, a gene chip according to the invention comprises an array of 2000, 1500, 1000, 500, or 100 or fewer nucleic acid sequences which include those nucleic acids or oligonucleotides disclosed herein, are at least 8 consecutive nucleotides in length, and are no longer than the complete sequence for a given SEQ ID NO:.
  • In other embodiments, a gene chip array of this invention contains 4 or 5 of the disclosed nucleic acid/oligonucleotide sequences (or those that are fully complementary thereto) and no more than 2000, 1500, 1000, 500 or 100 discrete nucleic acid sequences.
  • An "array” represents an intentionally created collection of nucleic acid sequences or oligonucleotides that can be prepared either synthetically or biosynthetically.
  • the term "array " ' herein means an intentionally created collection of polynucleotides or oligonucleotides attached to at least a first surface of at least one solid support wherein the identity of each polynucleotide or oligonucleotide at a given predefined region is known.
  • the subject invention provides a gene chip comprising an array of 2000 or fewer nucleic acid sequences or oligonucleotides, provided that said array includes the four-gene or five-gene prognostic signature disclosed above and said nucleic acid sequences or oligonucleotides are at least 8 consecutive nucleotides in length.
  • the gene chips or arrays discussed herein can, in certain embodiments, contain anywhere from 8 to 50 (or more) consecutive nucleotides of the genes provided by the gene signature disclosed herein (or 8 to 50 or more consecutive nucleotides of a complementary sequence).
  • BREAST CANCER EXAMPLE: GENE SIGNATURE FOR PREDICTING RISK OF DISTANT RECURRENCE OF BREAST CANCER
  • LDA: linear discriminant analysis.
  • Fig. 9 depicts this experimental procedure.
  • the experimental protocol consists of inner and outer loops.
  • LOOCV is performed to estimate the optimal classification parameters based on the training data provided by the outer loop.
  • the held-out sample is classified by using the best parameters from the inner loop. The experiment is repeated until each sample has been used for testing.
  • the predictive values of the newly identified gene signature are demonstrated by comparing its performance with those of the St. Gallen criterion, the 70-gene signature of van't Veer et al. (Nature 2002), and the inventors' previously-derived hybrid signature which combines both clinical and genetic information (Sun et al., 2007a).
  • The results of van't Veer et al. (Nature 2002) are reproduced by closely following their experimental procedure, wherein the top 70 genes are first identified, and then their predictive value is assessed by using a correlation-based classifier.
  • The performance of the 70-gene signature is also evaluated through LOOCV, where the held-out testing sample is not involved in the training process.
  • Fig. 10 presents a comparison between the receiver operating characteristic (ROC) curves of the 70-gene signature, the hybrid signature, and the gene signature obtained according to an embodiment of the present invention.
  • a correlation based classifier is also applied to the gene signature obtained according to this embodiment.
  • the legend New-LDA refers to the prediction performance of the gene signature according to the subject embodiment using LDA.
  • the legend New-Corr refers to the prediction performance of the gene signature according to the subject embodiment using a correlation based classifier. It can be observed that the gene signature obtained according to this embodiment significantly outperforms the 70-gene signature and the hybrid signature.
  • The 70-gene signature and hybrid signature each significantly outperform the St. Gallen criterion, whereas the gene signature obtained according to the present invention improves the specificities of the 70-gene signature and St. Gallen criterion by about 40% and 70%, respectively. This appears to be the best prediction result for this dataset reported in the literature to date. Table 6 also presents the odds ratios of the approaches for developing distant metastases within five years between the patients with a good prognosis and the patients with a bad prognosis. It can be observed that the gene signature according to the present invention gives a much higher odds ratio (66.0, 95% confidence interval (CI): 18.0-242.0) than either the 70-gene or hybrid signature. The difference is more than one error bar, and hence is statistically significant.
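  • For concreteness, an odds ratio of this kind can be computed from a 2x2 contingency table as sketched below. The Wald-type 95% confidence interval on the log odds ratio is an assumed method; the patent does not state how the interval in Table 6 was obtained.

        import numpy as np

        def odds_ratio_ci(a, b, c, d):
            """a/b: bad-prognosis patients with/without distant metastases within
            five years; c/d: good-prognosis patients with/without."""
            oratio = (a * d) / (b * c)
            se = np.sqrt(1/a + 1/b + 1/c + 1/d)   # std. error of the log odds ratio
            lo, hi = np.exp(np.log(oratio) + np.array([-1.96, 1.96]) * se)
            return oratio, (lo, hi)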
  • Figs. 11A, 11B, 11C, and 11D show Kaplan-Meier estimations of the probabilities of remaining distant metastasis free in patients with a good or bad prognosis, determined by each approach.
  • The p-value is computed by use of a log-rank test.
  • The Mantel-Cox estimation of the hazard ratio of distant metastases within five years for the new signature is 21.4 (95% CI: 7.3-63.0, p-value < 0.001), which is much larger than the 6.0 (95% CI: 2.0-17.0) of the 70-gene signature and the 11.1 (95% CI: 3.9-31.5) of the hybrid signature.
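  • The survival comparisons behind these figures can be sketched as follows using the lifelines package; the variable names (follow-up times, event indicators, and the predicted prognosis mask) are placeholders for the study data.

        import numpy as np
        from lifelines import KaplanMeierFitter
        from lifelines.statistics import logrank_test

        def km_by_prognosis(times, events, good_mask):
            """times: follow-up in years; events: 1 if distant metastasis observed;
            good_mask: True for patients assigned to the good-prognosis class."""
            kmf = KaplanMeierFitter()
            for label, mask in [("good prognosis", good_mask),
                                ("bad prognosis", ~good_mask)]:
                kmf.fit(times[mask], event_observed=events[mask], label=label)
                print(label, "5-year metastasis-free probability:",
                      float(kmf.predict(5.0)))
            # log-rank test between the two prognosis groups
            res = logrank_test(times[good_mask], times[~good_mask],
                               event_observed_A=events[good_mask],
                               event_observed_B=events[~good_mask])
            print("log-rank p-value:", res.p_value)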
  • Each iteration in LOOCV may generate different gene signatures, since the training data is different (see Simon, R., "Roadmap for developing and validating therapeutically relevant genomic classifiers," J. Clin. Oncol., 23, 7332-7341, 2005).
  • the majority of LOOCV iterations identify the same gene signature that consists of only four gene markers, which are reported in Table 7.
  • A fifth gene, PRAME, is identified in a minority of iterations and is also included in the new signature.
  • the new gene signature is markedly shorter than the 70-gene signature.
  • Fig. 12 plots the feature weights learned on one iteration of LOOCV according to an embodiment of the present invention.
  • The genes presented along the x-axis are arranged based on the p-values of the t-test in decreasing order. For example, the first gene contains the most discriminant information according to the t-test. Some of the top-ranked features in the t-test are not selected in the gene signature by the subject algorithm. One possible explanation is that these excluded genes are redundant with respect to the identified gene signature.
  • CEGP1 and AL080059 are listed in the 70-gene prognosis signature.
  • No specific correlation of ATP5E subunit expression with cancer has been reported, but molecular defects that impinge on mitochondrial energy transduction do play a relevant role in the etiology of cancer by a number of mechanisms, including excessive reactive oxygen species production and metabolic stress-induced signaling that can enhance cellular invasive phenotypes (Amuthan et al., 2001). Accordingly, a number of metabolic markers, including Fl-ATP synthase subunits, have been suggested as potential prognostic indicators.
  • b-F1-ATPase: beta-catalytic subunit of the F1-ATP synthase
  • a proteomics examination of a large series of breast carcinomas showed that alteration of the mitochondrial proteome, and specifically of ATPase subunits, is a hallmark feature of breast cancer, and the expression level of b-F1-ATPase allowed the identification of a subgroup of breast cancer patients with significantly worse prognosis (Isidoro et al., 2005).
  • Information on the LOC58509 gene sequence is minimal to date.
  • the sequence was originally identified in a cDNA library from brain, and is also known as C19orf29 and NY-REN-24. The latter name was assigned to the sequence because it was also identified in cDNA expression libraries derived from human tumors screened with autologous antibodies derived from renal cancer patients (Scanlan et al., 1999). No functional analysis for this gene product is available.
  • TSPY-like 5 (TSPYL5)
  • NAPs act as molecular chaperones, shuttling histones from their site of synthesis in the cytoplasm to the nucleus.
  • Histone proteins are involved in regulating chromatin structure and accessibility and can therefore impact gene expression (Rodriguez et al., 1997); thus, a role in the tumor cell phenotype can be proposed.
  • the CEGP1 gene (also known as SCUBE2) is located on human chromosome 11p15 and has homology to the basic helix-loop-helix (bHLH) family of transcription factors.
  • the biological role of CEGP1/SCUBE2 is unknown, but the gene encodes a secreted cell-surface protein containing EGF and CUB domains (Yang et al., 2002).
  • the EGF motif is present in many extracellular proteins that play an important role during development, and the CUB domain is found in several proteins involved in the regulation of extracellular processes such as cell-cell communication and adhesion (Grimmond et al., 2000).
  • CEGP1/SCUBE2 has been reported to be associated with estrogen receptor status in breast cancer specimens (Abba et al., 2005). Furthermore, expression of CEGP1/SCUBE2 has been detected in vascular endothelium and so may play important roles in angiogenesis (Yang et al., 2002).
  • PRAME is a cancer-testis antigen (CTA), a group of tumor-associated antigens that represent possible target proteins for immuno-therapeutic approaches. Their expression is high in a variety of malignancies but is negligible in healthy tissues (Juretic et al., 2003). PRAME expression is evident in a large variety of cancer cells, including melanoma (Ikeda et al., 1997), squamous cell lung carcinoma, renal cell carcinoma, and acute leukemia (Matsushita et al., 2003).
  • Fig. 13 presents the CPU time it takes the subject feature selection algorithm to identify a gene signature for the breast cancer dataset with a varying number of genes, ranging from 500 to 24,481. It takes only about 22 seconds to process all 24,481 genes. If a filter method is first used to reduce the feature dimensionality to, say, 2000, as is almost always done in microarray data analysis, the subject algorithm runs for only about two seconds.
  • the subject algorithm runs several orders of magnitude faster than some state-of-the-art algorithms.
  • The RFE algorithm is a well-known feature selection method specifically designed for microarray data analysis. It has been reported that the RFE algorithm (with the linear kernel) takes about 3 hours to analyze 72 samples with 7129 features.
  • Another algorithm, referred to as Optimal Feature Weighting (OFW) (see Gadat et al., "A stochastic algorithm for feature selection in pattern recognition," J. Mach. Learn. Res., 8, 509-547, 2007), takes about one hour, with a C++ compiler, on a dataset with 72 samples and 3859 features.
  • the CPU times of both RFE and OFW are obtained on a leukemia dataset, which has characteristics similar to those of the breast cancer dataset.
  • Prostate cancer is the most common male cancer by incidence, and the second most common cause of male cancer death in the United States. In 2007, it is estimated that approximately 218,890 new cases will be diagnosed and 27,050 men will die from this disease (data from the National Cancer Institute). The mortality rate for prostate cancer is declining due to improvements in earlier detection and in local therapy strategies. However, the ability to predict the metastatic behavior of a patient's cancer, as well as to detect and eradicate disease recurrence, remains one of the greatest clinical challenges in oncology. It is estimated that 25-40% of men undergoing radical prostatectomy will have disease relapse. The advent of microarray gene expression technology has greatly enabled the search for predictive disease biomarkers. By monitoring the expression profiling of tens of thousands of genes in a tissue simultaneously, transcriptional profiling can provide a wealth of data for correlation with disease status.
  • nomogram refers to any prostate cancer prognosis predictor known in the art that accepts at least one input chosen from the set comprising prostate specific antigen level, prostate specific antigen doubling time, primary Gleason grade, secondary Gleason grade, sum of Gleason grades, clinical tumor stage according to a standardized clinical staging system, and the number of positive biopsy cores in conjunction with the number of negative biopsy cores.
  • Genes for which the expression level was predictive of recurrence or non-recurrence of prostate cancer have been identified and converted into a prognostic signature.
  • Statistical analysis demonstrates that a cancer prognostic signature can be limited to expression levels of a relatively small set of identified prognostic marker genes (PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, ZNF324B, and various combinations thereof).
  • the expression levels of these genes can be used as a predictor of cancer recurrence.
  • This prognostic signature can perform comparably to, or outperform, a commonly used prognostic tool, the postoperative nomogram.
  • a prognostic signature combining the nomogram with expression levels for a subset of the identified prognostic marker genes gives additional predictive accuracy.
  • the subject invention provides a method of providing a prognosis for a patient comprising obtaining (e.g., generating or receiving) or supplying data for expression level(s) for one or more genes selected from PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, ZNF324B, and various combinations thereof, and assigning a prognosis to said patient on the basis of the expression levels of said genes.
  • the expression, and level of expression, of the genes is used to assign a prognosis (e.g., a "good prognosis" or a "bad prognosis") to the patient.
  • Another aspect of the invention provides for methods of assigning a prognosis class to a patient comprising: obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises expression level(s) for one or more genes selected from PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and/or ZNF324B, or any combination thereof; and classifying the patient as belonging to a prognosis class (e.g., a "good prognosis" or a "bad prognosis") based upon expression levels for said genes, wherein a patient placed into a bad prognosis class is scheduled for increased surveillance for the recurrence of cancer.
  • the phrase "increased surveillance" is used herein to indicate that the patient is subjected to an increased frequency of testing for various cancer markers (e.g., body scans, increased frequency of prostate specific antigen testing, and/or increased frequency of digital rectal examination).
  • a further aspect of the invention provides for the use of a postoperative nomogram evaluation of the patient in combination with either of the methods discussed above.
  • When a postoperative nomogram, such as any one of those provided on the world wide web at www.mskcc.org/mskcc/html/10088.cfm, is utilized in the practice of this invention, one set of genes that can be used for determining a patient's prognostic status is:
  • the following set of genes can be used for the development of a patient's prognosis.
  • the subject application also provides the following method of assigning treatment to a prostate cancer patient comprising: a) assigning a prognosis class to the patient in accordance with any of the preceding embodiments; and b) providing treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
  • Nucleic acids, including naturally occurring nucleic acids, oligonucleotides, antisense oligonucleotides, and synthetic oligonucleotides, that hybridize to the nucleic acid biomarkers of the invention (e.g., that hybridize to polynucleotide gene products corresponding to any one of SEQ ID NOs: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16) are useful as agents to detect the presence of biomarkers of the invention in the biological samples of cancer patients or those at risk of cancer. The detection of these biomarkers can be performed according to methods well known in the art, such as those described in the examples.
  • oligonucleotides comprising 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, or 78 nucleotides are also provided.
  • Preferred oligonucleotides are those within the range of 9 to 50 nucleotides.
  • kits comprising from one to eleven containers, each container comprising a buffer and at least one polynucleotide selected from SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16, fragments of said polynucleotide, or polynucleotides that hybridize with SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16.
  • the kit may contain fewer than eleven containers, and each container may comprise a buffer and a combination of any two, three, four, five, six, seven, eight, nine, ten, or eleven polynucleotides as set forth in the Nucleotide Combinations table provided below.
  • the values set forth in the Nucleotide Combinations table correspond to the polynucleotides as identified by each respective SEQ ID NO: (see Table 9).
  • the number "1" in the Nucleotide Combinations table refers to the lowest SEQ ID NO appearing in Table 9 (i.e., "1" refers to SEQ ID NO: 6, corresponding to PAK3).
  • "2" refers to the second-lowest SEQ ID NO appearing in Table 9 (i.e., "2" refers to SEQ ID NO: 7, corresponding to RPL23), and so on.
  • the first numeral is the numeral appearing in the Nucleotide Combinations table and the second numeral is the prostate cancer prognosis-related SEQ ID NO that is thereby referenced: (1,6); (2,7); (3,8); (4,9); (5,10); (6,11); (7,12); (8,13); (9,14); (10,15); and (11,16).
  • gene chips or arrays comprising or consisting of the various combinations of nucleic acids or oligonucleotides disclosed in the Nucleotide Combinations table, or those nucleic acid or oligonucleotide sequences that hybridize to those nucleic acids.
  • Certain embodiments provide an array of 2000 or fewer nucleic acid sequences which contains the 5-gene combination or 11-gene combination (sets of genes) disclosed above.
  • the nucleic acids or oligonucleotides attached to the substrates of the gene chips can range from 8 consecutive nucleotides to the full length of any sequence disclosed herein.
  • a gene chip according to the invention comprises an array of 2000, 1500, 1000, 500, or 100 or fewer nucleic acid or oligonucleotide sequences which include those nucleic acids disclosed herein, are at least 8 consecutive nucleotides in length, and are no longer than the complete sequence for a given SEQ ID NO:.
  • a gene chip array of this invention contains at least 5, 6, 7, 8, 9, 10, or 11 of the disclosed nucleic acid sequences or oligonucleotides (or those that are fully complementary thereto) and no more than 2000, 1500, 1000, 500, or 100 discrete/different nucleic acid sequences.
  • an “array” represents an intentionally created collection of nucleic acid or oligonucleotide sequences that can be prepared either synthetically or biosynthetically.
  • the term "array" herein means an intentionally created collection of polynucleotides or oligonucleotides attached to at least a first surface of at least one solid support, wherein the identity of each polynucleotide at a given predefined region is known.
  • the subject invention provides a gene chip comprising an array of 2000 or fewer nucleic acid sequences, provided that said array includes the five-gene or eleven-gene prognostic signature disclosed above and said nucleic acid sequences are at least 8 consecutive nucleotides in length.
  • the gene chips or arrays discussed herein can, in certain embodiments, contain anywhere from 8 to 50 consecutive nucleotides of the genes provided by the prognostic signatures disclosed herein.
  • the dataset was built from tissue samples obtained from 79 patients with clinically localized prostate cancer treated by radical prostatectomy at MSKCC between 1993 and 1999. Thirty-nine cases had disease recurrence, as classified by 3 consecutive increases in the serum level of PSA after radical prostatectomy, and forty samples were classified as non-recurrent by virtue of maintaining an undetectable PSA (< 0.05 ng/mL) for at least 5 years after radical prostatectomy. No patient received any neoadjuvant or adjuvant therapy before documented disease recurrence. Samples were snap frozen, examined histologically, and enriched for neoplastic epithelium by macrodissection.
  • Gene expression analysis was carried out using the Affymetrix U133A human gene array which has 22,283 features for individual gene/EST clusters, as per manufacturer's instructions. Image processing was performed using Affymetrix Microarray Suite 5.0 to produce CEL files which were used directly in the present inventors' analyses.
  • genes to be incorporated into an MSKCC-based model were filtered using a variety of criteria that included a significant differential expression between the two classes (p-value < 0.001), a fold change > 1.3, and a "present" call in greater than 80% of the samples in either class.
  • Such filter methods with arbitrary cut-off thresholds may introduce bias. It is preferable to allow the computer to decide which genes are useful for prediction, without the use of any arbitrary pre-processing filters. Except for a simple re-scaling of the expression values of each gene to be between 0 and 1, no other preprocessing was performed.
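The rescaling step mentioned above is a simple per-gene min-max normalization. A minimal NumPy sketch follows; the function name is illustrative:

```python
import numpy as np

def rescale_genes(X):
    """Rescale each gene (column) of a samples-by-genes expression matrix
    to [0, 1]; the only preprocessing step applied in the text above."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant genes
    return (X - lo) / span
```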
  • the present inventors used a rigorous experimental protocol with the leave-one-out cross validation (LOOCV) method to estimate classifier parameters and prediction performance (Wessels et al. "A protocol for building and evaluating predictors of disease state based on microarray data," Bioinformatics, Vol. 21, pp. 3755-62, 2005), as depicted in Figure 9.
  • the experimental protocol consists of inner and outer loops. In the inner loop, LOOCV is performed to estimate the optimal classifier parameters based on the training data provided by the outer loop, and in the outer loop, a held-out sample is classified using the best parameters from the inner loop. The experiment is repeated until each sample has been tested.
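A sketch of this nested (inner/outer) LOOCV protocol using scikit-learn is given below. The classifier and parameter grid are illustrative stand-ins; in the actual protocol the inner loop would also tune the feature selection parameters:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def nested_loocv(X, y, param_grid):
    """Outer loop: hold out one sample. Inner loop: LOOCV on the remaining
    training data to pick parameters. The held-out sample is then classified
    with the best inner-loop parameters, so every prediction is on unseen data."""
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        inner = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                             cv=LeaveOneOut())
        inner.fit(X[train_idx], y[train_idx])
        preds[test_idx] = inner.predict(X[test_idx])
    return preds

# e.g.: preds = nested_loocv(X, y, {"solver": ["svd", "lsqr"]})
```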
  • the classification parameters that need to be specified in the inner loop include the kernel width and sparsity parameter of the feature selection algorithm, as well as the structural parameters of a classifier, which leads to a multi-dimensional parameter search.
  • LDA: linear discriminant analysis
  • the present inventors predefined the kernel width as 5, and estimated the sparsity parameter through LOOCV in the inner loop. In simulations, the present inventors found that the choice of the kernel width is not critical, and the algorithm performs similarly for a large range of values for this parameter.
  • a receiver operating characteristic curve obtained by varying a decision threshold provides a direct view of how a predictive approach performs at different sensitivity and specificity levels.
  • the specificity is defined as the probability that a patient who did not experience disease recurrence was assigned to the good-prognosis group, and the sensitivity is the probability that a patient who developed disease recurrence was assigned to the bad-prognosis group.
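These definitions translate directly into code. A short sketch (scikit-learn assumed; the arrays shown are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def sensitivity_specificity(y_true, y_pred):
    """y == 1: disease recurrence (bad prognosis); y == 0: no recurrence."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sens = np.mean(y_pred[y_true == 1] == 1)  # recurrent assigned to bad group
    spec = np.mean(y_pred[y_true == 0] == 0)  # non-recurrent assigned to good group
    return sens, spec

y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6])  # illustrative decision scores
print(sensitivity_specificity(y_true, scores > 0.5))
fpr, tpr, thr = roc_curve(y_true, scores)  # ROC traced by varying the threshold
print("AUC:", roc_auc_score(y_true, scores))
```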
  • AUC: area under a ROC curve
  • MedCalc version 8.0 (MedCalc Software, Mariakerke, Belgium)
  • the present inventors developed two computational models to predict the biochemical recurrence of prostate cancer.
  • the first model is based exclusively on gene expression data obtained from tissue samples, and the second combines the predictive information of both genetic and clinical variables.
  • the clinical variable used was the 7-year probability of disease recurrence estimated by the postoperative nomogram.
  • a hybrid signature derived by combining the gene expression data with clinical information outperformed both the nomogram and the genetic signature.
  • the hybrid signature improved the specificities of the genetic model and nomogram by about 10% and 20%, respectively. It correctly classified 74 out of 79 samples (94%), including 38 non-recurrent and 36 recurrent tumors ( Figure 14).
  • Statistical analysis of the ROC curves revealed the predictive accuracy of the hybrid signature to be significantly superior to that of the postoperative nomogram (p-value < 0.0001) and the gene-expression model (p-value < 0.05).
  • Survival data analyses were performed (see Figs. 15A-C).
  • the Mantel-Cox estimate of the hazard ratio of biochemical recurrence of prostate cancer within five years for the hybrid model is 29.1 (95% CI: 8.3 - 102.1), which is much larger than those of either the nomogram (11.9, 95% CI: 3.8 - 36.9) or the genetic model (18.0, 95% CI: 5.9 - 54.5).
  • each iteration in LOOCV may generate a different prognostic signature since the training data used is different (Simon, 2005, J. Clin. Oncol., Vol. 23, pp. 7332-41).
  • 5-, 6-, 7-, and 8-gene models were developed in 7, 43, 24, and 5 iterations, respectively.
  • the 79 total iterations correspond to one iteration for each of the 79 patients in the dataset, as dictated by the LOOCV method.
  • a total of 11 unique genes were identified in the consensus genetic prognostic signature (Table 9). The mean expression of each gene in the 79 tumor samples obtained from patients with, and without, disease recurrence was visualized by creating individual scatter plots.
  • the present application provides a genetic signature that predicts disease recurrence after radical prostatectomy with 87% overall accuracy. Furthermore, a hybrid signature derived by combining the gene expression data with the 7-year PFP score outperformed both the nomogram and the genetic signature, correctly classifying 74 out of 79 samples. Statistical analyses also clearly demonstrated the superiority of the hybrid signature over a prognostic system that uses only genetic or clinical markers. Though the nomogram performs very well when the estimated 7-year progression-free probability is larger than 90%, it assigns a significant number of non-recurrent patients to the bad-prognosis group. It is evident in Figures 14, 15, and 16 that microarray data provide additional information to stratify these patients.
  • RPL23 is a member of the ribosomal protein family that acts to stabilize rRNA structure, regulate catalytic function, and integrate translation with other cellular processes, but recent studies have shown that many ribosomal proteins have extra- ribosomal cellular functions independent of protein biosynthesis.
  • EI24/PIG8 is localized in the endoplasmic reticulum (ER), and by virtue of its binding Bcl-2, has been linked with the modulation of apoptosis.
  • PAK3 is a Group I member of the p21-activated kinase (Pak) family of serine/threonine protein kinases that bind to and modulate the activity of the small GTPases Cdc42 and Rac.
  • Pak: p21-activated kinase
  • GTPase signaling controls many aspects of the cellular response to the environment, and through these interactions, Paks have been shown to be involved in the regulation of cellular processes such as gene transcription, cell morphology, motility, and apoptosis.
  • Each method herein optionally includes taking a biological sample from the patient and optionally includes analyzing the sample using an article of manufacture or kit of the invention.
  • SEQ ID NO: 2 (CEGP1, presented here as sequence for SCUBE2)
  • SEQ ID NO: 4 (ATP5E, presented here as sequence for ATP5E)
  • SEQ ID NO: 5 (PRAME, presented here as sequence for PRAME)
  • ATCTCAGCAA ACAGGAGACT ACAGGGGACT GGGGATCAGG GTGTGGCCTG TGAGTGTCAG

Abstract

A method is provided that addresses the feature selection problem in the presence of copious irrelevant features. According to this method, feature selection can be accomplished by decomposing a given complex problem into a set of locally linear problems through local learning, and estimating the relevance of features globally within a large margin framework. Local learning allows one to capture the local structure of the data, while the global parameter estimation within a large margin framework allows one to avoid possible overfitting. This method addresses many major issues of the prior art, including their problems with computational complexity, solution accuracy, algorithm implementation, exportability of selected features, and extension to multiclass settings. Using the method, a small number of genes useful for predicting the occurrence of distant metastases in breast cancer patients were identified: LOC58509, CEGP1, AL080059, ATP5E, and PRAME. Also using the method, prostate cancer prognostic signatures based on gene expression alone and on gene expression in combination with a postoperative nomogram were derived. Genes determined to be particularly relevant to prostate cancer prognosis include PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B.

Description

DESCRIPTION
METHODS OF FEATURE SELECTION THROUGH LOCAL LEARNING; BREAST AND PROSTATE CANCER PROGNOSTIC MARKERS
GOVERNMENT SUPPORT
The subject matter of portions of this application has been supported by a research grant from the National Institutes of Health under grant number RO1CA108597-01. Accordingly, the government has certain rights in this invention.
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. 60/989,592 filed on November 21, 2007; U.S. 61/040,232 filed on March 28, 2008; and U.S. 61/040,237 filed on March 28, 2008; each of these applications is incorporated by reference in its entirety (including all tables, sequence listings, and other associated data).
BACKGROUND OF INVENTION
Machine learning involves the design and development of algorithms and techniques that allow computers to "learn." The algorithms are designed to be able to improve automatically through experience. Machine learning research has been focused on the ability to extract information from data automatically by computational and statistical methods. Machine learning applications include, but are not limited to, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, cheminformatics, credit card fraud detection, stock market analysis, DNA sequence classifications, speech and handwriting recognition, object recognition in computer vision, game playing, and robot locomotion.
Feature selection is a fundamental problem in machine learning. With the advent of high throughput technologies, feature selection has become increasingly important in a wide range of scientific disciplines. The goal of feature selection is to extract the most relevant information about each observed datum from a potentially overwhelming quantity of its features. Here, relevant information means those features whose discriminative properties facilitate the underlying data analysis. A typical example where feature selection plays a critical role is the use of oligonucleotide microarrays for the identification of cancer-associated gene expression profiles of diagnostic or prognostic value, such as discussed in "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring" by Golub et al., Science, 286, 531-537 (1999), "Gene expression profiling predicts clinical outcome of breast cancer" by van't Veer et al., Nature, 415, 530-536 (2002), and "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer" by Wang et al., Lancet, 365, 671-679 (2005). Typically, the number of features (genes) associated with the raw data is on the order of thousands or even tens of thousands. Amongst this enormous number of genes, only a small fraction is likely to be relevant for cancerous tumor growth and/or spread. The abundance of irrelevant features poses serious problems for existing machine learning algorithms, and represents one of the most recalcitrant problems for their application in oncology and other scientific disciplines dealing with copious features.
The performance of most data-analysis algorithms suffers as the number of features becomes excessively large. This is typically due to the requirement that a training dataset used for estimating the algorithm parameters should increase in size linearly or even exponentially with the growing number of features. In cases where the training set cannot be extended by new experiments, the algorithm may confuse important properties of the data with those associated with a prominent presence of irrelevant features - a phenomenon called the curse of dimensionality in machine learning. It has been recently observed, for example, that even the support vector machine (SVM) - one of the most advanced classifiers, believed to scale well with the increasing number of features - experiences a notable drop in accuracy when this number becomes sufficiently large. In addition to defying the curse of dimensionality, eliminating irrelevant features can also be used to reduce system complexity, processing time of data analysis, and the cost of collecting irrelevant features. In many cases, feature selection can also provide significant insights into the nature of the problem under investigation. In oncology, for example, the identification of relevant genes can help advance the understanding of biological mechanisms underlying tumor behavior.
The problem of feature selection has plagued the scientific community for decades. Existing algorithms are traditionally categorized as wrapper or filter methods, with respect to criteria used for searching relevant features. In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset. In filter methods, criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., t-test), rather than optimizing the performance of any specified learning algorithm directly. Hence, filter methods are computationally much more efficient, but usually do not perform as well as wrapper methods.
However, there are a number of problems with wrapper methods. One major issue with wrapper methods is their high computational complexity. Many heuristic algorithms have been proposed to alleviate this computational issue. In the presence of tens of thousands of features, a hybrid approach is usually adopted, wherein the number of features is first reduced by using a filter method, and then a wrapper method is used on the reduced feature set. Nevertheless, it still may take several hours to perform the search, depending on the classifier used in the wrapper method. To reduce complexity, in practice, a simple classifier (e.g., a linear classifier) is often used to perform feature selection, and the selected features are then fed into a more complicated classifier in the subsequent data analysis. This gives rise to the issue of feature exportability - in some cases, a feature subset that is optimal for one classifier may not work well for others.
Another issue associated with a wrapper method is the capability to perform feature selection for multiclass problems. To a large extent, this property depends on the capability of a classifier used in a wrapper method to handle multiclass problems. Some classifiers originally designed for binary problems can be naturally extended to multiclass settings, while for others the extension is not straightforward. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method, such as described in "Solving multiclass learning problems via error-correcting output codes," by Dietterich et al., J. Artif. Intell. Res., 2, 263-286 (1995), and "Unifying error-correcting and output-code AdaBoost through the margin concept," by Sun et al., Proc. 22nd Int. Conf. Mach. Learn., 872-879 (2005). Then, feature selection is performed for each binary problem. This strategy further increases the computational burden of a wrapper method.
One issue that is rarely addressed in the literature is algorithmic implementation. Many wrapper methods require training a large number of classifiers and manually specifying many parameters, which makes their implementation and use rather complicated, demanding expertise in machine learning.
To address the aforementioned issues, embedded methods, such as those described in "Embedded methods" by Lal et al. from Feature Extraction, Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Eds., Springer-Verlag, 137-165 (2006), have recently received increased interest. Embedded methods incorporate feature selection into the learning process of a classifier. A feature weighting strategy is usually adopted that uses real-valued numbers, instead of binary ones, to indicate the relevance of features in a learning process. These strategies do have some advantages. For example, there is no need to pre-specify the number of relevant features. Also, standard optimization techniques, such as gradient descent, can be used to avoid a combinatorial search. Hence, embedded methods are usually computationally more tractable than wrapper methods. Still, computational complexity is a major issue when the number of features becomes excessively large. For example, for the well-known recursive feature elimination (RFE) algorithm with the linear kernel, it takes about 3 hours to analyze 72 samples with 7129 features (see Guyon et al., "Gene selection for cancer classification using support vector machines," Mach. Learn., 46, 389-422 (2002)). Other issues, such as algorithm implementation and extension to multiclass problems, also remain.
BRIEF SUMMARY
The present invention provides solutions for large-scale feature selection problems for scientific and industrial applications. The systems and methods of the subject invention address and/or substantially obviate one or more problems, limitations, and/or disadvantages of the prior art.
Advantageously, in one embodiment, the present invention provides a method for feature selection incorporating decomposing a complex non-linear problem into a set of locally linear problems using local learning, and estimating feature relevance globally within a large margin framework. According to the present invention, the local learning allows one to capture the local structure of the data, while the parameter estimation is performed globally to avoid possible overfitting.
In one embodiment of the subject invention, there is provided a method capable of handling an extremely large number of features.
In one embodiment of the subject invention, there is provided a method capable of handling arbitrarily complex, nonlinear problems. In another embodiment of the subject invention, there is provided a method of feature selection that addresses problems of computational complexity, solution accuracy, and esoteric implementation.
In another aspect of the present invention, the methods for feature selection can be used to predict metastatic behavior in medical applications such as the behavior of tumors in oncology.
In yet another aspect of the present invention, the methods for feature selection can be used for pattern classification and computer vision in computer science, and for target recognition and signal classification in electrical engineering.
One aspect of the invention concerns a method of providing a prognosis, comprising obtaining (e.g., generating or supplying) data wherein the data comprises gene expression levels for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME. In certain embodiments, the prognosis of the patient is obtained by obtaining or generating expression profiles for LOC58509, CEGP1, AL080059, ATP5E, and PRAME.
In another aspect, the invention concerns a method of assigning a prognosis class to a patient, comprising: (a) receiving gene expression data or generating gene expression data, wherein the data comprises a gene expression profile for a plurality of genes selected from LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and (b) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME, and wherein the prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence.
Another aspect of the invention concerns a method of assigning treatment to a breast cancer patient, comprising: (a) obtaining a biological sample of the breast cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes selected from LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; (c) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and (d) assigning, recommending, or providing treatment to the breast cancer patient based wholly or in part on the patient's particular prognosis class. Another aspect of the invention is an article comprising a plurality of means for detecting the expression of a gene, wherein the individual means for detecting are each directed to detection of the product(s) of a particular gene, and the plurality of means for detecting detects expression of a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or subcombinations thereof. Optionally excluded from such articles of manufacture are commercially available gene chips or other solid supports that contain (comprise) LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or subcombinations thereof. Preferably, each said individual means for detecting is a polynucleotide probe, an antibody, a set of one or more components capable of adapting the article to implementation of a demonstrated specific method of analysis, a set of one or more components capable of adapting the article to implementation of a validated specific method of analysis, or a set of one or more components capable of adapting the article to implementation of an approved specific method of analysis.
The invention further relates in part to prognostic markers for prostate cancer. One aspect of the invention includes prognostic signatures shown to perform comparably to and/or to outperform a common clinically-used postoperative nomogram (such as those provided on the world wide web at www.mskcc.org/mskcc/html/10088.cfm). A prognostic signature based purely on gene expression information and a hybrid prognostic signature based on both clinical data and gene expression information are shown to be useful for determining accurate prostate cancer prognoses. Genes and/or sequences shown to be useful prognostic markers (prognostic signatures) include, without limitation, PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, or various combinations thereof.
Thus, one aspect of the invention concerns a method of assisting in developing a prognosis for a patient comprising obtaining (e.g., generating or receiving) or supplying data for a plurality of genes selected from one or more of the following genes: PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and/or ZNF324B, or various combinations thereof, and assigning a prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) to the patient on the basis of the gene expression. The prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence. For patients predicted to have a high risk of cancer recurrence (e.g., a "bad prognosis"), increased frequency of postoperative surveillance for the recurrence of cancer is performed. For patients with a low risk of cancer recurrence, standard surveillance for the recurrence of cancer can be performed.
Another aspect of the invention concerns a method of assigning a prognosis class to a patient comprising: (a) obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises both a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, and a postoperative nomogram evaluation; and (b) classifying the patient as belonging to a prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) based upon both the nomogram evaluation and the gene expression profile, wherein the prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence.
Another aspect of the invention concerns a method of assigning treatment to a prostate cancer patient comprising: (a) obtaining a biological sample from the prostate cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B; (c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile, wherein the subset comprises gene expression levels for said plurality of genes; and (d) assigning treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
Yet another aspect of the invention concerns a method of assigning treatment to a prostate cancer patient comprising: (a) obtaining a biological sample from the prostate cancer patient; (b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B; (c) performing upon the patient a postoperative nomogram evaluation; (d) classifying the patient as belonging to a prognosis class based upon: (1) a subset of the gene expression profile, wherein the subset comprises gene expression levels for the plurality of genes, and (2) the result of the postoperative nomogram evaluation; and (e) assigning treatment to the prostate cancer patient based wholly or in part on the patient's said particular prognosis class.
BRIEF DESCRIPTION OF DRAWINGS
Figures 1A and 1B show Fermat's spiral problem. Figure 1A shows a two-dimensional spiral-shaped distribution of samples belonging to two classes; Figure 1B shows a local learning transformation of the spiral problem into a set of locally linear problems according to an embodiment of the present invention.
Figure 2 shows iteratively refined estimates of weight vector and probabilities until convergence according to an embodiment of the present invention.
Figures 3A-3H show feature weights learned on the spiral dataset with different numbers of irrelevant features, ranging from 50 to 30,000, according to an embodiment of the present invention.
Figure 4 shows run times of a feature selection process used on the spiral dataset according to embodiments of the present invention.
Figures 5A-5E show feature weights learned on the spiral dataset with different regularization parameters according to embodiments of the present invention.
Figures 6A-6F show feature weights learned on the spiral dataset with different kernel widths according to embodiments of the present invention.
Figure 7 shows a plot for convergence analysis of embodiments of the present invention.
Figures 8A-8G show feature weights learned using an embodiment of the present invention on seven UCI datasets.
Figure 9 illustrates the experimental protocol. The experimental protocol consists of inner and outer loops. In the inner loop, LOOCV is performed to estimate the optimal classifier parameters based on the training data provided by the outer loop, and in the outer loop, a held-out sample is classified using the best parameters from the inner loop. The experiment is repeated until each sample has been tested.
Figure 10 provides receiver operating characteristic (ROC) curves of four breast cancer prognostic signatures. The legends New-LDA and New-Corr refer to the prediction results of the gene signature identified according to an embodiment of the present invention, as obtained by using LDA and a correlation-based classifier, respectively.
Figures 11A-11D show Kaplan-Meier estimations of the probabilities of remaining distant metastases free in patients with a good or bad breast cancer prognosis, determined by the new signature according to an embodiment of the invention (11A), St. Gallen criterion (11B), 70-gene signature (11C), and hybrid signature (11D). The p-value is computed by the use of the log-rank test.
Figure 12 shows feature weights of genes identified according to embodiments of the present invention.
Figure 13 shows run time of a feature selection process used to identify a gene signature for a breast cancer dataset according to embodiments of the present invention.
Figure 14 presents receiver operating characteristic (ROC) curves comparing the prediction performance of the nomogram, genetic, and hybrid models for prostate cancer prognosis.
Figures 15A-C show Kaplan-Meier estimation of the probabilities of remaining biochemical recurrence free for patients with a good or bad prostate cancer prognosis, determined by using the hybrid (15A), nomogram (15B), and microarray (15C) models. The p-value is computed in each case by use of the log-rank test.
Figures 16A-K show scatter plots of prostate cancer gene markers demonstrating clear up- or down-regulation between patients with and without biochemical recurrence.
DETAILED DISCLOSURE
The terms "comprising", "consisting of", and "consisting essentially of" are defined herein according to their standard meaning. The terms may be substituted for one another throughout the instant application in order to attach the specific meaning associated with each term.
The present invention provides methods for feature selection in problems across many disciplines dealing with copious features, including but not limited to bioinformatics, economics, and computer vision. A feature selection algorithm is provided that can be used in embodiments of the present invention. The feature selection algorithm can be used to address issues with prior work, including the problems with computational complexity, solution accuracy, esoteric implementation, capability of handling an extremely large number of features, exportability of selected features, and extension to multiclass problems.
According to the present invention, an arbitrarily complex, nonlinear problem can be decomposed into a set of locally linear problems through local learning, and the relevance of features can be estimated globally within a large margin framework. Local learning allows one to capture local structure of the data, while the global parameter estimation allows one to avoid, or reduce, possible overfitting.
The local learning relates to computational systems. In the computational systems, a model learns by slowly changing a set of parameters, called weights, during presentations of input data. The success of learning is measured by an error function. Learning is completed when the weights are sufficiently close to the values that minimize the error function. Under the best circumstances, the weights become successively closer to the optimal weights with additional training. This behavior can also be considered as convergence.
The local learning equations slowly change the weights until the error function is minimized. The ability to perform local learning allows for use of local knowledge that can then be affected by global information.
Through linearization, the long-standing feature selection problem can be easily solved by using machine learning and numerical analysis techniques. A detailed formulation of the feature selection algorithm is also provided.
In comparison with other feature selection algorithms, embodiments of the present invention have many advantages. The algorithm of the present invention is capable of processing many thousands of features within a few minutes on a personal computer, while maintaining a close-to-optimum accuracy that is nearly insensitive to a growing number of irrelevant features. Due to local learning, wherein no assumption is made about data distributions, selected feature subsets exhibit excellent exportability to different classifiers. Unlike most prior work, whose implementation demands expertise in machine learning, embodiments of the present invention are very easy to implement. The subject algorithm can be coded in Matlab™ with less than one hundred lines. The subject algorithm can have two user-defined parameters that can alternatively be estimated through cross validation. According to an exemplary embodiment, the two user-defined parameters are the kernel width and the regularization parameter. The performance of embodiments of the subject algorithm is largely insensitive to a wide range of values for these two parameters, which makes parameter tuning, and hence the implementation of the algorithm, easy in real applications.
For clarity, binary problems are being addressed in this embodiment, but embodiments are not limited thereto. In particular, multiclass problems can be addressed through a generalization of the algorithm described in this embodiment, to be described later. To begin, a training dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \subset \mathbb{R}^J \times \{\pm 1\}$ is supposed, consisting of $N$ samples each represented by $J$ features, where $\mathbf{x}_n$ is the $n$-th data sample and $y_n$ is its corresponding class label.
Then, the margin is defined for the training dataset. Given a particular distance function, the two nearest neighbors of each sample $\mathbf{x}_n$, one from the same class (called nearest hit or NH), and the other from the different class (called nearest miss or NM), can be found (see "A practical approach to feature selection" by Kira et al., Proc. 9th Int. Conf. Mach. Learn., 249-256 (1992)). The margin of $\mathbf{x}_n$ is then defined as $\rho_n = d(\mathbf{x}_n, \mathrm{NM}(\mathbf{x}_n)) - d(\mathbf{x}_n, \mathrm{NH}(\mathbf{x}_n))$, where $d(\cdot)$ is the distance function. Although block distance is being used here to define a sample's margin and nearest neighbors, other standard definitions may also be used. An intuitive interpretation of this margin is a measure as to how much the features of $\mathbf{x}_n$ can be corrupted by noise (or how much $\mathbf{x}_n$ can "move" in the feature space) before being misclassified. By the large margin theorem described in "Statistical Learning Theory" by Vapnik, New York: Wiley (1998), a classifier that maximizes a margin-based criterion function usually generalizes well on unseen test data, or is robust against noise. According to the present invention, each feature is scaled, and thus a weighted feature space can be obtained, parameterized by a nonnegative vector $\mathbf{w}$, so that a margin-based criterion function in the induced feature space is maximized. The parameterization can be a linking of the scaled features to a feature weight vector. The margin of $\mathbf{x}_n$, computed with respect to $\mathbf{w}$, is given by:

$$\rho_n(\mathbf{w}) = d(\mathbf{x}_n, \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}) - d(\mathbf{x}_n, \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}). \quad (1)$$

By defining $\mathbf{z}_n = |\mathbf{x}_n - \mathrm{NM}(\mathbf{x}_n)| - |\mathbf{x}_n - \mathrm{NH}(\mathbf{x}_n)|$, where $|\cdot|$ is an element-wise absolute operator, $\rho_n(\mathbf{w})$ can be simplified as:

$$\rho_n(\mathbf{w}) = \mathbf{w}^T \mathbf{z}_n. \quad (2)$$
By construction, the magnitude of each element of $\mathbf{w}$ in the above margin definition reflects the relevance of the corresponding feature in a learning process. Note that the margin thus defined requires only information about the neighborhood of $\mathbf{x}_n$, while no assumption is made about the underlying data distribution. This means that by local learning, an arbitrary nonlinear problem can be transformed into a set of locally linear problems. To demonstrate this property, the well-known Fermat's spiral problem is considered, as illustrated in Fig. 1A. Referring to Fig. 1A, samples belonging to two classes are distributed in a two-dimensional space, forming a spiral shape. Local learning transforms the nonlinear spiral problem into a set of locally linear problems. By projecting the transformed data $\mathbf{z}_n$ onto the feature weight vector $\mathbf{w}$, depicted in Fig. 1B, it can be seen that most samples have positive margins.
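As a concrete illustration of the margin of equations (1)-(2), the following NumPy sketch computes each sample's block-distance (L1) margin with uniform feature weights; the names are illustrative:

```python
import numpy as np

def block_margins(X, y):
    """Margin of each sample: L1 distance to its nearest miss minus
    L1 distance to its nearest hit (uniform weights w = 1)."""
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)  # pairwise L1 distances
    np.fill_diagonal(D, np.inf)                        # a sample is not its own hit
    rho = np.empty(len(y))
    for n in range(len(y)):
        rho[n] = D[n, y != y[n]].min() - D[n, y == y[n]].min()
    return rho  # positive margin: nearest hit is closer than nearest miss
```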
The local linearization of a nonlinear problem allows avoidance of the computational difficulties of prior work. It also facilitates the mathematical analysis of the algorithm. However, a problem with the above margin definition is that the nearest neighbors of a given sample are unknown before learning. In the presence of thousands of irrelevant features, the nearest neighbors defined in the original space can be completely different from those in the induced space, as demonstrated in Fig. 2. To account for the uncertainty in defining local information, a probabilistic model is provided where the nearest neighbors of a given sample are treated as latent variables. Following the principles of the expectation-maximization algorithm described in "Maximum likelihood from incomplete data via the EM algorithm (with discussion)" by Dempster et al., J. R. Stat. Soc. Ser. B, 39, 1-38 (1977), the margin can be estimated by taking the expectation of $\rho_n(\mathbf{w})$, averaging out the latent variables:

$$\bar{\rho}_n(\mathbf{w}) = \mathbf{w}^T \left( \sum_{i \in \mathcal{M}_n} P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}) \, |\mathbf{x}_n - \mathbf{x}_i| - \sum_{i \in \mathcal{H}_n} P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}) \, |\mathbf{x}_n - \mathbf{x}_i| \right) = \mathbf{w}^T \bar{\mathbf{z}}_n, \quad (3)$$

where $\mathcal{M}_n = \{i : 1 \le i \le N, y_i \ne y_n\}$, $\mathcal{H}_n = \{i : 1 \le i \le N, y_i = y_n, i \ne n\}$, and $P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w})$ and $P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w})$ are the probabilities that sample $\mathbf{x}_i$ is the nearest miss or hit of $\mathbf{x}_n$, respectively. These probabilities are estimated through the standard kernel density estimation method:
$$P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}) = \frac{k(\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}})}{\sum_{j \in \mathcal{M}_n} k(\|\mathbf{x}_n - \mathbf{x}_j\|_{\mathbf{w}})}, \quad \forall i \in \mathcal{M}_n, \quad (4)$$

and

$$P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}) = \frac{k(\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}})}{\sum_{j \in \mathcal{H}_n} k(\|\mathbf{x}_n - \mathbf{x}_j\|_{\mathbf{w}})}, \quad \forall i \in \mathcal{H}_n, \quad (5)$$

where $k(\cdot)$ is a kernel function. Specifically, the exponential kernel $k(d) = \exp(-d/\sigma)$ is used, where the kernel width $\sigma$ is an input parameter. Other kernel functions can also be used, and the descriptions of their properties can be found in "Locally weighted learning" by Atkeson et al., Artif. Intell. Rev., 11, 11-73 (1997), which is hereby incorporated by reference in its entirety.
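A compact sketch of equations (3)-(5) follows, computing the soft nearest-miss/nearest-hit probabilities under the current weights and the expected vectors z̄_n. It is a direct, unoptimized translation; the N x N x J tensor is suitable for small datasets only:

```python
import numpy as np

def expected_z(X, y, w, sigma):
    """Eqs. (4)-(5) with the exponential kernel, then the expected
    difference vectors of eq. (3): zbar[n] = E|x_n - NM| - E|x_n - NH|."""
    N, J = X.shape
    diff = np.abs(X[:, None, :] - X[None, :, :])  # |x_n - x_i|, shape N x N x J
    K = np.exp(-(diff * w).sum(-1) / sigma)       # kernel of weighted L1 distances
    np.fill_diagonal(K, 0.0)                      # a sample is not its own hit
    zbar = np.empty((N, J))
    for n in range(N):
        miss = y != y[n]
        hit = (y == y[n]) & (np.arange(N) != n)
        p_miss = K[n, miss] / K[n, miss].sum()    # eq. (4)
        p_hit = K[n, hit] / K[n, hit].sum()       # eq. (5)
        zbar[n] = p_miss @ diff[n, miss] - p_hit @ diff[n, hit]
    return zbar
```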
After the margins are defined, the problem of learning feature weights can be directly solved within a large margin framework. Two of the most popular margin formulations are SVM (see "Statistical Learning Theory" by Vapnik (Wiley, New York, 1998)) and logistic regression (see "Pattern Recognition and Machine Learning" by Bishop (Springer, 2006)). Due to the nonnegative constraint on $\mathbf{w}$, the SVM formulation represents a large-scale optimization problem. For computational convenience, the estimation is performed in the logistic regression formulation, which leads to the following optimization problem:

$$\min_{\mathbf{w}} \sum_{n=1}^{N} \log\left(1 + \exp(-\mathbf{w}^T \bar{\mathbf{z}}_n)\right) \quad \text{s.t.} \quad \mathbf{w} \ge 0, \quad (6)$$

where $\mathbf{w} \ge 0$ means that each element of $\mathbf{w}$ is nonnegative.
In applications with a copious amount of features, it may be expected that most features are irrelevant. For example, in cancer prognosis, most genes are not involved in tumor growth and/or spread. To encourage sparseness, one commonly used strategy is to add the $\ell_1$ penalty of $\mathbf{w}$ to the objective function, which leads to the following optimization problem:

$$\min_{\mathbf{w}} \sum_{n=1}^{N} \log\left(1 + \exp(-\mathbf{w}^T \bar{\mathbf{z}}_n)\right) + \lambda \|\mathbf{w}\|_1 \quad \text{s.t.} \quad \mathbf{w} \ge 0, \quad (7)$$

where $\lambda$ is a user-defined parameter that controls the penalty strength and, consequently, the sparseness of the solution.
Since $\bar{\mathbf{z}}_n$ implicitly depends on $\mathbf{w}$ through the probabilities $P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w})$ and $P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w})$, a recursion method can be used to solve for $\mathbf{w}$. In each iteration, $\bar{\mathbf{z}}_n$ is first computed by using the previous estimate of $\mathbf{w}$, which is then updated by solving the optimization problem of equation (7). The iterations are carried out until convergence.
For fixed $\bar{\mathbf{z}}_n$, equation (7) is a constrained convex optimization problem. However, due to the nonnegative constraint on $\mathbf{w}$, it cannot be solved directly by using a gradient descent method. To overcome this difficulty, the problem can be slightly reformulated by substituting $w_j = v_j^2$:

$$\min_{\mathbf{v}} \sum_{n=1}^{N} \log\left(1 + \exp\left(-\sum_{j=1}^{J} v_j^2 \bar{z}_{nj}\right)\right) + \lambda \sum_{j=1}^{J} v_j^2, \quad (8)$$

thus obtaining an unconstrained optimization problem. It can be shown that at the optimum solution, $w_j = v_j^2$, $1 \le j \le J$. The solution of $\mathbf{v}$ can be readily found through gradient descent with a simple update rule:

$$\mathbf{v}^{(t+1)} = \mathbf{v}^{(t)} - \eta \left( \lambda \mathbf{1} - \sum_{n=1}^{N} \frac{\bar{\mathbf{z}}_n}{1 + \exp\left(\mathbf{w}^{(t)T} \bar{\mathbf{z}}_n\right)} \right) \otimes \mathbf{v}^{(t)}, \quad (9)$$
where $\otimes$ is the Hadamard (element-wise) operator, and $\eta$ is the learning rate determined by the standard line search. Note that the objective function of (8) is no longer a convex function, and thus a gradient descent method may find a local minimizer or a saddle point. The following theorem shows that if the initial point is properly selected, the solution obtained when the gradient vanishes is the global minimizer.
Theorem 1. Let $f(\mathbf{x})$ be a strictly convex function of $\mathbf{x} \in \mathbb{R}^J$ and $g(\mathbf{v}) = f(\mathbf{y})$, where $\mathbf{y} = [y_1, \ldots, y_J] = [v_1^2, \ldots, v_J^2]$. If $\nabla g \big|_{\mathbf{v} = \mathbf{v}^+} = 0$, then $\mathbf{v}^+$ is not a local minimizer, but a saddle point or a global minimizer of $g(\mathbf{v})$; moreover, if $\mathbf{v}^+$ is found through gradient descent with an initial point $v_j^{(0)} \ne 0$, $1 \le j \le J$, then $\mathbf{v}^+$ is the global minimizer of $g(\mathbf{v})$.
For fixed $\bar{\mathbf{z}}_n$, the objective function of equation (7) is a strictly convex function of $\mathbf{w}$. After the feature weighting vector is found, the pairwise distances among data samples are reevaluated using the updated feature weights, and the probabilities $P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w})$ and $P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w})$ are re-computed using the newly obtained pairwise distances. The two steps are iterated until convergence.
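One gradient step of equation (9) is only a few lines of code. The sketch below uses a fixed learning rate instead of the line search, for brevity; the sigmoid factor comes from the logistic loss, and the element-wise product with v comes from the substitution w_j = v_j^2:

```python
import numpy as np

def update_v(v, zbar, lam, eta):
    """One gradient-descent step of eq. (9) on the unconstrained objective (8)."""
    w = v ** 2
    s = 1.0 / (1.0 + np.exp(zbar @ w))  # sigma(-w^T zbar_n), one value per sample
    grad = lam - zbar.T @ s             # per-feature gradient factor
    return v - eta * grad * v           # Hadamard product with v
```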
A pseudo-code of the algorithm is presented as follows:

Algorithm 1: Feature Weight Estimation Algorithm
Input: dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \subset \mathbb{R}^J \times \{\pm 1\}$, kernel width $\sigma$, sparsity regularization parameter $\lambda$, stopping criterion $\theta$
Output: feature weight vector $\mathbf{w}$
Initialization: set $\mathbf{w}^{(0)} = \mathbf{1}$, $t = 1$
repeat
  1. Compute $d(\mathbf{x}_n, \mathbf{x}_i \mid \mathbf{w}^{(t-1)})$ for every pair of samples;
  2. Compute $P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}^{(t-1)})$ and $P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}^{(t-1)})$ as in (4) and (5);
  3. Solve for $\mathbf{v}$ as in (9);
  4. Set $w_j^{(t)} = v_j^2$, $1 \le j \le J$;
  5. $t = t + 1$;
until $\|\mathbf{w}^{(t)} - \mathbf{w}^{(t-1)}\| < \theta$
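Putting the pieces together, a minimal end-to-end sketch of Algorithm 1 follows, reusing the expected_z and update_v helpers from the sketches above. The fixed inner step count and learning rate are assumptions standing in for the line search of the text:

```python
import numpy as np

def feature_weights(X, y, sigma=2.0, lam=1.0, eta=0.01,
                    theta=1e-4, max_iter=100, inner_steps=200):
    """Algorithm 1: alternate between re-estimating the neighbor
    probabilities / zbar under the current weights (steps 1-2) and
    solving for v by gradient descent (step 3), until convergence."""
    w = np.ones(X.shape[1])                 # uniform initial weights, w^(0) = 1
    for _ in range(max_iter):
        zbar = expected_z(X, y, w, sigma)   # steps 1-2: eqs. (3)-(5)
        v = np.sqrt(w)
        for _ in range(inner_steps):        # step 3: eq. (9)
            v = update_v(v, zbar, lam, eta)
        w_new = v ** 2                      # step 4: w_j = v_j^2
        if np.linalg.norm(w_new - w) < theta:
            return w_new                    # stopping criterion met
        w = w_new
    return w
```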
The main complexity of the algorithm comes from computing pairwise distances between data samples. Thus, the computational complexity of the algorithm is $O(TN^2J)$, where $T$ is the number of iterations, $J$ is the feature dimensionality, and $N$ is the number of data samples.
A close look at the update equation of $\mathbf{v}$, given by equation (9), allows further reductions in complexity. If some elements of $\mathbf{v}$ are very close to zero, for example, less than $10^{-5}$, the corresponding features can be eliminated from further consideration with a negligible impact on the subsequent iterations. This elimination mechanism is similar to that used in RFE (see Guyon et al., "Gene selection for cancer classification using support vector machines," Mach. Learn., 46, 389-422 (2002)). However, unlike in RFE, the decision as to which feature should be eliminated, and when, is made completely automatically in the algorithm according to the present invention.
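This elimination heuristic amounts to masking out features whose weights have collapsed; a small sketch, using the threshold from the text:

```python
import numpy as np

def prune_features(X, v, tol=1e-5):
    """Drop features whose element of v has fallen below tol in magnitude;
    later iterations then skip them in the pairwise-distance computation."""
    keep = np.abs(v) > tol
    return X[:, keep], v[keep], keep
```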
The effectiveness of the proposed algorithm is empirically validated on a toy example and several real-world datasets. Using these examples, it is demonstrated that the algorithm is capable of handling problems with an extremely large feature dimensionality. Accordingly, embodiments of the subject algorithm can be utilized in methods including, but not limited to, detecting or prognosticating a disease or condition, pattern recognition and identification, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, cheminformatics, credit card fraud detection, stock market analysis, DNA sequence classification, speech and handwriting recognition, object recognition in computer vision, game playing, and robot locomotion. For example, for sound or image pattern recognition, the dataset can include data obtained from an image or sound recording of a physical object.
A simulation study is performed on an artificially generated dataset, carefully designed to verify various important properties of the algorithm. This example, also called Fermat's spiral problem, is a binary classification problem. Each class has 230 samples distributed in a two-dimensional space, forming a spiral shape, as illustrated in Fig. 1A. In addition to the first two relevant features, each sample is represented by a varying number of irrelevant features, where this number is set to {50, 100, 500, 1000, 5000, 10000, 20000, 30000}. The number 30,000 exceeds the number of features encountered in most scientific fields. For example, human beings have about 25,000 genes, and hence nearly all gene expression microarray platforms have fewer than 25,000 probes. The added irrelevant features are independently sampled from a zero-mean, unit-variance Gaussian distribution. The task for this simulation is to identify the first two relevant features. Note that only if these two features are used simultaneously can the two classes of samples be well separated. Most filter and wrapper approaches perform poorly on this example, since in the former the goodness of each feature is evaluated individually, while in the latter the search for relevant features is performed heuristically.
Fig. 2 illustrates the dynamics of the algorithm performed on the spiral data with 10,000 irrelevant features. The algorithm iteratively refines the estimates of the weight vector w and the probabilities P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) until convergence. Each sample is colored according to its probability of being the nearest miss or hit of a given sample, indicated by the black cross in Fig. 2. With a uniform initial point, the nearest neighbors defined in the original feature space can be completely different from the true ones. The plot also shows that the algorithm converges to a perfect solution in just three iterations.
Figs. 3A-3H present the feature weights learned by the algorithm on the spiral data for a varying number of irrelevant features. The results are obtained for parameters σ and λ set to 2 and 1, respectively, while the same solution holds for a wide range of other values of the kernel width and regularization parameter (insensitivity to a specific choice of these parameters will be discussed shortly). Figs. 3A-3H show that the subject algorithm performs remarkably well, while using the same parameters, over a wide range of feature-dimensionality values that are of practical interest. Also note that the feature weights learned across all feature-dimensionality values are almost identical. The algorithm is computationally very efficient. Fig. 4 shows the CPU time it takes the algorithm to perform feature selection on the spiral dataset with different numbers of irrelevant features. The computer setting is a Pentium 4 at 2.80 GHz with 2.00 GB RAM. As can be seen, the algorithm runs for only 3.5 seconds for the problem with 100 features, 37 seconds for 1000 features, and 372 seconds for 20,000 features. The computational complexity is linear with respect to the feature dimensionality. No existing wrapper method has computational complexity even close to that of the subject algorithm. Depending on the classifier used in the search for relevant features, it may take several hours for a wrapper method to analyze the same dataset with only 1000 features, and yet there is no guarantee that the optimal solution will be reached, due to the heuristic search.
The kernel width σ and the regularization parameter λ are the two input parameters of the algorithm. They can be estimated through cross validation on training data. It is well known that cross validation may produce an estimate with a large variance. Fortunately, this does not pose a serious concern for the subject algorithm. In Figs. 5A-5E and 6A-6F, the feature weights learned with different kernel widths and regularization parameters are plotted. As shown by these plots, the algorithm performs well over a wide range of parameter values, yielding the largest weights for the first two relevant features, while the other weights are significantly smaller. This suggests that the algorithm's performance is largely insensitive to a specific choice of the parameters σ and λ, which makes parameter tuning, and hence the implementation of the algorithm, easy, even for researchers outside of the machine learning community.
Fig. 7 presents the convergence analysis of the subject algorithm run on the spiral dataset with 5000 irrelevant features, for λ = 1 and different kernel widths σ ∈ {0.01, 0.05, 0.5, 1, 10, 50}. It can be observed that the algorithm converges for a wide range of σ values, and that, in general, a larger kernel width yields faster convergence. These results validate the theoretical convergence analysis presented below in the section entitled CONVERGENCE ANALYSIS.

CONVERGENCE ANALYSIS
This section presents the convergence analysis of the algorithm according to the present invention. First, its asymptotic behavior is studied. If σ → +∞, then for every w,

$$\lim_{\sigma \to +\infty} P(x_i = NM(x_n) \mid \mathbf{w}) = \frac{1}{|\mathcal{M}_n|}, \quad \forall\, i \in \mathcal{M}_n, \qquad (10)$$

since lim_{σ→+∞} k(d) = 1. On the other hand, if σ → 0, by assuming that for every x_n, d(x_n, x_i | w) ≠ d(x_n, x_j | w) whenever i ≠ j,

$$\lim_{\sigma \to 0} P(x_i = NM(x_n) \mid \mathbf{w}) = \begin{cases} 1, & \text{if } d(x_n, x_i \mid \mathbf{w}) = \min_{j \in \mathcal{M}_n} d(x_n, x_j \mid \mathbf{w}), \\ 0, & \text{if } d(x_n, x_i \mid \mathbf{w}) > \min_{j \in \mathcal{M}_n} d(x_n, x_j \mid \mathbf{w}). \end{cases} \qquad (11)$$

Similar asymptotic behavior holds for P(x_i = NH(x_n) | w). From the above analysis, it follows that for σ → +∞ the algorithm converges to a unique solution in one iteration, since P(x_i = NM(x_n) | w) and P(x_i = NH(x_n) | w) are constants for any initial feature weights. On the other hand, for σ → 0, the algorithm is rarely observed empirically to converge. This suggests that the convergence behavior and convergence rate of the algorithm are fully controlled by the kernel width. To prove this, the well-known Banach fixed point theorem is used. The fixed point theorem is first stated without proof, which can be found, for example, in "Numerical Analysis" by Kress (Springer-Verlag, New York, 1998).
Definition 1. Let U be a subset of a normed space Z, and let ‖·‖ be a norm defined in Z. An operator T : U → Z is called a contraction operator if there exists a constant q ∈ [0, 1) such that

$$\|T(x) - T(y)\| \le q \|x - y\| \qquad (12)$$

for every x, y ∈ U. Here, q is called the contraction number of T.
Definition 2. An element x of a normed space Z is called a fixed point of T : U → Z if T(x) = x.
Theorem 2 (Fixed Point Theorem). Let T be a contraction operator mapping a complete subset U of a normed space Z into itself. Then the sequence generated as

$$x^{(t+1)} = T(x^{(t)}), \quad t = 0, 1, 2, \ldots \qquad (13)$$

with arbitrary x^(0) ∈ U converges to the unique fixed point x* of T. Moreover, the following estimation error bounds hold:

$$\|x^{(t)} - x^*\| \le \frac{q^t}{1-q}\,\|x^{(1)} - x^{(0)}\| \quad \text{and} \quad \|x^{(t)} - x^*\| \le \frac{q}{1-q}\,\|x^{(t)} - x^{(t-1)}\|. \qquad (14)$$
Then, the fixed point theorem is used to prove that the subject algorithm converges to a unique fixed point. The gist of the proof is to identify a contraction operator for the algorithm, and to make sure that the conditions of Theorem 2 are met. To this end, P = {p : p = [P(x_i = NM(x_n) | w), P(x_i = NH(x_n) | w)]} is defined, and the first step of the algorithm is specified in a functional form as T1 : R^J_{≥0} → P, where T1(w) = p, and the second step is specified as T2 : P → R^J_{≥0}, where T2(p) = w. Here R^J_{≥0} = {w : w ∈ R^J, w ≥ 0}. Then, the algorithm can be written as w^(t) = (T2 ∘ T1)(w^(t-1)) = T(w^(t-1)), where ∘ denotes functional composition and T : R^J_{≥0} → R^J_{≥0}. Since R^J_{≥0} is a closed subset of the normed space R^J, it is complete, and T is an operator mapping the complete subset R^J_{≥0} into itself. Next, note that for σ → +∞, the algorithm converges in one step. Hence, lim_{σ→+∞} ‖T(w_1, σ) − T(w_2, σ)‖ = 0 for every w_1, w_2 ∈ R^J_{≥0}. Therefore, in the limit, T is a contraction operator with contraction number q = 0, that is, lim_{σ→+∞} q(σ) = 0. Therefore, for every ε > 0, there exists σ* such that q(σ) ≤ ε whenever σ > σ*. By setting ε < 1, the resulting operator T is a contraction operator. By the Banach fixed point theorem, the subject algorithm converges to a unique fixed point, provided the kernel width is properly selected. The above arguments establish the convergence theorem of the algorithm, as stated below.
Theorem 3. For the feature selection algorithm defined in Algorithm 1, there exists σ* such that lim_{t→∞} ‖w^(t) − w^(t−1)‖ = 0 whenever σ > σ*. Moreover, for a fixed σ > σ*, the algorithm converges to a unique solution for any nonnegative initial feature weights w^(0).
The theorem ensures the convergence of the algorithm if the kernel width is properly selected. This is a very loose condition, as the empirical results show that the algorithm always converges for a sufficiently large kernel width (see Fig. 7). Also, the error bound in equation (14) indicates that the smaller the contraction number, the tighter the error bound and hence the faster the convergence rate. The experiments suggest that a larger kernel width yields a faster convergence. Unlike many other machine learning algorithms (e.g., neural networks), the convergence and the solution of the subject algorithm are not affected by the initial value if the kernel width is fixed. Even if the initial feature weights were wrongly selected, and the algorithm started computing erroneous nearest misses and hits for each sample, the theorem assures that the algorithm will eventually converge to a unique solution. This property has a very important consequence that the algorithm is capable of handling problems with an extremely large feature dimensionality, which is experimentally demonstrated with respect to the spiral problem (see Fig. 3).
MULTI-CLASS PROBLEMS
Embodiments of the present invention can be used to perform feature selection for multiclass problems. Some existing feature selection algorithms, originally designed for binary problems, can be naturally extended to multiclass settings, while for others the extension is not straightforward. In particular, for both embedded and wrapper methods, the extension largely depends on the capability of a classifier to handle multiclass problems. In many cases, a multiclass problem is first decomposed into several binary ones by using an error-correcting-code method, and then feature selection is performed for each binary problem. This strategy further increases the computational burden of embedded and wrapper methods. In contrast, the algorithm according to the present invention is based on local learning, and hence does not suffer from this problem.
A natural extension of the margin defined in equation (1) to multiclass problems is:

$$\rho_n = \min_{c \in \mathcal{Y},\, c \neq y_n} d\!\left(\mathbf{x}_n, NM^{(c)}(\mathbf{x}_n) \mid \mathbf{w}\right) - d\!\left(\mathbf{x}_n, NH(\mathbf{x}_n) \mid \mathbf{w}\right), \qquad (15)$$

where Y is the set of class labels, NM^(c)(x_n) ∈ D_c is the nearest neighbor of x_n from class c, and D_c is a subset of D containing only samples from class c. A further explanation of this margin equation can be found in "Unifying error-correcting and output-code AdaBoost through the margin concept" by Sun et al., Proc. 22nd Int. Conf. Mach. Learn., 872-879 (2005), which is hereby incorporated by reference in its entirety. Accordingly, the derivation of a feature selection algorithm for multiclass problems according to embodiments of the present invention can be provided by using the margin defined in equation (15). The mathematical derivation is not included here, but can be accomplished through a straightforward approach.
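A minimal sketch of this multiclass margin computation, under the weighted l1 distance used throughout; the function and variable names are illustrative:

import numpy as np

def multiclass_margin(X, y, n, w):
    # Margin of sample n under weights w, per equation (15): the minimum, over
    # classes c != y_n, of the distance to the nearest miss from class c,
    # minus the distance to the nearest hit.
    d = np.abs(X - X[n]) @ w                 # weighted l1 distances to x_n
    d[n] = np.inf                            # exclude the sample itself
    hit = d[y == y[n]].min()                 # nearest-hit distance
    miss = min(d[y == c].min() for c in np.unique(y) if c != y[n])
    return miss - hit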
UCI DATASET EXPERIMENTS

This section presents feature selection results obtained with the subject algorithm on seven benchmark UCI datasets from the University of California at Irvine Repository of Machine Learning Databases: banana, waveform, twonorm, thyroid, heart, diabetics, and splice. Information about each dataset is summarized in Table 1.
Table 1: Summary of UCI datasets

Dataset     Train   Test   Feature
twonorm     400     7000   20 (5000)
waveform    400     4600   21 (5000)
banana      468     300    2 (5000)
thyroid     140     75     5 (5000)
diabetics   468     300    8 (5000)
heart       170     100    13 (5000)
splice      400     2175   60 (5000)
For each dataset, the set of original features is augmented by a total of 5000 artificially generated irrelevant features, which is indicated in the parentheses in Table 1. The irrelevant features are independently sampled from a Gaussian distribution with zero-mean and unit-variance. It should be noted that some features in the original feature sets may be irrelevant or weakly relevant, and hence may receive zero weights in an embodiment of the subject algorithm. Since the true relevance of the original features is unknown, to verify that an embodiment of the subject algorithm does indeed discover all relevant features, the classification performance of SVM (with the Radial Basis Function (RBF) kernel) is compared in two cases: (1) when only the original features are used, and (2) when the features selected by the subject algorithm are used. It is well known that SVM is very robust against noise, and that the presence of a few irrelevant features in the original feature sets should not significantly affect its performance. Hence, the classification performance of SVM obtained in the first case should be very close to that of SVM performed on the optimum feature subsets that are a priori unknown. If SVM performs similarly in both cases, it will be possible to conclude that the subject algorithm achieves close-to-optimum solutions.
The structural parameters of SVM are estimated through ten-fold cross validation on training data. To further demonstrate that the performance of embodiments of the subject algorithm is largely insensitive to a specific choice of the input parameters, kernel width σ = 2 and regularization parameter λ = 1 are used for all seven UCI datasets. The stopping criterion is θ = 0.01. The algorithm is run 10 times for each dataset. In each run, a dataset is randomly partitioned into training and test sets. The averaged classification errors and standard deviations (%) of SVM are reported in Table 2.
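A sketch of this evaluation protocol, assuming scikit-learn for the RBF-SVM and cross validation; the parameter grid and the `select_weights` stand-in for the subject algorithm are illustrative, not from the specification:

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def uci_protocol_error(X, y, select_weights, n_noise=5000, seed=0):
    # Augment the data with irrelevant Gaussian features, run the feature
    # selection step on the training split, then score an RBF-SVM whose
    # structural parameters are tuned by ten-fold cross validation.
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([X, rng.standard_normal((X.shape[0], n_noise))])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=seed)
    w = select_weights(X_tr, y_tr)           # e.g., the sketch of Algorithm 1
    keep = w > 0                             # features with nonzero weight
    svm = GridSearchCV(SVC(kernel='rbf'),
                       {'C': [1, 10, 100], 'gamma': ['scale', 0.1, 0.01]},
                       cv=10)
    svm.fit(X_tr[:, keep], y_tr)
    return 1.0 - svm.score(X_te[:, keep], y_te)   # classification error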
Table 2
Dataset     SVM (selected features)   SVM (original features)   FDR        CPU time (seconds per run)
twonorm     2.6 (0.2)                 2.6 (0.2)                 0/1000     162
waveform    11.7 (0.9)                10.1 (0.6)                1.6/1000   157
banana      10.9 (0.5)                10.9 (0.5)                0.1/1000   97
thyroid     23.7 (1.1)                24.9 (1.3)                1.8/1000   267
diabetics   5.3 (2.4)                 4.7 (2.1)                 0.1/1000   13
heart       17.2 (4.2)                18.4 (4.0)                0.4/1000   73
splice      12.9 (2.1)                14.4 (0.9)                1.0/1000   330
Average     /                         /                         0.7/1000   157
Referring to Table 2, the false discovery rate (FDR) represents the ratio between the number of artificially added irrelevant features with non-zero weights and the total number of irrelevant features (i.e., 5000). Feature weights, learned in one sample trial for each of the seven datasets, are shown in Fig. 8. Referring to Fig. 8, the dashed lines indicate the number of original features. The weights plotted on the left side of the dashed line are associated with the original features, while those on the right are associated with the additional 5000 irrelevant features. From these experimental results, the following can be observed:
(1) From Table 2, it is shown that SVM using the features identified by the subject algorithm performs similarly to, or even slightly better than, SVM using the original features. This suggests that the subject algorithm can achieve a close-to-optimum solution in the presence of an extremely large number of irrelevant features. It should be emphasized that these results are obtained without estimating the optimal input parameters.
(2) In addition to successfully identifying relevant features, the subject algorithm performs remarkably well in removing irrelevant features. The false discovery rate, averaged over seven datasets, is only 0.7/1000. The feature weights learned on one realization of each dataset are plotted in Fig. 8. For ease of presentation, the maximum value of each feature weight vector is normalized to 1. For some datasets (e.g., banana and thyroid), the weights of the false positives are very small, and it is possible to achieve a perfect solution by slightly increasing the regularization parameter λ.
(3) According to the averaged CPU times of the seven datasets presented in Table 2, it is shown that the subject algorithm is capable of processing thousands of features within a few minutes on personal computers.
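Referring back to the FDR defined above, the quantity can be computed directly from a learned weight vector; a minimal sketch, assuming the artificial noise features occupy the trailing columns (as in the augmentation described above) and with illustrative names:

import numpy as np

def false_discovery_rate(w, n_original, n_noise=5000):
    # Fraction of the artificially added irrelevant features (the trailing
    # n_noise entries of w) that received a nonzero weight.
    return np.count_nonzero(w[n_original:]) / n_noise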
The subject algorithm is based on local learning, wherein no assumption is made about the underlying data distribution. Consequently, the features selected by the subject algorithm exhibit very good exportability. To demonstrate this, two other classifiers, KNN (K-Nearest Neighbor) and C4.5, are applied to the seven UCI datasets. KNN and C4.5 are two of the most popular classifiers used in real applications. The experimental protocol is exactly the same as before. The averaged classification errors and standard deviations (%) of the two classifiers are reported in Tables 3 and 4, respectively. For comparison, the classification results of KNN and C4.5 using all features (i.e., the original and 5000 irrelevant features) are also reported. For computational reasons, a similar experiment was not performed for SVM. Note that the performance of the three classifiers is not important here, but rather how they perform on the original features and on the features selected in the presence of 5000 irrelevant ones.
Table 3

Dataset     KNN (selected features)   KNN (original features)   KNN (all features)
twonorm     3.1 (0.2)                 3.3 (0.3)                 36.4 (1.7)
waveform    12.2 (0.9)                10.7 (0.8)                31.0 (0.9)
banana      11.5 (0.6)                11.5 (0.6)                49.2 (1.7)
thyroid     24.8 (1.4)                26.0 (1.7)                24.7 (1.3)
diabetics   6.0 (2.6)                 4.7 (2.2)                 32.3 (3.7)
heart       19.6 (4.3)                21.2 (2.6)                34.8 (4.5)
splice      17.5 (1.4)                30.6 (4.4)                40.3 (1.9)

Table 4

Dataset     C4.5 (selected features)  C4.5 (original features)  C4.5 (all features)
twonorm     20.5 (0.6)                20.5 (0.6)                28.4 (2.4)
waveform    17.2 (1.0)                17.9 (1.0)                23.9 (1.5)
banana      15.6 (2.4)                15.6 (2.4)                40.6 (11.5)
thyroid     26.7 (1.4)                27.2 (2.6)                36.2 (3.7)
diabetics   8.5 (2.2)                 8.9 (2.6)                 12.0 (3.7)
heart       23.3 (4.1)                23.2 (5.0)                35.8 (4.5)
splice      9.9 (2.2)                 10.2 (1.6)                14.0 (1.9)
For all three classifiers, the classification errors obtained by using the selected features are similar to, or even slightly better than, those obtained by using the original feature sets. This validates not only the high accuracy of the subject algorithm in estimating the relevance of features, but also that this estimation is not based on criteria suited to any particular classifier.
BREAST CANCER PROGNOSIS
This section addresses the problem of identifying a gene signature in microarray data for breast cancer prognosis. Accomplishing an accurate prediction of the likely course of breast cancer could have a huge humanitarian and economic impact. Through this experiment, it is illustrated how embodiments of the subject algorithm can have a broad impact on other areas, beyond machine learning.
Breast cancer is the second most common cause of death from cancer among women in the United States. In 2007, it is estimated that about 178,400 new cases of breast cancer will be diagnosed, and 40,400 women are expected to die from this disease (data from American Cancer Society, 2007). A major clinical problem of breast cancer is the recurrence of therapeutically resistant disseminated disease. Adjuvant therapy (chemotherapy and hormonal therapy) reduces the risk of distant metastases by one third; however, it is estimated that about 70% of patients receiving treatment would have survived without it. Being able to predict disease outcomes more accurately will help physicians make more informed decisions regarding the necessity of adjuvant treatment, and will lead to the development of individually tailored treatments with enhanced efficacy. Consequently, this would ultimately contribute to a decrease in overall breast cancer mortality, a reduction in overall health care costs, and an improvement in patients' quality of life by avoiding unnecessary treatments and their related toxic side effects.
Despite significant advances in primary cancer treatment, the prediction of the metastatic behavior of tumors remains one of the greatest clinical challenges in oncology. Two currently used treatment guidelines are the St. Gallen (see Goldhirsch et al., Meeting highlights: updated international expert consensus on the primary therapy of early breast cancer, J. Clin. Oncol., 21, 3357-3365, 2003, which is hereby incorporated by reference in its entirety) and NIH (see Eifel et al., National Institutes of Health consensus development conference statement: adjuvant therapy for breast cancer, J. Natl. Cancer Inst., 93, 979-989, 2000) consensus criteria, which determine whether a patient is at a high risk of tumor recurrence based on a panel of clinical markers. These criteria have only 10% specificity at the 90% sensitivity level. Here, specificity is defined as the rate of correctly predicting that there is no need for the adjuvant therapy when this therapy is indeed unnecessary, and sensitivity is the rate of administering the adjuvant therapy when the therapy is indeed needed and effective. A more accurate prognostic criterion is urgently needed to avoid over- or under-treatment in newly diagnosed patients.
Microarray technology, by monitoring the expression profiling of thousands of genes in a tissue simultaneously, has been shown to provide higher levels of accuracy than the current clinical systems. For example, out of tens of thousands of genes, 70- and 76-gene breast cancer prognostic signatures have been derived by van't Veer et al. (Nature 2002) and Wang et al. (Lancet 2005), respectively, achieving a much higher specificity (50%) than the current clinical systems at the same sensitivity level. These results are considered groundbreaking in breast cancer prognosis. A large-scale clinical validation study involving thousands of breast cancer patients is currently being conducted in Europe to evaluate the prognostic value of the proposed signatures. However, both gene signatures are derived based on rather simple computational algorithms, which leave much room for improvement by using advanced machine learning algorithms, as demonstrated in this section.
The term "detector" refers to a component or a set of components that participates in the act of detecting an analyte. A detector that is specific for a particular analyte target is a component or set of components that is specialized in its capability for detecting; in the environment where employed, the main or principal capability of such a detector is detecting the particular analyte target. For example, a calcium-specific hollow cathode lamp in an atomic absorption spectrometer is a detector specific for calcium; such a lamp produces specialized wavelengths of light that are principally useful in determining how much calcium is in a sample. In a computerized atomic absorption spectrometer employing a wavelength-tunable diode laser, if a set of components necessarily causes the laser to be tuned to a wavelength principally suitable for detecting a particular element, then the set of components is a detector specific for that element. For example, a computerized analytical instrument system may contain a set of hardware and/or software components that: (i) displays a choice of analytes on a screen, (ii) registers a user choice of particular analyte "X", and (iii) in response, produces a signal that adapts the computerized analytical instrument system such that its main or principal capability is detecting analyte "X". RNA and DNA oligonucleotides, and commonly known structural variations of these such as peptide nucleic acids, may be employed as detectors that are specific for their corresponding complementary sequences. Likewise, an antibody may be used to specifically detect protein(s) displaying a complementary epitope.
Expression level. As used herein, "expression level" means an amount or concentration of gene product resulting from expression of a gene. An abnormally high or abnormally low expression level for a gene product may be indicative in some cases of a state of disease or disorder, in this case breast cancer. "Expression level" encompasses qualitative characterizations such as "presence" or "absence" of a gene product in a sample.
Expression profile. As used herein, "expression profile" means a set of one or more expression levels.
Common methods for specifically assaying particular nucleic acid sequences (e.g., mRNA transcribed from a particular gene) frequently take advantage of the high specificity of complementary nucleic acid sequences, as in Northern or Southern blotting. As another example, in Affymetrix's GeneChip technology, small unique DNA probes are localized at potentially millions of individual spots ("features") on a single chip. RNA comprising the entire transcriptome of a cell sample may then be labeled for detection and hybridized against the unique DNA features on the chip. The amount of labeled RNA bound to a particular DNA feature on the chip then corresponds to the expression level of mRNA that is complementary to that DNA probe. Affymetrix typically uses DNA 25-mers as immobilized probes. Additional non-limiting examples of techniques for analyzing polynucleotides include cDNA microarrays, reverse transcription-polymerase chain reaction (RT-PCR), serial analysis of gene expression (SAGE), and branched nucleic acid methods such as Bayer's QuantiGene.
Gene product. As used herein, "gene product" (or, equivalently, the "product of a gene") is intended to be understood as commonly used by skilled practitioners to refer to a biochemical product expressed from a gene. For example, a gene may be transcribed to yield mRNA, and the mRNA may then be translated to give polypeptide. Both the mRNA and the polypeptide are gene products of the gene. Likewise, a cDNA molecule arising from the reverse transcription of mRNA can also be considered a gene product within the context of this invention. Some genes, for example rRNA genes and tRNA genes, may have an RNA gene product but not a polypeptide gene product.
Isolated or biologically pure. As used herein, the terms "isolated" or "biologically pure" refer to material that is substantially or essentially free from components which normally accompany the material as it is found in its native state.
Good Prognosis. In the context of this invention, a "good prognosis" is defined as a low likelihood/probability of cancer recurrence in a patient or as a high probability of having disease-free survival for a period of time post-treatment, typically 5 or 7 years.
Bad Prognosis. As used herein, a "bad prognosis" is defined as a high likelihood/probability of cancer recurrence in a patient or as a low probability of having disease-free survival for a period of time post-treatment, typically 5 or 7 years.
A set of four to five genes is identified, namely LOC58509, CEGP1, AL080059, ATP5E, and PRAME, that enables highly accurate predictions of cancer recurrence in breast cancer patients. In some embodiments, PRAME can be omitted as a prognostic indicator of breast cancer recurrence. In other embodiments, any combination of two or more of the five genes may be used, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes. For each embodiment employing the set of four or five genes, an analogous embodiment employing any combination of two or more of LOC58509, CEGP1, AL080059, ATP5E, and PRAME may be used, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes.
EXEMPLARY EMBODIMENTS RELATING IN PART TO BREAST CANCER

The invention includes, but is not limited to, the following embodiments:

1. A method of assigning a prognosis class to a patient comprising: a) obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises or consists of a gene expression profile for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and b) classifying the patient as belonging to a particular prognosis class (e.g., a "good prognosis" class or a "bad prognosis" class) based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME, and wherein the prognosis class is a categorization that is correlated with risk of cancer occurrence or recurrence.
2. A method of assigning treatment to a breast cancer patient comprising: a) obtaining a biological sample from the breast cancer patient; b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes comprising or consisting of LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile, wherein the subset comprises or consists of gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and/or PRAME; and d) assigning treatment to the breast cancer patient based wholly or in part on the patient's prognosis class. An illustrative computational sketch of the classification step in embodiments 1 and 2 appears after embodiment 6 below.
3. An article of manufacture comprising: a) an individual means for detecting expression products of the CEGP1, AL080059, ATP5E and/or PRAME genes, wherein said means comprises, consists essentially of, or consists of polynucleotides that hybridize to said gene products; and b) a plurality of individual means for detecting the product(s) of a set of genes selected from LOC58509, CEGP1, AL080059, ATP5E and/or PRAME.
4. The article of manufacture or an apparatus according to embodiment 3, wherein each individual means for detecting is a polynucleotide probe. The probe may be immobilized on a solid support. Where probes or polynucleotides are immobilized upon a solid support, the solid support can comprise, consist essentially of, or consist of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or more successive nucleotides of the LOC58509, CEGP1, AL080059, ATP5E or PRAME sequences (or polynucleotide sequences complementary thereto). As used herein, the term "fragment" refers to a polynucleotide that is a consecutive span of nucleotides within the gene (e.g., LOC58509, CEGP1, AL080059, ATP5E and/or PRAME) that is smaller than the total length of the nucleotides encoding the gene (as exemplified by any of the attached sequences). Optionally excluded from the scope of the invention are any commercially available solid supports to which polynucleotides encoding LOC58509, CEGP1, AL080059, ATP5E and/or PRAME are immobilized.
5. A kit comprising from one to five containers, each container comprising a buffer and at least one polynucleotide encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME, or fragments of said polynucleotide that hybridize with gene products of LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME. In such an embodiment, the polynucleotide can comprise, consist essentially of, or consist of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or more successive nucleotides of the LOC58509, CEGP1, AL080059, ATP5E or PRAME polynucleotide sequences (or polynucleotide sequences complementary thereto).
6. The kit according to embodiment 5, wherein said kit contains fewer than five containers and each container comprises a buffer and two, three, four, or five polynucleotides encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME; polynucleotides that hybridize with LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME; or fragments of said polynucleotides that hybridize with gene products of LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME.

Another aspect of the invention provides a method for determining a prognosis in a subject, comprising detecting the presence of a combination of gene products (e.g., mRNA transcripts or cDNA molecules) originating from LOC58509, CEGP1, AL080059, ATP5E, and PRAME (within this application, the term "gene product(s)" may be substituted with "biomarker(s)"). These gene products are measured in a biological sample from the subject, wherein the presence of the gene product, or a level (e.g., concentration) of the gene product above or below a pre-determined threshold, is indicative of a particular prognosis. In some embodiments, all four biomarkers (LOC58509, CEGP1, AL080059 and ATP5E) are examined. Preferably, all five biomarkers (LOC58509, CEGP1, AL080059, ATP5E, and PRAME) are detected. In other embodiments, any combination of two or more of the five biomarkers is detected. Detection can be quantitative, qualitative, or semi-quantitative.
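As a purely illustrative sketch of the classification step in embodiments 1 and 2 above, the following shows how a trained classifier could assign a prognosis class from the expression levels of the signature genes; the function and variable names, the gene ordering, the 0/1 label encoding, and the use of a scikit-learn-style classifier are all assumptions, not part of the claims:

import numpy as np

SIGNATURE_GENES = ["LOC58509", "CEGP1", "AL080059", "ATP5E", "PRAME"]

def assign_prognosis_class(profile, classifier, genes=SIGNATURE_GENES):
    # `profile` maps gene names to measured expression levels; `classifier`
    # is assumed trained on profiles with known outcomes (0 = good, 1 = bad).
    x = np.array([[profile[g] for g in genes]])
    return "bad prognosis" if classifier.predict(x)[0] == 1 else "good prognosis"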
In another embodiment, the invention includes a method for prognostic evaluation of a subject having, or suspected of having, cancer, comprising: a) determining the level(s) of cancer biomarkers in a sample obtained from the subject; b) comparing the level(s) determined in step (a) to level(s) of the cancer biomarker(s) known to be present in samples obtained from previous cancer patients for whom outcomes are known; and c) determining the prognosis of the subject based on the comparison of step (b).
The methods, devices, and kits of the invention can be used for the analysis of cancer prognosis, disease progression, and mortality. Depending upon the particular combination of cancer biomarker(s) used in the practice of the various disclosed embodiments of the invention, increased levels or decreased levels of a detected biomarker in a sample compared to a standard may be indicative of advanced disease stage, residual tumor, and/or increased risk of disease progression and mortality. Specifically, the methods, devices and kits detect gene products of LOC58509, CEGP1, AL080059, ATP5E and/or PRAME, or fragments thereof.
For some of the embodiments disclosed herein, the methods utilize an expression profile based upon a four-gene or five-gene signature (e.g., a) LOC58509, CEGP1, AL080059 and ATP5E, or b) LOC58509, CEGP1, AL080059, ATP5E, and PRAME), as set forth in the following Table 5:

Table 5. Expression Profile

Gene name          Sequence description                                              Mean expression in recurrent tumors
LOC58509 (4732)    NY-REN-24 antigen                                                 Underexpressed
CEGP1 (10643)      Homo sapiens CEGP1 protein (CEGP1), mRNA                          Underexpressed
AL080059 (10889)   Homo sapiens mRNA; cDNA DKFZp564H142 (from clone DKFZp564H142)    Overexpressed
ATP5E (13800)      ESTs                                                              Overexpressed
PRAME (8776)       Preferentially expressed antigen in melanoma                      Overexpressed
Other analogous embodiments are contemplated wherein any combination of two or more of LOC58509, CEGP1, AL080059, ATP5E, and PRAME are used, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes.
Nucleic Acids
Nucleic acids, including naturally occurring nucleic acids, oligonucleotides, antisense oligonucleotides, and synthetic oligonucleotides, that hybridize to the nucleic acid biomarkers of the invention (e.g., that hybridize to polynucleotide gene products corresponding to LOC58509, CEGP1, AL080059, ATP5E, and PRAME) are useful as agents to detect the presence of biomarkers of the invention in the biological samples of cancer patients or those at risk of cancer.
Although any length oligonucleotide may be utilized to hybridize to a nucleic acid that encodes a biomarker polypeptide, oligonucleotides comprising 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or more successive nucleotides of the LOC58509, CEGP1, AL080059, ATP5E or PRAME sequences (or polynucleotide sequences complementary thereto) are preferred. Most preferably, the oligonucleotides for use in detecting LOC58509, CEGP1, AL080059, ATP5E and PRAME in samples are those that range from 9 to 50 nucleotides.

Also provided by the subject invention are gene chips or arrays comprising or consisting of the various combinations of nucleic acids disclosed above (or herein), or nucleic acid sequences that hybridize to the disclosed nucleic acids. Certain embodiments provide an array of 2000 or fewer nucleic acid sequences which contain the four-gene combination or five-gene combination (sets of genes) disclosed above. The nucleic acids attached to the substrates of the gene chips can range from at least 8 consecutive nucleotides to the full length of any sequence disclosed herein. Oligonucleotides of the prognostic signature genes described herein can also be attached to a substrate for the production of a gene chip. Thus, a gene chip according to the invention comprises an array of 2000, 1500, 1000, 500 or 100 or fewer nucleic acid sequences which include those nucleic acids or oligonucleotides disclosed herein, are at least 8 consecutive nucleotides in length, and are no longer than the complete sequence for a given SEQ ID NO:. A gene chip array of this invention contains 4 or 5 of the disclosed nucleic acid/oligonucleotide sequences (or those that are fully complementary thereto) and no more than 2000, 1500, 1000, 500 or 100 discrete nucleic acid sequences.

An "array" represents an intentionally created collection of nucleic acid sequences or oligonucleotides that can be prepared either synthetically or biosynthetically. In particular, the term "array" herein means an intentionally created collection of polynucleotides or oligonucleotides attached to at least a first surface of at least one solid support wherein the identity of each polynucleotide or oligonucleotide at a given predefined region is known. Thus, the subject invention provides a gene chip comprising an array of 2000 or fewer nucleic acid sequences or oligonucleotides, provided that said array includes the four-gene or five-gene prognostic signature disclosed above and said nucleic acid sequences or oligonucleotides are at least 8 consecutive nucleotides in length. The gene chips or arrays discussed herein can, in certain embodiments, contain anywhere from 8 to 50 (or more) consecutive nucleotides of the genes provided by the gene signature disclosed herein (or 8 to 50 or more consecutive nucleotides of a complementary sequence).
BREAST CANCER EXAMPLE — GENE SIGNATURE FOR PREDICTING RISK OF DISTANT RECURRENCE OF BREAST CANCER
The same dataset used by van't Veer et al. (Nature 2002) that spawned the 70-gene signature was analyzed. This dataset contains 24,481 probes that measure the gene expression levels of tumor samples collected from 97 lymph node-negative breast cancer patients. Among these patients, 46 developed distant metastases within 5 years, and 51 remained free of metastases for at least 5 years. The task was to identify a gene signature (i.e., the most relevant features) that enables accurate prediction of the risk of distant recurrence of breast cancer within a 5-year post-surgery period.
Due to the small sample size, the leave-one-out cross validation (LOOCV) method is used. In each iteration, one sample is held out for testing and the remaining samples are used for identifying a gene signature and training a classifier. Linear discriminant analysis (LDA) is used to estimate the predictive performance of the identified gene signature. One major advantage of LDA, compared to other classifiers such as SVM, is that LDA has no structural parameters. To optimize the prediction performance, the input parameters are tuned via cross-validation. Specifically, the kernel width is set as σ = 4, and then LOOCV is performed on the training data to estimate the regularization parameter λ. Empirically, it is found that the algorithm yields nearly identical prediction performance for a wide range of other kernel-width values.
Fig. 9 depicts this experimental procedure. In each iteration, one sample is held out for testing and the remaining samples are used for training. The experimental protocol consists of inner and outer loops. In the inner loop, LOOCV is performed to estimate the optimal classification parameters based on the training data provided by the outer loop. In the outer loop, the held-out sample is classified by using the best parameters from the inner loop. The experiment is repeated until each sample has been used for testing.
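A minimal sketch of this nested protocol is given below, assuming scikit-learn's LDA and leave-one-out utilities and a stand-in `select_genes` function (returning a feature mask) for the feature selection step; the candidate λ grid and all names are illustrative:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

def nested_loocv_error(X, y, select_genes, lambdas=(0.1, 1.0, 10.0)):
    errors = []
    for tr, te in LeaveOneOut().split(X):
        # Inner loop: choose lambda by LOOCV error on the training fold only.
        best_lam, best_err = None, np.inf
        for lam in lambdas:
            err = 0
            for itr, ite in LeaveOneOut().split(X[tr]):
                genes = select_genes(X[tr][itr], y[tr][itr], lam)
                lda = LinearDiscriminantAnalysis().fit(X[tr][itr][:, genes],
                                                       y[tr][itr])
                err += lda.predict(X[tr][ite][:, genes])[0] != y[tr][ite][0]
            if err < best_err:
                best_lam, best_err = lam, err
        # Outer loop: classify the held-out sample with the chosen lambda.
        genes = select_genes(X[tr], y[tr], best_lam)
        lda = LinearDiscriminantAnalysis().fit(X[tr][:, genes], y[tr])
        errors.append(lda.predict(X[te][:, genes])[0] != y[te][0])
    return np.mean(errors)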
The predictive values of the newly identified gene signature are demonstrated by comparing its performance with those of the St. Gallen criterion, the 70-gene signature of van't Veer et al. (Nature 2002), and the inventors' previously derived hybrid signature, which combines both clinical and genetic information (Sun et al., 2007a). The results of van't Veer et al. (Nature 2002) are reproduced by closely following their experimental procedure, wherein the top 70 genes are first identified, and then their predictive value is assessed by using a correlation-based classifier. The performance of the 70-gene signature is also evaluated through LOOCV, where the held-out testing sample is not involved in the training process.
Fig. 10 presents a comparison between the receiver operating characteristic (ROC) curves of the 70-gene signature, the hybrid signature, and the gene signature obtained according to an embodiment of the present invention. To verify that the performance difference is not due to the use of different classifiers, a correlation-based classifier is also applied to the gene signature obtained according to this embodiment. The legend New-LDA refers to the prediction performance of the gene signature according to the subject embodiment using LDA, and the legend New-Corr refers to the prediction performance of the gene signature according to the subject embodiment using a correlation-based classifier. It can be observed that the gene signature obtained according to this embodiment significantly outperforms the 70-gene signature and the hybrid signature.
By following the study of van't Veer et al. (Nature 2002), a threshold is set for each classifier so that the sensitivity of each classifier is equal to 90%. The corresponding specificities are reported in Table 6. For comparison, the specificity of the St. Gallen criterion is also reported.
[Table 6: specificities at the 90% sensitivity level, and odds ratios, of the St. Gallen criterion, the 70-gene signature, the hybrid signature, and the new gene signature]
The 70-gene signature and hybrid signature each significantly outperform the St. Gallen criterion, whereas the gene signature obtained according to the present invention improves the specificities of the 70-gene signature and the St. Gallen criterion by about 40% and 70%, respectively. This appears to be the best prediction result for this dataset reported in the literature to date. Table 6 also presents, for each approach, the odds ratio for developing distant metastases within five years between the patients with a bad prognosis and the patients with a good prognosis. It can be observed that the gene signature according to the present invention gives a much higher odds ratio (66.0, 95% confidence interval (CI): 18.0 - 242.0) than either the 70-gene or hybrid signature. The difference is more than one error bar, and hence is statistically significant.
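The thresholding and odds-ratio computations described above are simple to express directly; a minimal sketch with illustrative names (`scores` are classifier outputs, larger meaning worse prognosis, and `labels` indicate observed recurrence within five years):

import numpy as np

def specificity_at_sensitivity(scores, labels, target_sens=0.90):
    # Choose the threshold so that at least 90% of true positives (recurrences)
    # are flagged, then report the specificity achieved at that threshold.
    pos = np.sort(scores[labels == 1])
    thr = pos[int(np.floor((1.0 - target_sens) * len(pos)))]
    pred = scores >= thr
    specificity = np.mean(pred[labels == 0] == 0)
    return thr, specificity

def odds_ratio(pred, labels):
    # 2x2 contingency table: prognosis call (pred) vs. observed recurrence.
    a = np.sum((pred == 1) & (labels == 1))  # bad prognosis, recurrence
    b = np.sum((pred == 1) & (labels == 0))  # bad prognosis, no recurrence
    c = np.sum((pred == 0) & (labels == 1))  # good prognosis, recurrence
    d = np.sum((pred == 0) & (labels == 0))  # good prognosis, no recurrence
    return (a * d) / (b * c)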
To further demonstrate the predictive value of the gene signature according to the present invention, survival data analyses of the four approaches are also performed. Figs. 11A, 11B, 11C, and 11D show Kaplan-Meier estimation of the probabilities of remaining distant-metastasis-free in patients with a good or bad prognosis, determined by each approach. The p-value is computed by the use of a log-rank test. The Mantel-Cox estimation of the hazard ratio of distant metastases within five years for the new signature is 21.4 (95% CI: 7.3 - 63.0, p-value < 0.001), which is much larger than the 6.0 (95% CI: 2.0 - 17.0) of the 70-gene signature and the 11.1 (95% CI: 3.9 - 31.5) of the hybrid signature. At the 5-year defining point, all four approaches have similar low relapse rates in patients with a good prognosis, but the patients assigned to the bad-prognosis group by the gene signature obtained according to the present invention have a much lower probability of remaining free of distant metastases (0.14, 95% CI: 0.04 - 0.25) than those determined by the 70-gene signature (0.39, 95% CI: 0.28 - 0.51) and the hybrid signature (0.28, 95% CI: 0.18 - 0.40). The difference is more than one error bar, and hence is statistically significant.
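A sketch of the corresponding survival analysis, assuming the third-party lifelines library; the function name and the boolean grouping variable are illustrative:

from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def survival_comparison(time, event, good):
    # Kaplan-Meier estimates of remaining metastasis-free, per prognosis group,
    # with a log-rank test between the good- and bad-prognosis groups.
    km_good = KaplanMeierFitter().fit(time[good], event[good],
                                      label="good prognosis")
    km_bad = KaplanMeierFitter().fit(time[~good], event[~good],
                                     label="bad prognosis")
    result = logrank_test(time[good], time[~good],
                          event_observed_A=event[good],
                          event_observed_B=event[~good])
    return km_good, km_bad, result.p_value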
Given a small sample size, as with this particular breast cancer dataset, each iteration of LOOCV may generate a different gene signature, since the training data differ (see Simon, R., "Roadmap for developing and validating therapeutically relevant genomic classifiers," J. Clin. Oncol., 23, 7332-7341, 2005). In the present experiments, the majority of LOOCV iterations identify the same gene signature, which consists of only four gene markers, reported in Table 7. A fifth gene, PRAME, is identified in a minority of iterations and is also included in the new signature. The new gene signature is markedly shorter than the 70-gene signature. Fig. 12 plots the feature weights learned on one iteration of LOOCV according to an embodiment of the present invention. The genes presented along the x-axis are ranked by the p-values of a t-test, so that, for example, the first gene contains the most discriminant information according to the t-test. Some of the top-ranked features in the t-test are not selected in the gene signature by the subject algorithm. One possible explanation is that these excluded genes are redundant with respect to the identified gene signature.
[Table 7: the four-gene signature comprising ATP5E, LOC58509, AL080059, and CEGP1]
Of the five identified gene markers ATP5E, LOC58509, AL080059, CEGP1, and PRAME, only CEGP1 and AL080059 are listed in the 70-gene prognosis signature. No specific correlation of ATP5E subunit expression with cancer has been reported, but molecular defects that impinge on mitochondrial energy transduction do play a relevant role in the etiology of cancer by a number of mechanisms, including excessive reactive oxygen species production and metabolic stress-induced signaling that can enhance cellular invasive phenotypes (Amuthan et al., 2001). Accordingly, a number of metabolic markers, including F1-ATP synthase subunits, have been suggested as potential prognostic indicators. For example, altered expression of the beta-catalytic subunit of the F1-ATP synthase (β-F1-ATPase) has been documented in human tumor biopsies of many tissues, including breast (Isidoro et al., 2004). A proteomics examination of a large series of breast carcinomas showed that the alteration of the mitochondrial proteome, and specifically of ATPase subunits, is a hallmark feature of breast cancer, and the expression level of β-F1-ATPase allowed the identification of a subgroup of breast cancer patients with significantly worse prognosis (Isidoro et al., 2005).
Information on the LOC58509 gene sequence is minimal to date. The sequence was originally identified in a cDNA library from brain, and is also known as C19orf29 and NY-REN-24. The latter name was assigned to the sequence because it was also identified in cDNA expression libraries derived from human tumors screened with autologous antibodies derived from renal cancer patients (Scanlan et al., 1999). No functional analysis for this gene product is available.
Analysis of the AL080059 sequence has revealed significant homology with the TSPY-like 5 (TSPYL5) gene, and with NAPs, factors which play a role in DNA replication (Schnieders et al., 1996). NAPs act as molecular chaperones, shuttling histones from their site of synthesis in the cytoplasm to the nucleus. Histone proteins are involved in regulating chromatin structure and accessibility and therefore can impact gene expression (Rodriguez et al., 1997); thus, a role in the tumor cell phenotype can be proposed.
The CEGP1 gene (also known as SCUBE2) is located on human chromosome 11p15 and has homology to the basic helix-loop-helix (bHLH) family of transcription factors. The biological role of CEGP1/SCUBE2 is unknown, but the gene encodes a secreted cell-surface protein containing EGF and CUB domains (Yang et al., 2002). The EGF motif is present in many extracellular proteins that play an important role during development, and the CUB domain is found in several proteins involved in the regulation of extracellular processes such as cell-cell communication and adhesion (Grimmond et al., 2000). The expression of CEGP1/SCUBE2 has been reported to be associated with estrogen receptor status in breast cancer specimens (Abba et al., 2005). Furthermore, expression of CEGP1/SCUBE2 has been detected in vascular endothelium and so may play important roles in angiogenesis (Yang et al., 2002).
Although both AL080059 and CEGP1 were found to be significantly over-expressed in studies of a breast tumor metastasis model (Goodison et al., 2005), neither the AL080059 nor CEGP1 genes have been evaluated independently in human cancers. The preferentially expressed antigen in melanoma (PRAME) gene has been linked to human disease, including cancer.
PRAME is a cancer-testis antigen (CTA), a group of tumor-associated antigens that represent possible target proteins for immuno-therapeutic approaches. Their expression is high in a variety of malignancies but is negligible in healthy tissues (Juretic et al., 2003). PRAME expression is evident in a large variety of cancer cells, including melanoma (Ikeda et al., 1997), squamous cell lung carcinoma, renal cell carcinoma, and acute leukemia (Matsushita et al., 2003). Reports suggest that over-expression of PRAME in human cancers confers growth or survival advantages by antagonizing retinoic acid (RA) signaling, thus inhibiting RA-induced differentiation, growth arrest, and apoptosis (Epping et al., 2005).

This experiment further demonstrates the computational efficiency of the subject algorithm. Fig. 13 presents the CPU time it takes the subject feature selection algorithm to identify a gene signature for the breast cancer dataset with a varying number of genes, ranging from 500 to 24,481. It takes only about 22 seconds to process all 24,481 genes. If a filter method is first used to reduce the feature dimensionality to, say, 2000, as is almost always done in microarray data analysis, the subject algorithm runs for only about two seconds.
Accordingly, the subject algorithm runs several orders of magnitude faster than some state-of-the-art algorithms. The RFE algorithm is a well-known feature selection method specifically designed for microarray data analysis. It has been reported that the RFE algorithm (with the linear kernel) takes about 3 hours to analyze 72 samples with 7129 features. Another algorithm, referred to as Optimal Feature Weighting (OFW) (see Gadat et al., "A stochastic algorithm for feature selection in pattern recognition," J. Mach. Learn. Res., 8, 509-547, 2007), with a C++ compiler, takes about one hour on a dataset with 72 samples and 3859 features. The CPU times of both RFE and OFW are obtained on a leukemia dataset, which has characteristics similar to those of the breast cancer dataset.
PROSTATE CANCER PROGNOSIS
This section addresses the problem of identifying a gene signature in microarray data for prostate cancer prognosis. Accomplishing an accurate prediction of the likely course of prostate cancer could have a huge humanitarian and economic impact. Through this experiment, it is illustrated how embodiments of the subject algorithm can have a broad impact on other areas, beyond machine learning.
Prostate cancer is the most common male cancer by incidence, and the second most common cause of male cancer death in the United States. In 2007, it is estimated that approximately 218,890 new cases will be diagnosed and 27,050 men will die from this disease (data from the National Cancer Institute). The mortality rate for prostate cancer is declining due to improvements in earlier detection and in local therapy strategies. However, the ability to predict the metastatic behavior of a patient's cancer, as well as to detect and eradicate disease recurrence, remains one of the greatest clinical challenges in oncology. It is estimated that 25-40% of men undergoing radical prostatectomy will have disease relapse. The advent of microarray gene expression technology has greatly enabled the search for predictive disease biomarkers. By monitoring the expression profiling of tens of thousands of genes in a tissue simultaneously, transcriptional profiling can provide a wealth of data for correlation with disease status.
Definitions employed elsewhere in the specification, particularly in the breast cancer section, apply equally to the discussion of prostate cancer-related embodiments. Additionally, as used herein, "nomogram" refers to any prostate cancer prognosis predictor known in the art that accepts at least one input chosen from the set comprising prostate specific antigen level, prostate specific antigen doubling time, primary Gleason grade, secondary Gleason grade, sum of Gleason grades, clinical tumor stage according to a standardized clinical staging system, and the number of positive biopsy cores in conjunction with the number of negative biopsy cores.
Genes for which expression level was predictive of recurrence or non-recurrence of prostate cancer have been identified and converted into a prognostic signature. Statistical analysis demonstrates that a cancer prognostic signature can be limited to expression levels of a relatively small set of identified prognostic marker genes (PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, ZNF324B, and various combinations thereof). The expression levels of these genes can be used as a predictor of cancer recurrence. This prognostic signature can perform comparably to, or outperform, a commonly used prognostic tool, the postoperative nomogram. A prognostic signature combining the nomogram with expression levels for a subset of the identified prognostic marker genes gives additional predictive accuracy.
EXEMPLARY EMBODIMENTS RELATING IN PART TO PROSTATE CANCER
The subject invention provides a method of providing a prognosis for a patient comprising obtaining (e.g., generating or receiving) or supplying data for expression level(s) for one or more genes selected from PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, ZNF324B, and various combinations thereof, and assigning a prognosis to said patient on the basis of the expression levels of said genes. In certain embodiments, the expression and level of expression of the genes is used to assign a prognosis (e.g., a "good prognosis" or a "bad prognosis") to the patient.

Another aspect of the invention provides for methods of assigning a prognosis class to a patient comprising: obtaining (e.g., receiving or generating) data relating to said patient, wherein the data comprises expression level(s) for one or more genes selected from PCOLN3, TGFB3, PAK3, RBM34, RPL23, EI24, FUT7, RICS Rho, MAP4K4, CUTL1, and/or ZNF324B, or any combination thereof; and classifying the patient as belonging to a prognosis class (e.g., a "good prognosis" or a "bad prognosis") based upon expression levels for said genes, wherein a patient placed into a bad prognostic class is scheduled for increased surveillance for the recurrence of cancer. The phrase "increased surveillance" is used herein to indicate that the patient is subjected to an increased frequency of testing for various cancer markers (e.g., body scans, increased frequency of prostate specific antigen testing and/or increased frequency of digital rectal examination).
A further aspect of the invention provides for the use of a postoperative nomogram evaluation of the patient in combination with either of the methods discussed above. Where a postoperative nomogram, such as any one of those provided on the world wide web at mskcc/html/10088.cfm, is utilized in the practice of this invention, one set of genes that can be used for determining a patient's prognostic status is:
[Table: a first set of genes for use in combination with the postoperative nomogram in determining a patient's prognostic status]
Alternatively, the following set of genes can be used for the development of a patient's prognosis.
[Table: an alternative set of genes for the development of a patient's prognosis]
The subject application also provides the following method of assigning treatment to a prostate cancer patient comprising: a) assigning a prognosis class to the patient in accordance with any of the preceding embodiments; and b) providing treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
Nucleic Acids
Nucleic acids, including naturally occurring nucleic acids, oligonucleotides, antisense oligonucleotides, and synthetic oligonucleotides, that hybridize to the nucleic acid biomarkers of the invention (e.g., that hybridize to polynucleotide gene products corresponding to any one of SEQ ID NOs: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16) are useful as agents to detect the presence of biomarkers of the invention in the biological samples of cancer patients or those at risk of cancer. The detection of these biomarkers can be performed according to methods well known in the art, such as those described in the examples.
Although an oligonucleotide of any length may be utilized to hybridize to a nucleic acid that encodes a biomarker polypeptide, oligonucleotides comprising 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or more successive nucleotides of SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16 (or polynucleotide sequences complementary thereto) are preferred. Most preferably, the oligonucleotides are those within the range of 9 to 50 nucleotides.
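By way of illustration only, the following is a minimal sketch (in Python; the sequence fragment and length bounds are chosen here for the example) of enumerating candidate probe subsequences of the preferred lengths from a biomarker sequence, together with the reverse complement to which such a probe hybridizes:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo):
    # A probe hybridizes to the strand given by its reverse complement.
    return oligo.translate(_COMPLEMENT)[::-1]

def candidate_probes(sequence, min_len=9, max_len=50):
    """Enumerate windows of successive nucleotides of each preferred
    length (9-50 nt by default; the bounds can be widened up to the
    full length of the SEQ ID NO in question)."""
    sequence = sequence.replace(" ", "").upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            yield sequence[start:start + length]

# Example input: the first row of SEQ ID NO: 7 (RPL23), spaces removed.
fragment = "GGCCACGTGAGGAGGGTGGGCGGGGCGTTAAAGTTCATATCCCAGTGTCC"
probes = list(candidate_probes(fragment, min_len=9, max_len=12))
print(len(probes), probes[0], reverse_complement(probes[0]))
```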
Also provided by this application is a kit comprising from one to eleven containers, each container comprising a buffer and at least one polynucleotide selected from SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16, fragments of said polynucleotide, or polynucleotides that hybridize with SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16. The kit may contain fewer than eleven containers, and each container may comprise a buffer and a combination of any two, three, four, five, six, seven, eight, nine, ten or eleven polynucleotides as set forth in the Nucleotide Combinations table provided below. The values set forth in the Nucleotide Combinations table correspond to the polynucleotides as identified by each respective SEQ ID NO: (see Table 9). When referring to polynucleotides related to prostate cancer prognosis, the number "1" in the Nucleotide Combinations table refers to the lowest SEQ ID NO appearing in Table 9 (i.e., "1" refers to SEQ ID NO: 6, corresponding to PAK3). Likewise, "2" refers to the second-lowest SEQ ID NO appearing in Table 9 (i.e., "2" refers to SEQ ID NO: 7, corresponding to RPL23), and so on. This is summarized by the following ordered pairs, wherein the first numeral is the numeral appearing in the Nucleotide Combinations table and the second numeral is the prostate cancer prognosis-related SEQ ID NO that is thereby referenced: (1,6); (2,7); (3,8); (4,9); (5,10); (6,11); (7,12); (8,13); (9,14); (10,15); and (11,16).
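As a convenience, the mapping just described can be expressed programmatically. The following is a minimal sketch (Python); the gene names for SEQ ID NOs: 6-15 are taken from the sequence listing below, while the assignment of SEQ ID NO: 16 to ZNF324B is an assumption made here for illustration (it is the one signature gene whose sequence does not appear in this excerpt):

```python
# Numeral n in the Nucleotide Combinations table -> SEQ ID NO: n + 5,
# per the ordered pairs (1,6); (2,7); ...; (11,16) above.
COMBINATION_TO_SEQ_ID = {n: n + 5 for n in range(1, 12)}

SEQ_ID_TO_GENE = {
    6: "PAK3", 7: "RPL23", 8: "EI24", 9: "TGFB3", 10: "RBM34",
    11: "PCOLN3", 12: "FUT7", 13: "RICS Rho", 14: "MAP4K4",
    15: "CUTL1",
    16: "ZNF324B",  # assumption: the remaining signature gene
}

def resolve_combination(numerals):
    """Translate one row of the Nucleotide Combinations table into
    (SEQ ID NO, gene) pairs."""
    return [(COMBINATION_TO_SEQ_ID[n],
             SEQ_ID_TO_GENE[COMBINATION_TO_SEQ_ID[n]])
            for n in numerals]

# A hypothetical five-polynucleotide combination:
print(resolve_combination([1, 2, 3, 4, 5]))
# -> [(6, 'PAK3'), (7, 'RPL23'), (8, 'EI24'), (9, 'TGFB3'), (10, 'RBM34')]
```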
Also provided by the subject invention are gene chips or arrays comprising or consisting of the various combinations of nucleic acids or oligonucleotides disclosed in the Nucleotide Combinations table, or those nucleic acid or oligonucleotide sequences that hybridize to those nucleic acids. Certain embodiments provide an array of 2000 or fewer nucleic acid sequences which contains the 5-gene combination or 11-gene combination (sets of genes) disclosed above. The nucleic acids or oligonucleotides attached to the substrates of the gene chips can range from 8 consecutive nucleotides to the full length of any sequence disclosed herein. Thus, a gene chip according to the invention comprises an array of 2000, 1500, 1000, 500 or 100 or fewer nucleic acid or oligonucleotide sequences which include those nucleic acids disclosed herein, are at least 8 consecutive nucleotides in length, and are no longer than the complete sequence for a given SEQ ID NO:. A gene chip array of this invention contains at least 5, 6, 7, 8, 9, 10 or 11 of the disclosed nucleic acid sequences or oligonucleotides (or those that are fully complementary thereto) and no more than 2000, 1500, 1000, 500 or 100 discrete/different nucleic acid sequences. An "array" represents an intentionally created collection of nucleic acid or oligonucleotide sequences that can be prepared either synthetically or biosynthetically. In particular, the term "array" herein means an intentionally created collection of polynucleotides or oligonucleotides attached to at least a first surface of at least one solid support, wherein the identity of each polynucleotide at a given predefined region is known. Thus, the subject invention provides a gene chip comprising an array of 2000 or fewer nucleic acid sequences, provided that said array includes the five-gene or eleven-gene prognostic signature disclosed above and said nucleic acid sequences are at least 8 consecutive nucleotides in length. The gene chips or arrays discussed herein can, in certain embodiments, contain anywhere from 8 to 50 consecutive nucleotides of the genes provided by the prognostic signatures disclosed herein.
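The size and content constraints on such an array can be checked mechanically. The following is a minimal sketch (Python) under the assumptions that a candidate design is given as a mapping from probe sequence to the gene it represents and that coverage of every signature gene is required; the function and variable names are hypothetical:

```python
def valid_gene_chip(probes_to_gene, signature_genes, max_sequences=2000):
    """Check a candidate array design against the constraints above:
    no more than `max_sequences` discrete sequences, every probe at
    least 8 consecutive nucleotides, and the full 5- or 11-gene
    prognostic signature represented."""
    if len(probes_to_gene) > max_sequences:
        return False
    if any(len(seq) < 8 for seq in probes_to_gene):
        return False
    return set(signature_genes) <= set(probes_to_gene.values())

# Hypothetical design covering three of the signature genes:
design = {"GGCCACGTGAGG": "RPL23", "GAGGTAGAGGAAACG": "PAK3",
          "CCCCGCCTCGTG": "EI24"}
print(valid_gene_chip(design, ["PAK3", "RPL23", "EI24"]))  # True
```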
Following are examples that illustrate materials, methods, and procedures for practicing the invention. The examples are illustrative and should not be construed as limiting.
MATERIALS AND METHODS RELATING TO PROSTATE CANCER EXAMPLE
Dataset. The dataset was built from tissue samples obtained from 79 patients with clinically localized prostate cancer treated by radical prostatectomy at MSKCC between 1993 and 1999. Thirty-nine cases had disease recurrence, as classified by 3 consecutive increases in the serum level of PSA after radical prostatectomy, and forty samples were classified as non-recurrent by virtue of maintaining an undetectable PSA (< 0.05 ng/mL) for at least 5 years after radical prostatectomy. No patient received any neoadjuvant or adjuvant therapy before documented disease recurrence. Samples were snap frozen, examined histologically and enriched for neoplastic epithelium by macrodissection. Gene expression analysis was carried out using the Affymetrix U133A human gene array, which has 22,283 features for individual gene/EST clusters, as per the manufacturer's instructions. Image processing was performed using Affymetrix Microarray Suite 5.0 to produce CEL files, which were used directly in the present inventors' analyses.
In many microarray analyses to date, for genes to be incorporated into an MSKCC-based model, they were filtered using a variety of criteria that included a significant differential expression between the two classes (p-value < 0.001), a fold change > 1.3, and a "present" call in greater than 80% of the samples in either class. Such filter methods with arbitrary cut-off thresholds may introduce bias. It is preferable to allow the computer to decide which genes are useful for prediction, without the use of any arbitrary pre-processing filters. Except for a simple re-scaling of the expression values of each gene to be between 0 and 1, no other preprocessing was performed.

Experiment Procedure. To avoid possible overfitting of a computational model to training data, the present inventors used a rigorous experimental protocol with the leave-one-out cross validation (LOOCV) method to estimate classifier parameters and prediction performance (Wessels et al., "A protocol for building and evaluating predictors of disease state based on microarray data," Bioinformatics, Vol. 21, pp. 3755-62), as depicted in Figure 9. The experimental protocol consists of inner and outer loops. In the inner loop, LOOCV is performed to estimate the optimal classifier parameters based on the training data provided by the outer loop, and in the outer loop, a held-out sample is classified using the best parameters from the inner loop. The experiment is repeated until each sample has been tested. The classification parameters that need to be specified in the inner loop include the kernel width and sparsity parameter of the feature selection algorithm, as well as the structural parameters of a classifier, which leads to a multi-dimensional parameter search. To make the experiment computationally feasible, the present inventors adopted some heuristic simplifications. Linear discriminant analysis (LDA) was used to estimate classification performance. One major advantage of LDA, compared to other classifiers (e.g., SVM and neural networks (Bishop, 2006, Pattern Recognition and Machine Learning (Springer))), is that LDA has no structural parameters. The present inventors predefined the kernel width as 5, and estimated the sparsity parameter through LOOCV in the inner loop. In simulations, the present inventors found that the choice of the kernel width is not critical, and the algorithm performs similarly over a large range of values for this parameter.
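A minimal sketch of this inner/outer LOOCV protocol follows (Python, assuming NumPy and scikit-learn). The `select_features` placeholder stands in for the inventors' local-learning feature selection algorithm, which is not reproduced here; the simple class-mean ranking used in its place is an illustration only, not the patented method:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rescale(X):
    """Re-scale each gene's expression values to lie between 0 and 1,
    the only preprocessing applied."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def select_features(X, y, sparsity):
    # Placeholder ranking (NOT the local-learning algorithm): keep the
    # `sparsity` genes with the largest class-mean difference.
    # Labels are assumed to be 0 (non-recurrent) and 1 (recurrent).
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(score)[::-1][:sparsity]

def nested_loocv(X, y, sparsity_grid=(5, 10, 20)):
    """Outer loop: classify each held-out sample with the sparsity
    chosen by an inner LOOCV over the remaining training data."""
    X, y = rescale(np.asarray(X, float)), np.asarray(y)
    n, preds = len(y), np.empty(len(y))
    for i in range(n):
        train = np.delete(np.arange(n), i)
        best_s, best_acc = sparsity_grid[0], -1.0
        for s in sparsity_grid:          # inner parameter search
            hits = 0
            for j in range(len(train)):  # inner LOOCV
                inner = np.delete(train, j)
                feats = select_features(X[inner], y[inner], s)
                lda = LinearDiscriminantAnalysis().fit(
                    X[np.ix_(inner, feats)], y[inner])
                hits += int(lda.predict(X[np.ix_([train[j]], feats)])[0]
                            == y[train[j]])
            if hits / len(train) > best_acc:
                best_acc, best_s = hits / len(train), s
        feats = select_features(X[train], y[train], best_s)
        lda = LinearDiscriminantAnalysis().fit(X[np.ix_(train, feats)],
                                               y[train])
        preds[i] = lda.predict(X[np.ix_([i], feats)])[0]
    return preds  # one held-out prediction per sample
```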
Statistical Analysis. Kaplan-Meier survival plots and log-rank tests (Kirkwood et al., 2003, Essential Medical Statistics (Blackwell Science)) were used to assess the predictive values of different prognostic approaches. The Mantel-Cox estimation of hazard ratio was performed to quantify the relative risk of biochemical recurrence in the bad-prognosis group compared with the good-prognosis group. A hazard ratio above 1.0 indicates that the patients assigned to the bad-prognosis group have a higher probability of developing disease recurrence than those in the good-prognosis group. In most microarray data analyses, the numbers of available patient samples are usually quite small, and some performance measurements (e.g., hazard ratios) are heavily influenced by the choice of a decision threshold. A receiver operating characteristic (ROC) curve obtained by varying a decision threshold provides a direct view of how a predictive approach performs at different sensitivity and specificity levels. The specificity is defined as the probability that a patient who did not experience disease recurrence was assigned to the good-prognosis group, and the sensitivity is the probability that a patient who developed disease recurrence was assigned to the bad-prognosis group. The most frequently used criterion for comparing multiple ROC curves is the area under a ROC curve, commonly denoted as AUC, which can range from 0.5 (no discrimination) to 1.0 (perfect ability to discriminate). MedCalc version 8.0 (MedCalc Software, Mariakerke, Belgium) was used to perform the ROC curve analysis. A p-value of 0.05 is considered statistically significant.
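For concreteness, a minimal sketch of the ROC computation under these definitions follows (Python/NumPy); it illustrates the quantities described above and is not a reproduction of MedCalc's implementation:

```python
import numpy as np

def roc_points(scores, recurred):
    """Sweep a decision threshold over the model scores (higher score
    means assignment to the bad-prognosis group) and return
    (sensitivity, specificity) pairs per the definitions above."""
    scores = np.asarray(scores, float)
    recurred = np.asarray(recurred, bool)
    points = []
    for t in np.unique(scores):
        bad = scores >= t
        sensitivity = (bad & recurred).sum() / recurred.sum()
        specificity = (~bad & ~recurred).sum() / (~recurred).sum()
        points.append((sensitivity, specificity))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve; 0.5 indicates no
    discrimination and 1.0 a perfect ability to discriminate."""
    curve = sorted((1 - spec, sens) for sens, spec in points)
    curve = [(0.0, 0.0)] + curve + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
```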
PROSTATE CANCER EXAMPLE - DETERMINATION OF PROGNOSTIC MARKERS
The present inventors developed two computational models to predict the biochemical recurrence of prostate cancer. The first model is based exclusively on gene expression data obtained from tissue samples, and the second combines the predictive information of both genetic and clinical variables. Specifically, in the latter combination (or hybrid) model the clinical variable used was the 7-year probability of disease recurrence estimated by the postoperative nomogram.
ROC curve analysis was performed to compare the prediction performance of the two novel prognosis models and the nomogram alone (Figure 14). The nomogram performed reasonably well, consistent with multiple studies reported in the literature (Stephenson et al., 2005, Cancer, Vol. 104, pp. 290-298; Stephenson et al., 2005, J. Clin. Oncol., Vol. 23, pp. 7005-7012), but the genetic model predicted disease recurrence more accurately than the nomogram, specifically in the high-specificity region. At the 90% sensitivity level, the genetic signature correctly classified 69 out of 79 samples (87%), including 34 non-recurrent and 35 recurrent tumors. This is the first reported genetic signature to outperform the clinically used predictive nomogram. Furthermore, a hybrid signature derived by combining the gene expression data with clinical information outperformed both the nomogram and the genetic signature. At the 90% sensitivity level, the hybrid signature improved the specificities of the genetic model and the nomogram by about 10% and 20%, respectively. It correctly classified 74 out of 79 samples (94%), including 38 non-recurrent and 36 recurrent tumors (Figure 14). Statistical analysis of the ROC curves revealed the predictive accuracy of the hybrid signature to be significantly superior to that of the postoperative nomogram (p-value < 0.0001) and the gene-expression model (p-value < 0.05). The odds ratios (OR) of the hybrid and genetic models, reported in Table 8, show that the patients assigned to the bad-prognosis group are 18.2 (95% CI: 5.9 - 56.2) and 16.5 (95% CI: 5.4 - 51.0) times more likely to develop disease recurrence than those assigned to the good-prognosis group, respectively, which is much higher than the odds ratio of the nomogram (8.4, 95% CI: 2.9 - 24.6).
[Table 8: odds ratios of the hybrid model (18.2, 95% CI: 5.9 - 56.2), the genetic model (16.5, 95% CI: 5.4 - 51.0), and the nomogram (8.4, 95% CI: 2.9 - 24.6)]
To further demonstrate the predictive value of the three approaches in assessing the risk of biochemical recurrence in prostate cancer patients, survival data analyses were performed (see Figs. 15A-C). The Kaplan-Meier curve of the hybrid model, plotted in Figure 15A, shows a significant difference in the probability of remaining free of disease recurrence between patients with good and bad prognosis (p-value < 0.001). The Mantel-Cox estimate of the hazard ratio of biochemical recurrence of prostate cancer within five years for the hybrid model is 29.1 (95% CI: 8.3 - 102.1), which is much larger than those of either the nomogram (11.9, 95% CI: 3.8 - 36.9) or the genetic model (18.0, 95% CI: 5.9 - 54.5). At the 5-year end point, all three approaches have similarly low relapse rates in patients with good prognosis, but the patients assigned to the bad-prognosis group by the hybrid model have a much lower probability of remaining free of disease recurrence (0.21, 95% CI: 0.07 - 0.33) than that determined by the nomogram (0.35, 95% CI: 0.22 - 0.50). The difference is more than one error bar, and hence is statistically significant.
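A minimal sketch of these survival analyses follows, assuming the third-party `lifelines` package; the input names (`months`, `recurred`, `bad_prognosis`) are hypothetical arrays of follow-up time, recurrence-event indicators, and the model's prognosis calls, and a Cox model with a single group indicator is used here as a stand-in for the Mantel-Cox hazard-ratio estimate:

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

def survival_summary(months, recurred, bad_prognosis):
    """Kaplan-Meier curves, a log-rank test, and a hazard ratio for
    the bad- vs good-prognosis groups."""
    df = pd.DataFrame({"months": months, "recurred": recurred,
                       "bad": np.asarray(bad_prognosis, int)})
    good, bad = df[df.bad == 0], df[df.bad == 1]

    # Kaplan-Meier plot of the probability of remaining recurrence-free.
    for label, grp in (("good prognosis", good), ("bad prognosis", bad)):
        KaplanMeierFitter().fit(grp.months, grp.recurred,
                                label=label).plot()

    # Log-rank test between the two prognosis groups.
    lr = logrank_test(good.months, bad.months,
                      event_observed_A=good.recurred,
                      event_observed_B=bad.recurred)

    # Hazard ratio from a Cox model fit on the group indicator.
    cox = CoxPHFitter().fit(df, duration_col="months",
                            event_col="recurred")
    hazard_ratio = float(np.exp(cox.params_["bad"]))
    return lr.p_value, hazard_ratio
```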
With a small sample size, each iteration in LOOCV may generate a different prognostic signature since the training data used is different (Simon, 2005, J. Clin. Oncol., Vol. 23, pp. 7332-41). In the genetic modeling approach, a 5-, 6-, 7- and 8-gene model was developed in 7, 43, 24 and 5 iterations, respectively. The 79 total iterations correspond to one iteration for each of the 79 patients in the dataset, as dictated by the LOOCV method. A total of 11 unique genes were identified in the consensus genetic prognostic signature (Table 9). The mean expression of each gene in the 79 tumor samples obtained from patients with, and without, disease recurrence was visualized by creating individual scatter plots. The observed pattern (under- or over-expression) in the recurrent cases for each gene, and the frequency of occurrence of each gene over the 79 algorithm iterations, are listed in Table 9. A high occurrence rate is an indication of the importance of the corresponding gene for predicting disease recurrence. In the hybrid modeling approach, the nomogram output was selected in all 79 iterations, and 4, 5, and 6 genes were identified in 69, 9, and 1 iteration(s), respectively. A total of 5 unique genes were included in the consensus hybrid model. Notably, all of these genes were also present in the genetic model, and three genes (PAK3, RPL23, and EI24) occurred at a high frequency in both the hybrid modeling and the pure genetic modeling (Tables 9 and 10).
[Table 9: the eleven genes of the consensus genetic prognostic signature, with the observed expression pattern in recurrent cases and the frequency of occurrence of each gene over the 79 iterations]
[Table 10: the five genes of the consensus hybrid prognostic signature]
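The occurrence tallies reported in Tables 9 and 10 can be reproduced with a simple count over the per-iteration gene selections, as in the following sketch (Python); the example selections shown are hypothetical:

```python
from collections import Counter

def consensus_signature(selected_per_iteration):
    """Tally how often each gene is selected across the LOOCV
    iterations (79 here, one per held-out patient); a high occurrence
    rate indicates importance for predicting disease recurrence."""
    counts = Counter(gene
                     for genes in selected_per_iteration
                     for gene in set(genes))
    return counts.most_common()

# Hypothetical selections from two iterations:
iterations = [["PAK3", "RPL23", "EI24", "TGFB3", "RBM34"],
              ["PAK3", "RPL23", "EI24", "MAP4K4", "FUT7", "RBM34"]]
print(consensus_signature(iterations))
```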
Discussion. The present application provides a genetic signature that predicts disease recurrence after radical prostatectomy with 87% overall accuracy. Furthermore, a hybrid signature derived by combining the gene expression data with the 7-year PFP score outperformed both the nomogram and the genetic signature, correctly classifying 74 out of 79 samples. Statistical analyses also clearly demonstrated the superiority of the hybrid signature over a prognostic system that uses only genetic or clinical markers. Though the nomogram performs very well when the estimated 7-year disease progression-free probability is larger than 90%, it assigns a significant number of non-recurrent patients to the bad-prognosis group. It is evident in Figures 14, 15, and 16 that microarray data provides additional information to stratify these patients.
Three genes that were most highly weighted in both the genetic and hybrid signatures were RPL23, EI24, and PAK3. RPL23 is a member of the ribosomal protein family that acts to stabilize rRNA structure, regulate catalytic function, and integrate translation with other cellular processes; recent studies have shown that many ribosomal proteins also have extra-ribosomal cellular functions independent of protein biosynthesis. EI24/PIG8 is localized in the endoplasmic reticulum (ER) and, by virtue of its binding to Bcl-2, has been linked with the modulation of apoptosis. PAK3 is a Group I member of the p21-activated kinase (Pak) family of serine/threonine protein kinases that bind to and modulate the activity of the small GTPases Cdc42 and Rac. GTPase signaling controls many aspects of the cellular response to the environment, and through these interactions, PAKs have been shown to be involved in the regulation of cellular processes such as gene transcription, cell morphology, motility, and apoptosis.
As well as having an impact on clinical decision-making, it is hoped that microarray data will advance the understanding of cancer biology, which in turn will inform the development of new and effective therapies. Because the diagnostic and prognostic signatures reported to date have been composed of tens or hundreds of genes, choosing genes to study functionally remains difficult and somewhat arbitrary. A major advantage of deriving accurate prognostic signatures comprising just a few genes is that it greatly facilitates the task of functional investigation. The number of genes was further reduced to 5 in the present inventors' clinical-genetic hybrid signature, and it is notable that all 5 genes were also amongst the 11 genes comprising the consensus genetic signature. This was not necessarily to be expected, because the analysis used to derive the hybrid signature was not in any way informed by the genetic signature analysis. While they used the same raw data, the two signatures were derived entirely independently.
All patents, patent applications, provisional applications, and publications referred to or cited herein are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.
It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application. Each method herein optionally includes taking a biological sample from the patient and optionally includes analyzing the sample using an article of manufacture or kit of the invention.
SEQ ID NO: 1 (LOC58509, presented here as sequence for C19orf29)
1 GAAGCAGGAG CAGGGCGTGG AGAGCGAGCC GCTGTTCCCC ATCCTCAAGC AGGAGCCCCA
61 GTCCCCCAGC CGCAGCCTGG AGCCTGAGGA CGTGGCGCCC ACCCCGCCCG GGCCCTCCTC
121 GGAGGGCGGC CCCGCGGAGG CCGAGGTGGA CGGCGCGACC CCGACAGAGG GCGACGGCGA
181 CGGGGACGGT GAGGGCGAGG GCGAGGGCGA GGCGGTGCTC ATGGAGGAGG ACCTGATCCA
241 GCAGAGCCTG GACGACTACG ACGCCGGCAG GTACAGCCCG CGGCTGCTCA CGGCGCACGA
301 GCTGCCACTG GACGCGCACG TGCTGGAACC GGATGAGGAC CTGCAGCGCC TGCAGCTCTC
361 GCGCCAGCAG CTCCAGGTCA CGGGAGACGC CAGCGAGAGC GCCGAGGACA TCTTCTTCCG
421 GCGGGCCAAG GAGGGCATGG GCCAGGACGA GGCGCAGTTC AGCGTGGAGA TGCCACTCAC
481 CGGCAAGGCC TACCTGTGGG CCGACAAGTA CCGGCCACGC AAGCCGCGCT TCTTCAACCG
541 CGTGCACACG GGCTTCGAGT GGAACAAGTA CAACCAGACG CACTACGACT TTGACAACCC
601 ACCGCCCAAG ATCGTGCAGG GATACAAGTT CAACATCTTC TACCCCGACC TCATCGACAA
661 GCGCTCCACG CCCGAGTACT TCCTGGAGGC CTGCGCCGAC AACAAGGATT TCGCCATCCT
721 GCGCTTCCAC GCGGGGCCGC CCTACGAGGA CATCGCTTTC AAGATCGTCA ACCGCGAGTG
781 GGAATACTCG CACCGCCACG GCTTCCGCTG CCAGTTTGCC AATGGCATCT TCCAGCTGTG
841 GTTTCACTTC AAGCGCTACC GCTATCGGCG GTGACGGCCC TGGGGAACGG CAGGCCAGGA
901 GGGCCGAGGG CCACACGGGT GCCACAGCCC AGGTCGGAGT GGCCCAGCCG GCAGGCTTGT
961 TCTTCAGCAT CCGACGGGAA CATCTCCAAC AGAAGCAAAA CGGAAAGTGC CTCCCGGACC
1021 CCCAGAGGGC CACCCAACCT CACCAGTCAC CAGCCCCAGA CCACCCACAG CCCCTCCCAG
1081 ACACCCCGCC TCATCTGGAA ATAGTTCCGT TTGTTTCTCT AAAAAGACTT GTAGGTGGGA
1141 AAAAAAATCT TTTGTTCTCA TGGAATTGGC CTATTGGCAA GATCGCATGT TTTTTTAATA
1201 AACGTTGTAT TTTAGAATAA AAAAAAAAAA AAA
SEQ ID NO: 2 (CEGP1, presented here as sequence for SCUBE2)
1 GGCGTCCGCG CACACCTCCC CGCGCCGCCG CCGCCACCGC CCGCACTCCG CCGCCTCTGC
61 CCGCAACCGC TGAGCCATCC ATGGGGGTCG CGGGCCGCAA CCGTCCCGGG GCGGCCTGGG
121 CGGTGCTGCT GCTGCTGCTG CTGCTGCCGC CACTGCTGCT GCTGGCGGGG GCCGTCCCGC
181 CGGGTCGGGG CCGTGCCGCG GGGCCGCAGG AGGATGTAGA TGAGTGTGCC CAAGGGCTAG
241 ATGACTGCCA TGCCGACGCC CTGTGTCAGA ACACACCCAC CTCCTACAAG TGCTCCTGCA
301 AGCCTGGCTA CCAAGGGGAA GGCAGGCAGT GTGAGGACAT CGATGAATGT GGAAATGAGC
361 TCAATGGAGG CTGTGTCCAT GACTGTTTGA ATATTCCAGG CAATTATCGT TGCACTTGTT
421 TTGATGGCTT CATGTTGGCT CATGACGGTC ATAATTGTCT TGATGTGGAC GAGTGCCTGG
481 AGAACAATGG CGGCTGCCAG CATACCTGTG TCAACGTCAT GGGGAGCTAT GAGTGCTGCT
541 GCAAGGAGGG GTTTTTCCTG AGTGACAATC AGCACACCTG CATTCACCGC TCGGAAGAGG
601 GCCTGAGCTG CATGAATAAG GATCACGGCT GTAGTCACAT CTGCAAGGAG GCCCCAAGGG
661 GCAGCGTCGC CTGTGAGTGC AGGCCTGGTT TTGAGCTGGC CAAGAACCAG AGAGACTGCA
721 TCTTGACCTG TAACCATGGG AACGGTGGGT GCCAGCACTC CTGTGACGAT ACAGCCGATG
781 GCCCAGAGTG CAGCTGCCAT CCACAGTACA AGATGCACAC AGATGGGAGG AGCTGCCTTG
841 AGCGAGAGGA CACTGTCCTG GAGGTGACAG AGAGCAACAC CACATCAGTG GTGGATGGGG
901 ATAAACGGGT GAAACGGCGG CTGCTCATGG AAACGTGTGC TGTCAACAAT GGAGGCTGTG
961 ACCGCACCTG TAAGGATACT TCGACAGGTG TCCACTGCAG TTGTCCTGTT GGATTCACTC
1021 TCCAGTTGGA TGGGAAGACA TGTAAAGATA TTGATGAGTG CCAGACCCGC AATGGAGGTT
1081 GTGATCATTT CTGCAAAAAC ATCGTGGGCA GTTTTGACTG CGGCTGCAAG AAAGGATTTA
1141 AATTATTAAC AGATGAGAAG TCTTGCCAAG ATGTGGATGA GTGCTCTTTG GATAGGACCT
1201 GTGACCACAG CTGCATCAAC CACCCTGGCA CATTTGCTTG TGCTTGCAAC CGAGGGTACA
1261 CCCTGTATGG CTTCACCCAC TGTGGAGACA CCAATGAGTG CAGCATCAAC AACGGAGGCT 1321 GTCAGCAGGT CTGTGTGAAC ACAGTGGGCA GCTATGAATG CCAGTGCCAC CCTGGGTACA
1381 AGCTCCACTG GAATAAAAAA GACTGTGTGG AAGTGAAGGG GCTCCTGCCC ACAAGTGTGT 1441 CACCCCGTGT GTCCCTGCAC TGCGGTAAGA GTGGTGGAGG AGACGGGTGC TTCCTCAGAT
1501 GTCACTCTGG CATTCACCTC TCTTCAGATG TCACCACCAT CAGGACAAGT GTAACCTTTA 1561 AGCTAAATGA AGGCAAGTGT AGTTTGAAAA ATGCTGAGCT GTTTCCCGAG GGTCTGCGAC
1621 CAGCACTACC AGAGAAGCAC AGCTCAGTAA AAGAGAGCTT CCGCTACGTA AACCTTACAT
1681 CCAGCTCTGG CAAGCAAGTC CCAGGAGCCC CTGGCCGACC AAGCACCCCT AAGGAAATGT 1741 TTATCACTGT TGAGTTTGAG CTTGAAACTA ACCAAAAGGA GGTGACAGCT TCTTGTGACC 1801 TGAGCTGCAT CGTAAAGCGA ACCGAGAAGC GGCTCCGTAA AGCCATCCGC ACGCTCAGAA 1861 AGGCCGTCCA CAGGGAGCAG TTTCACCTCC AGCTCTCAGG CATGAACCTC GACGTGGCTA 1921 AAAAGCCTCC CAGAACATCT GAACGCCAGG CAGAGTCCTG TGGAGTGGGC CAGGGTCATG 1981 CAGAAAACCA ATGTGTCAGT TGCAGGGCTG GGACCTATTA TGATGGAGCA CGAGAACGCT
2041 GCATTTTATG TCCAAATGGA ACCTTCCAAA ATGAGGAAGG ACAAATGACT TGTGAACCAT
2101 GCCCAAGACC AGGAAATTCT GGGGCCCTGA AGACCCCAGA AGCTTGGAAT ATGTCTGAAT
2161 GTGGAGGTCT GTGTCAACCT GGTGAATATT CTGCAGATGG CTTTGCACCT TGCCAGCTCT
2221 GTGCCCTGGG CACGTTCCAG CCTGAAGCTG GTCGAACTTC CTGCTTCCCC TGTGGAGGAG
2281 GCCTTGCCAC CAAACATCAG GGAGCTACTT CCTTTCAGGA CTGTGAAACC AGAGTTCAAT
2341 GTTCACCTGG ACATTTCTAC AACACCACCA CTCACCGATG TATTCGTTGC CCAGTGGGAA
2401 CATACCAGCC TGAATTTGGA AAAAATAATT GTGTTTCTTG CCCAGGAAAT ACTACGACTG 2461 ACTTTGATGG CTCCACAAAC ATAACCCAGT GTAAAAACAG AAGATGTGGA GGGGAGCTGG 2521 GAGATTTCAC TGGGTACATT GAATCCCCAA ACTACCCAGG CAATTACCCA GCCAACACCG
2581 AGTGTACGTG GACCATCAAC CCACCCCCCA AGCGCCGCAT CCTGATCGTG GTCCCTGAGA 2641 TCTTCCTGCC CATAGAGGAC GACTGTGGGG ACTATCTGGT GATGCGGAAA ACCTCTTCAT
2701 CCAATTCTGT GACAACATAT GAAACCTGCC AGACCTACGA ACGCCCCATC GCCTTCACCT
2761 CCAGGTCAAA GAAGCTGTGG ATTCAGTTCA AGTCCAATGA AGGGAACAGC GCTAGAGGGT
2821 TCCAGGTCCC ATACGTGACA TATGATGAGG ACTACCAGGA ACTCATTGAA GACATAGTTC
2881 GAGATGGCAG GCTCTATGCA TCTGAGAACC ATCAGGAAAT ACTTAAGGAT AAGAAACTTA
2941 TCAAGGCTCT GTTTGATGTC CTGGCCCATC CCCAGAACTA TTTCAAGTAC ACAGCCCAGG
3001 AGTCCCGAGA GATGTTTCCA AGATCGTTCA TCCGATTGCT ACGTTCCAAA GTGTCCAGGT 3061 TTTTGAGACC TTACAAATGA CTCAGCCCAC GTGCCACTCA ATACAAATGT TCTGCTATAG 3121 GGTTGGTGGG ACAGAGCTGT CTTCCTTCTG CATGTCAGCA CAGTCGGGTA TTGCTGCCTC 3181 CCGTATCAGT GACTCATTAG AGTTCAATTT TTATAGATAA TACAGATATT TTGGTAAATT
3241 GAACTTGGTT TTTCTTTCCC AGCATCGTGG ATGTAGACTG AGAATGGCTT TGAGTGGCAT
3301 CAGCTTCTCA CTGCTGTGGG CGGATGTCTT GGATAGATCA CGGGCTGGCT GAGCTGGACT
3361 TTGGTCAGCC TAGGTGAGAC TCACCTGTCC TTCTGGGGTC TTACTCCTCC TCAAGGAGTC
3421 TGTAGTGGAA AGGAGGCCAC AGAATAAGCT GCTTATTCTG AAACTTCAGC TTCCTCTAGC 3481 CCGGCCCTCT CTAAGGGAGC CCTCTGCACT CGTGTGCAGG CTCTGACCAG GCAGAACAGG 3541 CAAGAGGGGA GGGAAGGAGA CCCCTGCAGG CTCCCTCCAC CCACCTTGAG ACCTGGGAGG 3601 ACTCAGTTTC TCCACAGCCT TCTCCAGCCT GTGTGATACA AGTTTGATCC CAGGAACTTG 3661 AGTTCTAAGC AGTGCTCGTG AAAAAAAAAA GCAGAAAGAA TTAGAAATAA ATAAAAACTA 3721 AGCACTTCTG GAGACAT
SEQ ID NO: 3 (AL080059)
1 GACGTCGGCG GTTGCCGTGG AGAGACCCGC TCGGTCCCGA CCGAGAGCTG GCGTCAGGAG
61 CCGCAGGGTC ACAGCGTGTC TTTGAAGCTG CCTCCGCCGC CACCATGAGC GGCCGAAGTC
121 GGGGTCGAAA GTCCTCCCGC GCCAAAAACC GGGGCAAAGG CCGCGCCAAA GCCCGAGTCC
181 GCCCTGCTCC GGACGACGCC CCGCGCGACC CGGACCCTTC ACAGTACCAG AGTCTCGGGG
241 AAGACACCCA GGCGGCACAG GTGCAGGCTG GCGCGGGGTG GGGTGGCCTG GAAGCCGCTG
301 CGTCCGCGCA GCTCCTCCGG CTCGGGGAGG AGGCCGCCTG CCGGCTCCCC CTGGACTGTG
361 GCCTCGCGCT GCGGGCCCGA GCTGCGGGGG ACCACGGGCA GGCCGCGGCC AGGCCCGGCC
421 CGGGGAAGGC CGCATCTCTC TCGGAGCGCC TGGCCGCAGA CACTGTCTTC GTGGGAACAG
481 CGGGAACCGT GGGAAGGCCG AAAAATGCCC CCCGCGTTGG AAACCGGCGT GGCCCTGCCG
541 GGAAGAAGGC CCCAGAAACC TGTAGCACCG CGGGGAGGGG GCCTCAGGTC ATAGCTGGTG
601 GGAGGCAGAA GAAAGGGGCG GCAGGGGAGA ATACCTCGGT GTCAGCTGGG GAGGAAAAGA
661 AGGAAGAGAG GGATGCAGGG TCGGGGCCCC CAGCGACGGA AGGCAGCATG GATACGCTGG
721 AGAACGTGCA GCTGAAGCTG GAGAACATGA ACGCCCAGGC GGACAGGGCC TACCTTCGGC
781 TCTCCAGGAA GTTTGGGCAG TTGCGACTGC AGCACTTGGA GCGCAGGAAC CACCTCATCC
841 AAAATATCCC GGGCTTCTGG GGGCAAGCAT TTCAGAACCA TCCCCAGCTA GCATCCTTTC
901 TGAATAGCCA AGAGAAAGAG GTACTGAGCT ACTTAAACAG CTTGGAAGTG GAAGAGCTCG
961 GCCTTGCCAG ATTGGGCTAC AAAATCAAGT TCTACTTCGA TCGCAACCCG TATTTCCAAA
1021 ATAAGGTGCT CATCAAGGAA TATGGGTGTG GTCCTTCTGG CCAGGTGGTG TCTCGTTCTA
1081 CTCCAATCCA GTGGCTCCCA GGGCATGATC TCCAGTCCCT AAGCCAGGGA AACCCAGAAA
1141 ACAACCGTAG TTTCTTTGGG TGGTTTTCAA ACCACAGCTC CATTGAGTCT GACAAGATTG
1201 TGGAGATAAT CAACGAAGAA TTGTGGCCCA ATCCCTTGCA GTTCTACCTT TTGAGTGAAG
1261 GGGCTCGTGT AGAGAAAGGA AAGGAAAAAG AAGGCAGGCA AGGTCCAGGA AAGCAGCCAA
1321 TGGAGACTAC TCAGCCTGGG GTGAGCCAAT CCAACTGATC CTGCACAAGT CTCCCTGCTA
1381 CCTCCTGCTG GGCTGCTCCT GGTTGACTAC ATATTTGGCT CTCTGCCTTC TCTTCATGTT
1441 GGCCTCTGTG TACTATAGAC CCTTTGGACT TTAGTTACAA GAATATGGCA TTTATGATCA 1501 TGTTCTTTTG CCTCCCATAG CCTCTTGCTT TTCTACGTTA GTGTCACATG TTATGTGATC 1561 GCACCCGGAC TGTTCATATG TTAAGCACAC ATACTTTAGA TGTGTGCTCC CTGTTATCTA 1621 CATTGCCTGG ACCTTGTTCT TGGCTCTGGC CTTTCCTACT TGTCTTTAAC ATGGACACTC
1681 ATGAGTCATC CTCCAGGACG ACCAGTGCAT CCTCAAGCTC TGCTCCTTCA GGGCCAGTGA 1741 CTTGCTGGGC TTCGTGTTCA TGCTGGCCAT GCATGTCATC TATGCAAGCG TTTCCATTGC 1801 CACTGCAGAA TGAAGCCAGT GCACATGGTA TTAGTCATCA GCCAGAACTT CCTGTTCTGT
1861 GGGTCTGGGG TCACCAGCTG AGATGTCACC TGCTTTATTT CTGGCTTTGG CCTGTGGTCT 1921 GTGATACCCA TCCTGCTTGA TGTTCTGCAG AATGGCACTT GACTGCTGGG CATGCATGAA
1981 GTTAAGGGCA AGAAACAGTA TGCCATGTGT TCTGTACCAT CATGTGTCTC TTCTTGCTTC 2041 TGGGCCCTTC TACTGGTGAA CTTTCATCAA GATCTGCGCC ATGCCGTGTC ACTATCAAGC 2101 CATTAAGTTT TGTCTGGGTT GCTGTCAGCC CCAGTTGGCT TCCTGGTCAA CAAGGACCTC 2161 AAGAACTGCC TGTGGACCGA GGCCCCCTAC CAGTGCATGA GACACACACC TACCCTCCCC 2221 AGCTTTCCAG GAACCCTACT GGCTGCCAGA CTGATGGGCG GGCTGGTATG TGTGGACATG 2281 TGTTCACTGT CATTATGCTG TGGCTCCAGG TGAGGGTGAG GACTGGGCCT ATATAGAATC 2341 CAGATACCAT TGTCAACTTC CCTTATTCCC GTCTAAGATG TGAGCAGAGT GCCATAGTAG 2401 GGGTTCTGGG AAGAGGTATT TCTGATTTGT GGGCCTCTGC TTGCTTGACT TCAGGTCACT 2461 TATACTTCTT ATTTTGCTTG CCTGCCTTCA TCCCTCATTT CCTCCCTCTC ATTCTTCTTT 2521 CCTCCCTCCC TTTCCTGGTA GCCTCCTTTC CTCCCCTTCT GCCTTCCCCT TCCTTCTTTC 2581 CTTATTCTTT TTTATTTTGT TTAAATAGTA CCACAGAGAA AACAACTGAA AAACCACATT 2641 TTTCTACATA CAGCTGGGGA GGTAGCTGAG AACTTGGCAC TGCGCACACA TACTAGGTTG 2701 AAAGAGAGTT GAGGAAACCA GAAGGCCAAG TGGATCTGCT GGCAAACCCT GAACCTGTCT 2761 CCTGCGCTTG CTCTACAGTT CTGAAGTTGA AAATCCTTTT CATGCCTAGC ATCTGCTTGA 2821 GTTATAAACC CCAAGGCAGC CATGTCATAG ACTAGTGTTT ACTCTTGTTT TGACTTTGTT 2881 TTAATGCTTC CTAAGACCCA AGTGCCTCCT GCTGTTTCCT CCTTTGTGGT AGCCTCTGGC
2941 CATCTGGACC TCAATCCCCA GCTTTCCCAC TTTCAGCAGT CCTTTGCTCT CTTTGCTTCT
3001 ACCTCAAATA GCCCCAGGAG TGGGCTTTAG TCTCCAATAT GGAGCATCTC AAGCTTCTCC 3061 TGGGGGATGG GGATTGGGAT GGGCAGAATC TGTTTTGGAT CTCCGGGTTA TTTCCAGTGG 3121 GTGTAAAAGC AGAGCTGGGC CTTTCCCTCT CTTATCCCTG AGGGTGGGTA AGAAGGACTG 3181 TATCTACACC TGTTCTTCCC TACCTTCTCT TTTGTTAGGG AGGCCTCATT CTAAGTTCCT
3241 CAAGAGAGTC CTTGGCTTAA AGCTGTAGCA AGGGTGTGCT AGGTGGGGGA TTTGGAGCAA
3301 AACCGTCGAG TAGGCATGAT ACTGGTATGG AGTGGGCCTG CAAAATCAGA CAGAAATGGC 3361 TTGAGAAGCC GCAGGGGGAG CATGCCTGTC TCTCAGTGAT AGAGTATGGG AGGGACCTCC
3421 CTAGCTTGGA AAATGAGAAT TGAAGGGGTT ATGAACAAAT AGGATGCCTA GTTGAGGATG
3481 TTCCCAAAGT TTTGTCCAAT CTTATCATTA GTAGATTTTA TAAGCCACAG AGACAAACCA
3541 GAAACGGAAT AATGTTACTT TGGATGCTTT ATTTTTTTGT TCTAGGTGTG GCTTTGTACA
3601 TGCAGAAGAA TGCTATATGC TGCACATTTT GCCTTTAAAG TCTTACGACT TTCCCCATTT
3661 TAGTCTAATG GGAAGATACA GATGTGCAAG TCTGCTTTTT TGTTTTTTGT TATTATTTTT
3721 TTTTTTTTGC TCTGTGTTAT GGACATTTTC AGACATGCAC AGAAGTGGAG AGGATGGTCC
3781 TTGGACCCCA TGTGTCCATC ACCTAGCTGC ATCACTTATC AGCTATGGTC AACCTGGTTT
3841 CATCTGTATC TCTCTCTTTT CACCTGTATT GTTTATTGAA AATCCAAGAC ACTATGCCAA
3901 TGCAACCGTG ACTACTTTGG GAGATTGGTA GTCTCTTTTG ATGGTGATAG TGATGGGGTG
3961 CACTATCATA ATCACATCAG GTCTGCTTTT TGCTTTTAAT GTTAACTAAT GAAGTTCCAG
4021 AGATGGGCCT TAGAAATGTG TTTTAAGAAT TAACAAGGAG TCTCAAAAAG AAATGAGAGG 4081 GATGCTTCCT TTCCCTTGCA TCTACAAAAC AAGAGAGAGA CTGTTCTGTT GTAAAACTCT 4141 TTCAAAAATT CTGATATGGT AAGGTACTTG AGACCCTTCA CCAGAATGTC AATCTTTTTT 4201 TCTGTGTAAC ATGGAAACTT GTGTGACCAT TAGCATTGTT ATCAGCTTGT ACTGGTCTCA
4261 TAACTCTGGT TTTGGAAGAA TAATTTGGAA ATTGTTGCTG TGTTCTGTGA AAATAACCTC 4321 CCCAAAATAA TTAGTAACTG GTTGTTCTAC TTGGTAATTT GACACCCTGT TAATAACGCA
4381 ATTATTTCTG TGTTCTTAAA CAGTATAAAT AGTTGTAAGT TTGCATGCAT GATGGAAAAA 4441 TAAAAACCTG TATCTCTGTT ATAAAAAAAA AAAAAAAA
SEQ ID NO: 4 (ATP5E, presented here as sequence for ATP5E)
1 CGACGATCTT CCTGCGGCTG AACCGCCCGG CTGAGCCGAC ATTGCCGGCG TCTTGGCGAT
61 TCGGCCCGAC GAGCTCCGCT TTCGCTACAG CATGGTGGCC TACTGGAGAC AGGCTGGACT
121 CAGCTACATC CGATACTCCC AGATCTGTGC AAAAGCAGTG AGAGATGCAC TGAAGACAGA
181 ATTCAAAGCA AATGCTGAGA AGACTTCTGG CAGCAACGTA AAAATTGTGA AAGTAAAGAA
241 GGAATAATCT ACCCTGACTA AAGCTTGAAA TGCTACATTT CCAAGGTGAA GATGTGTGGG
301 CACATGTTAT GGCAGATTGA AAAGGATCTC ATTCCATGGG AAAAAAAAAA ATCCTGTCTT 361 GTTCATAAAT TGACAATGTC AATAAATTGA AATATGGTTC ACTGCCAAAA AAAAAAAAAA 421 AAAAAAAAAA AAAA
SEQ ID NO: 5 (PRAME, presented here as sequence for PRAME)
1 CGAGTTCCGG CGAGGCTTCA GGGTACAGCT CCCCCGCAGC CAGAAGCCGG GCCTGCAGCG
61 CCTCAGCACC GCTCCGGGAC ACCCCACCCG CTTCCCAGGC GTGACCTGTC AACAGCAACT
121 TCGCGGTGTG GTGAACTCTC TGAGGAAAAA CGTAAGTTCG AGCCCTGATT CCTCCGCTTC
181 CCCGCAGGGT GACCTTGGGC TTGTGCCCCC GGCACCACCC CTGTCCCGGG TCCCTGTTTT
241 CTCTCTGGAA ATGGGTTGAA GACCAAAGAA AATAATGTGC GCCACTTGGG TCACCCCGGG
301 CCGCCTGCCC CGGAAAATTC GCCCCAGTTG AGGAGTTGTG GCTGTAAGGA TGCCTTGAAC
361 CGAGGCGGCG GTGCTCGTGG TTGGAGCTCT CCAGGGTGGG TGCGCATTTG TAATGCGGTG
421 GATGCTCTGG GACTCGGCCC CTCTGAAGGT GCTGGGGGTT GGGGACGGCC CAGGCAGTGG
481 CGTAGGCGTC CTAGGAAGGC GGGAGCAGAG GCAGAAATGT CGCTGCAAGA CCGTAGTCAG
541 GGTCCTTGAC CACAGGGGTC ACTTGTGACC AACCACATGG TCTGTTGTTC CTCCTGCCCC
601 CTGGTTCAGC CCAGGAAACA CTGGTGCTCA GGTTTGGAGC CAGAGATTTG CACTGAAAGG
661 GCGGGATTGA GTCGCCAGTT GTCAGTTTCC TCAGCAGTAT TTGCGGAGGT TTTCACAGGA
721 GGCCGTTGCT TCGTAAATAT TATACATGTA TTCTTCTTTT TGGAGCATTT TGATTATTAC
781 TCTCAGACGT GCGTGGCAAC AAGTGACTGA GACCTAGAAA TCCAAGCGTT GGAGGTCCTG
841 AGGCCAGCCT AAGTCGCTTC AAAATGGAAC GAAGGCGTTT GTGGGGTTCC ATTCAGAGCC
901 GATACATCAG CATGAGTGTG TGGACAAGCC CACGGAGACT TGTGGAGCTG GCAGGGCAGA
961 GCCTGCTGAA GGATGAGGCC CTGGCCATTG CCGCCCTGGA GTTGCTGCCC AGGGAGCTCT
1021 TCCCGCCACT CTTCATGGCA GCCTTTGACG GGAGACACAG CCAGACCCTG AAGGCAATGG
1081 TGCAGGCCTG GCCCTTCACC TGCCTCCCTC TGGGAGTGCT GATGAAGGGA CAACATCTTC
1141 ACCTGGAGAC CTTCAAAGCT GTGCTTGATG GACTTGATGT GCTCCTTGCC CAGGAGGTTC
1201 GCCCCAGGAG GTGGAAACTT CAAGTGCTGG ATTTACGGAA GAACTCTCAT CAGGACTTCT
1261 GGACTGTATG GTCTGGAAAC AGGGCCAGTC TGTACTCATT TCCAGAGCCA GAAGCAGCTC
1321 AGCCCATGAC AAAGAAGCGA AAAGTAGATG GTTTGAGCAC AGAGGCAGAG CAGCCCTTCA 1381 TTCCAGTAGA GGTGCTCGTA GACCTGTTCC TCAAGGAAGG TGCCTGTGAT GAATTGTTCT 1441 CCTACCTCAT TGAGAAAGTG AAGCGAAAGA AAAATGTACT ACGCCTGTGC TGTAAGAAGC 1501 TGAAGATTTT TGCAATGCCC ATGCAGGATA TCAAGATGAT CCTGAAAATG GTGCAGCTGG 1561 ACTCTATTGA AGATTTGGAA GTGACTTGTA CCTGGAAGCT ACCCACCTTG GCGAAATTTT 1621 CTCCTTACCT GGGCCAGATG ATTAATCTGC GTAGACTCCT CCTCTCCCAC ATCCATGCAT 1681 CTTCCTACAT TTCCCCGGAG AAGGAAGAGC AGTATATCGC CCAGTTCACC TCTCAGTTCC 1741 TCAGTCTGCA GTGCCTGCAG GCTCTCTATG TGGACTCTTT ATTTTTCCTT AGAGGCCGCC 1801 TGGATCAGTT GCTCAGGCAC GTGATGAACC CCTTGGAAAC CCTCTCAATA ACTAACTGCC 1861 GGCTTTCGGA AGGGGATGTG ATGCATCTGT CCCAGAGTCC CAGCGTCAGT CAGCTAAGTG 1921 TCCTGAGTCT AAGTGGGGTC ATGCTGACCG ATGTAAGTCC CGAGCCCCTC CAAGCTCTGC 1981 TGGAGAGAGC CTCTGCCACC CTCCAGGACC TGGTCTTTGA TGAGTGTGGG ATCACGGATG 2041 ATCAGCTCCT TGCCCTCCTG CCTTCCCTGA GCCACTGCTC CCAGCTTACA ACCTTAAGCT 2101 TCTACGGGAA TTCCATCTCC ATATCTGCCT TGCAGAGTCT CCTGCAGCAC CTCATCGGGC 2161 TGAGCAATCT GACCCACGTG CTGTATCCTG TCCCCCTGGA GAGTTATGAG GACATCCATG 2221 GTACCCTCCA CCTGGAGAGG CTTGCCTATC TGCATGCCAG GCTCAGGGAG TTGCTGTGTG 2281 AGTTGGGGCG GCCCAGCATG GTCTGGCTTA GTGCCAACCC CTGTCCTCAC TGTGGGGACA 2341 GAACCTTCTA TGACCCGGAG CCCATCCTGT GCCCCTGTTT CATGCCTAAC TAGCTGGGTG 2401 CACATATCAA ATGCTTCATT CTGCATACTT GGACACTAAA GCCAGGATGT GCATGCATCT 2461 TGAAGCAACA AAGCAGCCAC AGTTTCAGAC AAATGTTCAG TGTGAGTGAG GAAAACATGT 2521 TCAGTGAGGA AAAAACATTC AGACAAATGT TCAGTGAGGA AAAAAAGGGG AAGTTGGGGA 2581 TAGGCAGATG TTGACTTGAG GAGTTAATGT GATCTTTGGG GAGATACATC TTATAGAGTT 2641 AGAAATAGAA TCTGAATTTC TAAAGGGAGA TTCTGGCTTG GGAAGTACAT GTAGGAGTTA 2701 ATCCCTGTGT AGACTGTTGT AAAGAAACTG TTGAAAATAA AGAGAAGCAA TGTGAAGCAA 2761 AAAAAAAAAA AAAAAA
SEQ ID NO: 6 (PAK3)
1 GAGGTAGAGG AAACGTCTTG ACGGGGTGGC TGGATCCGTG GCAGAATCCA GTTCCAGATT
61 CTAGACTTGA GGGTTCTGGG CTGTTGGTCT GTAGAAGCGA AGGAGAGAAG GACTCAAATC
121 CAGGCCAAGT GTATGGCTGT CTGAGGTATT GGAACAGAAG GAGGTCCATT CCTGTTGGTG 181 ACAACACCGT GGCCCTGTTC TGGGATGAGC AAGGTGTAAA GGTTTCCCCC AAGAAAGAGC
241 AGCTGAGTCC TTGCATCTTG TGGCAGCTGG TGTGCCCAGC ACTGAGTCTG TAGGAGCTGA
301 AGCCAGCCCG GACCCTTCTC ATGGGCAGTG CCCACCTGTG CTGAAGTCCT GCAGCGGTGG
361 CGGTGTGAGG AGCTGTGAAA TTAGTTGTAA CTGAAAATGT CTGACGGTCT GGATAATGAA 421 GAGAAACCCC CGGCTCCTCC ACTGAGGATG AATAGTAACA ACCGGGATTC TTCAGCACTC
481 AACCACAGCT CCAAACCACT TCCCATGGCC CCTGAAGAGA AGAATAAGAA AGCCAGGCTT 541 CGCTCTATCT TCCCAGGAGG AGGGGATAAA ACCAATAAGA AGAAGGAGAA AGAGCGCCCA 601 GAGATCTCTC TTCCTTCAGA CTTTGAGCAT ACGATTCATG TGGGGTTTGA TGCAGTCACC 661 GGGGAATTCA CTGGAATTCC AGAGCAATGG GCACGATTAC TCCAAACTTC CAACATAACA 721 AAATTGGAAC AGAAGAAGAA CCCACAAGCT GTTCTAGATG TTCTCAAATT CTATGATTCC 781 AAAGAAACAG TCAACAACCA GAAATACATG AGCTTTACAT CAGGAGATAA AAGTGCACAT 841 GGATACATAG CAGCCCATCC TTCGAGTACA AAAACAGCAT CTGAGCCTCC ATTGGCCCCT 901 CCTGTGTCTG AAGAAGAAGA TGAAGAGGAA GAAGAAGAAG AAGATGAAAA TGAGCCACCA 961 CCAGTTATCG CACCAAGACC AGAGCATACA AAATCAATCT ATACTCGTTC TGTGGTTGAA
1021 TCCATTGCTT CACCAGCAGT ACCAAATAAA GAGGTCACAC CACCCTCTGC TGAAAATGCC 1081 AATTCCAGTA CTTTGTACAG GAACACAGAT CGGCAAAGAA AAAAATCCAA GATGACAGAT 1141 GAGGAGATCT TAGAGAAGCT AAGAAGCATT GTGAGTGTTG GGGACCCAAA GAAAAAATAC 1201 ACAAGATTTG AAAAAATTGG TCAAGGGGCA TCAGGTACTG TTTATACAGC ACTAGACATT
1261 GCAACAGGAC AAGAGGTGGC CATAAAGCAG ATGAACCTTC AACAGCAACC CAAGAAGGAA
1321 TTAATTATTA ATGAAATTCT GGTCATGAGG GAAAATAAGA ACCCTAATAT TGTTAATTAT 1381 TTAGATAGCT ACTTGGTGGG TGATGAACTA TGGGTAGTCA TGGAATACTT GGCTGGTGGC 1441 TCTCTGACTG ATGTGGTCAC AGAGACCTGT ATGGATGAAG GACAGATAGC AGCTGTCTGC 1501 AGAGAGTGCC TGCAAGCTTT GGATTTCCTG CACTCAAACC AGGTGATCCA TAGAGATATA 1561 AAGAGTGACA ATATTCTTCT CGGGATGGAT GGCTCTGTTA AATTGACTGA CTTTGGGTTC 1621 TGTGCCCAGA TCACTCCTGA GCAAAGTAAA CGAAGCACTA TGGTGGGAAC CCCATATTGG 1681 ATGGCACCTG AGGTGGTGAC TCGAAAAGCT TATGGTCCGA AAGTTGATAT CTGGTCTCTT
1741 GGAATTATGG CAATTGAAAT GGTGGAAGGT GAACCCCCTT ACCTTAATGA AAATCCACTC
1801 AGGGCATTGT ATCTGATAGC CACTAATGGA ACTCCAGAGC TCCAGAATCC TGAGAGACTG 1861 TCAGCTGTAT TCCGTGACTT TTTAAATCGC TGTCTTGAGA TGGATGTGGA TAGGCGAGGA 1921 TCTGCCAAGG AGCTTTTGCA GCATCCATTT TTAAAATTAG CCAAGCCTCT CTCCAGCCTG 1981 ACTCCTCTGA TTATCGCTGC AAAGGAAGCA ATTAAGAACA GCAGCCGCTA AGACTGCAAG 2041 CCTTACACCT CACCATCTCC CTCATGAGTA AGACTGAAAT AAAACTCTGC TGCAGGAAAG 2101 ATGGAAGAAA AGACAGTCAA ATGGGGTGGG GGTTCTTTAC CTTTCAAATG AATAGAAACT 2161 TCTTATAAGC CTTTTTCCTA CTCCCTCAGA TTATGTAATT TATTTGTAAG CCTGAATCGC 2221 AGCCCAAACA GGGCAGCAAT GTTGAAGTGA CCATAAAGTG GTCACTTCCA CCGTGAAGCG 2281 AAAGAGCCAG TAGTGAATCC CCTCATTTTG TGCATTCACT TTGAAGAAAA AGGTTTCTCA 2341 AAGATGCACA CTCCCTCTTC ATAGTGTTGT GTTTGTTTTT AAGTTAGAGA GTAGTCCCTC 2401 TTGCATTCAA ACCTCCTTCA AAACTCCTTA CCCAATGTGA TGTTTTTCAC TTGCATTGTC 2461 ATTAGATGTC CAGAAAAAAA AAAGATGTCA AAATGTTTTT CTAAAAAAAG AAAGCA
SEQ ID NO: 7 (RPL23)
1 GGCCACGTGA GGAGGGTGGG CGGGGCGTTA AAGTTCATAT CCCAGTGTCC TTTGAATCGA
61 CTTCCTTTTT TCTTTTTTCC GGCGTTCAAG ATGTCGAAGC GAGGACGTGG TGGGTCCTCT
121 GGTGCGAAAT TCCGGATTTC CTTGGGTCTT CCGGTAGGAG CTGTAATCAA TTGTGCTGAC
181 AACACAGGAG CCAAAAACCT GTATATCATC TCCGTGAAGG GGATCAAGGG ACGGCTGAAC
241 AGACTTCCCG CTGCTGGTGT GGGTGACATG GTGATGGCCA CAGTCAAGAA AGGCAAACCA
301 GAGCTCAGAA AAAAGGTACA TCCAGCAGTG GTCATTCGAC AACGAAAGTC ATACCGTAGA
361 AAAGATGGCG TGTTTCTTTA TTTTGAAGAT AATGCAGGAG TCATAGTGAA CAATAAAGGC
421 GAGATGAAAG GTTCTGCCAT TACAGGACCA GTAGCAAAGG AGTGTGCAGA CTTGTGGCCC
481 CGGATTGCAT CCAATGCTGG CAGCATTGCA TGATTCTCCA GTATATTTGT AAAAAATAAA
541 AAAAAAAACT AAACCCATTA AAAAGTATTT GTTTGCAAAA AAAAAAAAAA AAAA
SEQ ID NO: 8 (EI24)
1 CCCCGCCTCG TGGTGCCGGC TGGTTCTTCG CGCTCGCCCG ACTTCCCAGC GGCCCCGTGC 61 GGCCCGGGCA TGCCCAGTGC GGGCGCAGCG GCCCCGGCCC TGGAAGCGCC CCGGCGGAGC 121 TGGCCTGCGG TGGGCTAGGG GCAGGGCCGG AGCCGCGGCG GCGGAGCTGT GGATCCTTCA 181 TGATGAGAGA TTTGGGGACA CTTCTCTCTC CTGTGTGTAG TTGATAGTTT GGTGGTGAAG 241 AGATGGCTGA CAGTGTCAAA ACCTTTCTCC AGGACCTTGC CAGAGGAATC AAAGACTCCA
301 TCTGGGGTAT TTGTACCATC TCAAAGCTAG ATGCTCGAAT CCAGCAAAAG AGAGAGGAGC
361 AGCGTCGAAG AAGGGCAAGT AGTGTCTTGG CACAGAGAAG AGCCCAGAGT ATAGAGCGGA
421 AGCAAGAGAG TGAGCCACGT ATTGTTAGTA GAATTTTCCA GTGTTGTGCT TGGAATGGTG
481 GAGTGTTCTG GTTCAGTCTC CTCTTGTTTT ATCGAGTATT TATTCCTGTG CTTCAGTCGG
541 TAACAGCCCG AATTATCGGT GACCCATCAC TACATGGAGA TGTTTGGTCG TGGCTGGAAT
601 TCTTCCTCAC GTCAATTTTC AGTGCTCTTT GGGTGCTCCC CTTGTTTGTG CTTAGCAAAG
661 TGGTGAATGC CATTTGGTTT CAGGATATAG CTGACCTGGC ATTTGAGGTA TCAGGGAGGA
721 AGCCTCACCC ATTCCCTAGT GTCAGCAAAA TAATTGCTGA CATGCTCTTC AACCTTTTGC
781 TGCAGGCTCT TTTCCTCATT CAGGGAATGT TTGTGAGTCT CTTTCCCATC CATCTTGTCG
841 GTCAGCTGGT TAGTCTCCTG CATATGTCCC TTCTCTACTC ACTGTACTGC TTTGAATATC
901 GTTGGTTCAA TAAAGGAATT GAAATGCACC AGCGGTTGTC TAACATAGAA AGGAATTGGC
961 CTTACTACTT TGGGTTTGGT TTGCCCTTGG CTTTTCTCAC AGCAATGCAG TCCTCATATA
1021 TTATCAGTGG CTGCCTTTTC TCTATCCTCT TTCCTTTATT CATTATCAGC GCCAATGAAG
1081 CAAAGACCCC TGGCAAAGCA TATCTCTTCC AGTTGCGCCT CTTCTCCTTG GTGGTCTTCT
1141 TAAGCAACAG ACTCTTCCAC AAGACAGTCT ACCTGCAGTC GGCCCTGAGC AGCTCTACTT
1201 CTGCAGAGAA GTTCCCTTCA CCGCATCCGT CGCCTGCCAA ACTGAAGGCT ACTGCAGGTC
1261 ACTGAGTTGC CTGCCATCCA AAGGGGATGG GCGGGATTGG AAGAAGCTGT GGCAGCTCTT
1321 TTCCCTGTTC ACCTCCCGCC TGCCAGGGAA GGCAGGACCC GCTCTGCCAA GGGCCCTCTG 1381 CGTATTCCCT TCTCTCTGAG GAATTGAAAT TTTTGTCTCT GGTGCACGTA AGGCAGAATG 1441 TTCCCTGACA CCAGTGTGTG GATTTTTAAC ATCACCGTGA GTCTGAAAGG ACCACAGGTT 1501 TTTCTGCAGC TATTTTCTAG CATTTGCCAG TCCCTGTGCC TGGACTGATT GGAACACTTT
1561 GTTTTTCTCC CTGTGCCATT TACCCTTCCA CCTTTCCATC CTGCCTTCTA CCACCCTTGG 1621 ATGAATGGAT TTTGTAATTC TAGCTGTTGT ATTTTGTGAA TTTGTTAATT TTGTTGTTTT
1681 TCTGTGAAAC ACATACATTG GATATGGGAG GTAAAGGAGT GTCCCAGTTG CTCCTGGTCA
1741 CTCCCTTTAT AGCCATTACT GTCTTGTTTC TTGTAACTCA GGTTAGGTTT TGGTCTCTCT
1801 TGCTCCACTG CAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAGCCTGA AGAGATGAGA
1861 TAGGAGGAAA GACCTCACAG CCAGATCTGC TGGGTTTTGA GGAGTGATTT TCTTTCTTCC
1921 CCTTGAAGGG GAAAAAGCTA TTTTCATTGG TACATTTAAA GTCCCCCAAC TATGGGGAGG
1981 TACCAATTCT GGACAAGTGC CACTACAACA ACACTAAACC TGAACTTTTC AACTCCGTTG
2041 GTGGTGGGAG GCAGCGGGCA GAAATTTACT GTTGGCCACT GCCAGGTCTA TTTCATATTT
2101 CAAAGGAATA TTGGGTGCTG CATATAGGAA CTGAAGGGGT CAATGTATTA AACCTGTGAT
2161 TGGTTGTTTT CCTGTCATTT TGAGAGACTA AATGTGGGGG GCAGATGTCA AAATACCTGT
2221 ACAATTTTAA AATGTCACAA TTAAACATGA GCTGGTTTCC CAAAAAAAAA AAAAAAAAA
SEQ ID NO: 9 (TGFB3)
1 CCTGTTTAGA CACATGGACA ACAATCCCAG CGCTACAAGG CACACAGTCC GCTTCTTCGT
61 CCTCAGGGTT GCCAGCGCTT CCTGGAAGTC CTGAAGCTCT CGCAGTGCAG TGAGTTCATG
121 CACCTTCTTG CCAAGCCTCA GTCTTTGGGA TCTGGGGAGG CCGCCTGGTT TTCCTCCCTC
181 CTTCTGCACG TCTGCTGGGG TCTCTTCCTC TCCAGGCCTT GCCGTCCCCC TGGCCTCTCT
241 TCCCAGCTCA CACATGAAGA TGCACTTGCA AAGGGCTCTG GTGGTCCTGG CCCTGCTGAA
301 CTTTGCCACG GTCAGCCTCT CTCTGTCCAC TTGCACCACC TTGGACTTCG GCCACATCAA
361 GAAGAAGAGG GTGGAAGCCA TTAGGGGACA GATCTTGAGC AAGCTCAGGC TCACCAGCCC
421 CCCTGAGCCA ACGGTGATGA CCCACGTCCC CTATCAGGTC CTGGCCCTTT ACAACAGCAC
481 CCGGGAGCTG CTGGAGGAGA TGCATGGGGA GAGGGAGGAA GGCTGCACCC AGGAAAACAC
541 CGAGTCGGAA TACTATGCCA AAGAAATCCA TAAATTCGAC ATGATCCAGG GGCTGGCGGA
601 GCACAACGAA CTGGCTGTCT GCCCTAAAGG AATTACCTCC AAGGTTTTCC GCTTCAATGT
661 GTCCTCAGTG GAGAAAAATA GAACCAACCT ATTCCGAGCA GAATTCCGGG TCTTGCGGGT
721 GCCCAACCCC AGCTCTAAGC GGAATGAGCA GAGGATCGAG CTCTTCCAGA TCCTTCGGCC
781 AGATGAGCAC ATTGCCAAAC AGCGCTATAT CGGTGGCAAG AATCTGCCCA CACGGGGCAC
841 TGCCGAGTGG CTGTCCTTTG ATGTCACTGA CACTGTGCGT GAGTGGCTGT TGAGAAGAGA
901 GTCCAACTTA GGTCTAGAAA TCAGCATTCA CTGTCCATGT CACACCTTTC AGCCCAATGG
961 AGATATCCTG GAAAACATTC ACGAGGTGAT GGAAATCAAA TTCAAAGGCG TGGACAATGA
1021 GGATGACCAT GGCCGTGGAG ATCTGGGGCG CCTCAAGAAG CAGAAGGATC ACCACAACCC
1081 TCATCTAATC CTCATGATGA TTCCCCCACA CCGGCTCGAC AACCCGGGCC AGGGGGGTCA 1141 GAGGAAGAAG CGGGCTTTGG ACACCAATTA CTGCTTCCGC AACTTGGAGG AGAACTGCTG
1201 TGTGCGCCCC CTCTACATTG ACTTCCGACA GGATCTGGGC TGGAAGTGGG TCCATGAACC
1261 TAAGGGCTAC TATGCCAACT TCTGCTCAGG CCCTTGCCCA TACCTCCGCA GTGCAGACAC 1321 AACCCACAGC ACGGTGCTGG GACTGTACAA CACTCTGAAC CCTGAAGCAT CTGCCTCGCC
1381 TTGCTGCGTG CCCCAGGACC TGGAGCCCCT GACCATCCTG TACTATGTTG GGAGGACCCC 1441 CAAAGTGGAG CAGCTCTCCA ACATGGTGGT GAAGTCTTGT AAATGTAGCT GAGACCCCAC
1501 GTGCGACAGA GAGAGGGGAG AGAGAACCAC CACTGCCTGA CTGCCCGCTC CTCGGGAAAC 1561 ACACAAGCAA CAAACCTCAC TGAGAGGCCT GGAGCCCACA ACCTTCGGCT CCGGGCAAAT 1621 GGCTGAGATG GAGGTTTCCT TTTGGAACAT TTCTTTCTTG CTGGCTCTGA GAATCACGGT
1681 GGTAAAGAAA GTGTGGGTTT GGTTAGAGGA AGGCTGAACT CTTCAGAACA CACAGACTTT 1741 CTGTGACGCA GACAGAGGGG ATGGGGATAG AGGAAAGGGA TGGTAAGTTG AGATGTTGTG 1801 TGGCAATGGG ATTTGGGCTA CCCTAAAGGG AGAAGGAAGG GCAGAGAATG GCTGGGTCAG 1861 GGCCAGACTG GAAGACACTT CAGATCTGAG GTTGGATTTG CTCATTGCTG TACCACATCT 1921 GCTCTAGGGA ATCTGGATTA TGTTATACAA GGCAAGCATT TTTTTTTTTA AAGACAGGTT 1981 ACGAAGACAA AGTCCCAGAA TTGTATCTCA TACTGTCTGG GATTAAGGGC AAATCTATTA 2041 CTTTTGCAAA CTGTCCTCTA CATCAATTAA CATCGTGGGT CACTACAGGG AGAAAATCCA 2101 GGTCATGCAG TTCCTGGCCC ATCAACTGTA TTGGGCCTTT TGGATATGCT GAACGCAGAA 2161 GAAAGGGTGG AAATCAACCC TCTCCTGTCT GCCCTCTGGG TCCCTCCTCT CACCTCTCCC 2221 TCGATCATAT TTCCCCTTGG ACACTTGGTT AGACGCCTTC CAGGTCAGGA TGCACATTTC 2281 TGGATTGTGG TTCCATGCAG CCTTGGGGCA TTATGGGTCT TCCCCCACTT CCCCTCCAAG 2341 ACCCTGTGTT CATTTGGTGT TCCTGGAAGC AGGTGCTACA ACATGTGAGG CATTCGGGGA 2401 AGCTGCACAT GTGCCACACA GTGACTTGGC CCCAGACGCA TAGACTGAGG TATAAAGACA 2461 AGTATGAATA TTACTCTCAA AATCTTTGTA TAAATAAATA TTTTTGGGGC ATCCTGGATG 2521 ATTTCATCTT CTGGAATATT GTTTCTAGAA CAGTAAAAGC CTTATTCTAA GGTG
SEQ ID NO: 10 (RBM34)
1 AGCTGCAGTC TGGGAGTCTT TGGAGTAAGA ATGGCCTTGG AAGGGATGAG CAAACGGAAG
61 AGAAAGAGAA GTGTCCAGGA GGGAGAGAAT CCTGACGACG GCGTTCGCGG GAGTCCGCCG
121 GAAGACTACA GGCTTGGACA GGTCGCCAGT AGCTTATTTC GCGGCGAACA CCATTCCAGA
181 GGTGGCACCG GTCGGCTGGC GTCCCTCTTC AGTTCTCTGG AGCCCCAGAT TCAACCCGTG
241 TACGTGCCTG TGCCTAAACA AACCATCAAA AAAACGAAAC GGAATGAGGA GGAAGAAAGT
301 ACATCCCAGA TTGAAAGACC ACTTTCGCAA GAACCTGCCA AAAAAGTGAA AGCGAAGAAG
361 AAACACACTA ACGCAGAAAA AAAGTTGGCA GACAGGGAAA GCGCTCTAGC GAGTGCTGAT
421 TTAGAAGAAG AAATTCACCA GAAACAAGGG CAGAAAAGGA AAAATTCTCA ACCTGGTGTT
481 AAAGTAGCAG ATAGAAAAAT ACTTGATGAC ACAGAAGACA CAGTTGTCAG TCAAAGAAAG
541 AAAATTCAAA TCAACCAAGA AGAAGAGAGA TTAAAGAATG AGAGAACTGT GTTTGTTGGG
601 AATTTGCCTG TTACATGTAA TAAGAAGAAG CTGAAGTCGT TTTTTAAAGA GTATGGACAA
661 ATAGAATCTG TACGATTTCG TTCTCTGATT CCAGCAGAGG GAACGCTATC CAAAAAGTTG
721 GCAGCAATAA AACGTAAAAT TCATCCTGAT CAGAAAAATA TTAATGCCTA TGTTGTGTTT
781 AAGGAGGAGA GTGCTGCCAC GCAAGCATTG AAAAGAAATG GGGCCCAGAT TGCAGATGGA
841 TTTCGTATTA GAGTTGATCT CGCATCTGAG ACCTCATCTA GAGACAAGAG ATCGGTTTTT
901 GTGGGGAATC TCCCTTATAA AGTTGAAGAA TCTGCCATTG AGAAGCACTT TCTGGACTGT
961 GGAAGTATCA TGGCCGTGAG GATTGTGAGA GACAAAATGA CAGGCATCGG CAAAGGGTTT
1021 GGCTATGTGC TCTTTGAGAA TACAGATTCT GTTCATCTTG CTCTGAAATT AAATAATTCT
1081 GAACTCATGG GGAGAAAACT CAGAGTCATG CGTTCTGTTA ATAAAGAAAA ATTTAAACAA
1141 CAAAATTCAA ATCCACGATT GAAGAATGTC AGTAAACCTA AGCAGGGACT TAATTTTACT
1201 TCCAAAACTG CAGAAGGACA TCCTAAAAGC TTATTTATTG GAGAAAAAGC TGTTCTCCTT
1261 AAAACGAAGA AGAAAGGACA GAAGAAAAGT GGACGCCCTA AGAAACAGAG AAAACAGAAA
1321 TAACAACCAG GAACTGCTTT TTCTTTTCCT GCTGAGTACT GCTAATAAAA GTGCTATTAT 1381 CTGCTGATAG CATCGTCTGC TAAAAAAAAA AAAAAAAAAA AAAAAAAAAA A
SEQ ID NO: 11 (PCOLN3)
1 CCGGAAGTGG GGCGGCGACC CCGGAAGTCC CCGCCGGGTG CAGCTTGGTC GGTTCGATCG
61 CCGCCGGGAC CTGACACCGC CCGGAGTTGG CGTCCCTTCT CCCTCTCCGA GTGCTGCTCC
121 TGTCATTGTG GCCATGGACG ATACCCTGTT CCAGTTGAAG TTCACGGCGA AGCAGCTGGA 181 GAAGCTGGCC AAGAAGGCGG AGAAGGACTC CAAGGCGGAG CAGGCCAAAG TGAAGAAGGC
241 CCTTCTGCAG AAAAATGTAG AGTGTGCCCG TGTGTATGCC GAGAACGCCA TCCGCAAGAA
301 GAACGAAGGT GTGAACTGGC TTCGGATGGC GTCCCGCGTA GACGCAGTGG CCTCCAAGGT
361 GCAGACAGCT GTGACTATGA AGGGGGTGAC CAAGAATATG GCCCAGGTGA CCAAAGCCCT
421 GGACAAGGCC CTGAGCACCA TGGACCTGCA GAAGGTCTCC TCAGTGATGG ACAGGTTCGA
481 GCAGCAGGTG CAGAACCTGG ACGTCCATAC ATCGGTGATG GAGGACTCCA TGAGCTCGGC
541 CACCACCCTG ACCACGCCGC AGGAGCAGGT GGACAGCCTC ATCATGCAGA TCGCCGAGGA
601 GAATGGCCTG GAGGTGCTGG ACCAGCTCAG CCAGCTGCCC GAGGGCGCCT CTGCCGTGGG
661 CGAGAGCTCT GTGCGCAGCC AGGAGGACCA GCTGTCACGG AGGTTGGCCG CCTTGAGGAA
721 CTAGCCGTGC CCCGCCGGTG TGCACCGCCT CTGCCCCGTG ATGTGCTGGA AGGCTCCTGT
781 CCTCTCCCCA CCGCGTCTTG CCTTTGTGCT GACCCCGCGG GGCTGCGGCC GGCAGCCACT
841 CTGCGTCTCT CACCTGCCAG GCCTGCGTGG CCTTAGGGTT GTTCCTGTTC TTTTAGGTTG
901 GGCGGTGGGT CTGTGTCCTG GTGTTGAGTT TCTGCAAATT TCTGGGGGTG ATTTCTGTGA
961 CTCTGGGCCC ACAGCGGGGA GGCCAAGAGG GGCCCTGTGG ACTTTCACCC AGCACTGTGG
1021 GGGCCTTCAG ACTCTGGGGC AGCAGACATG CTGCTTCCCA TCAGCCAGAG GGGGTCAGGG
1081 CTGCCCTGTT GCCAAACAAC TCCCTGAGGC CTCTCCGCAC CACCTCAGCG GGCAGGAGGT
1141 CCCACCATGT GGACAGACAT AGCCCAAGGA GGCACCACAG GTCTATGTGT GCTGGGGGAT
1201 GTCAGGTGCC ACCCAACGCT GTCCTGGTGG TATTTACAAT GACATCCTCC TCCTCCATCA
1261 CTCCAGGGGT GGTGTCTCGG CCGCCCCTAC CAGCTGGCTG AGCCCCCTGG CCTCCTGCGC
1321 TCCCTCACTT CCCTCAGTTC CCAAAGCTGC CCAGTCCATG GGGACAGAAC CGTCACTCAG 1381 ATCCACATTC AAGTGTGCCC ACCCTGCAGT CTTCATCCTC ACTCAGCTGC TGCCTCTGGA 1441 GGTGCCTTTG GCCACATGTG CTGTGCTGTT TGTCTCCTCG ACAGGGAGCC TGTCCACCAG 1501 CAGGCTGCGG TCCCAGCGGG TGCGTCTGCA GCTCCTCCCC TTGGGCAGCC TGGTTCTCCC 1561 GGAGGACCTT TCCTTGGGGC CCTGCTTCAT GACGATGCTG CCTGTGTCAC CCTCTACCAT 1621 CTGTAAACAA CTGGGTGCCT TCCCCGACCA CACCCCAATG CCTTCCCAGC TTGGAAGCCA 1681 AGGCAGCTGA TGAAGGGAGC TCAGGAGAGC CGTCTTCAGC TGGGCTGGGG TTGGGGCTGC 1741 TGTGAGGAAA ACCTGCCATT GTGGCCCTGG AGAGTCACCA GCAGCTCTTG GGAAGGACTT 1801 GCTGGGAGGC TGAGAGAGGC TTTGGGCACA GCCTGCTGTC TTTTCCATTT CCTAAAGTTT 1861 ACTTCATTGT CTTGAGGCTT CCAGGTTTTG TTTTTGTTTT TGCCAAAGTA GAAAAGGCAG 1921 GTGGTGGGCG GCTGGCAGGG AGTGCGGGTC CCCGCCCCTC TTCAGTCCTG CCCTCCCCTC 1981 CTCAGTCCTG CCCACCCCGT GCAGCCCATG CTGAGGCTGC AGTGGTGTCG TGGGTGTTAC 2041 GTGCAGGAAC GTGGAGACCC TGACGTGGGC TCACTGCGTT TGGTTTTCTT TTCAGAACTT 2101 GGGAGCCCCC AGGGAGGGGC TAGTGTTGGT AGGTCCTAGA CGTGGTTCCC TCCAGCCTCC 2161 CCAAAATCAA CCCTGGTGTT GAGAGAACGT CCTTCTGTCC ATCGTGGGTA ACAGCCTTGG 2221 GGAGGGTGCA GAGCTCTGCA GAGCCATGGG CCAGGTGGGG CTGCCTCAGT CCTGTCCCCT 2281 TGGGCACTGA GGAGAGGGGC CCATTCACCT TTCTCCTAGA ATGCTGTTGT AAATAAACAA 2341 ATGGATCCCT GGAAAAAAAA AAAAAAAAAA AAA
SEQ ID NO: 12 (FUT7)
1 CCAGGCTCCA GGCCCTGGCT TAGAGGGAGC GGGGAGTCTG GACTTCAGGC TGGATCCCTT
61 CCTCTTCCTG GAGCGGGTGC TGGCCCCCAA CCCGCTTGCG TCAGGGACAA AAGGACTCCT
121 TCCCTTTCCA GCCTGGAAAG CCCCTCTGCT GCAGGCTGGA GGAAGGGACC CTGGGCCCAG
181 CCTATAGTCA GCGGTGTCTA TGGGCATGGA TCTGGACGGG GAAAAGGACA AAGCAGCCTC
241 CATCCACAGT TCATTCCGGG ACCAGGCCCT TGCAGGCACG CGCTGGGCTC CTGTGGGAAG
301 ACACTAAGGG CCCCAGGACA GACCTCCTCT CCGGGCATCT GGGTTCCTAG ATGGCAGAGG
361 TGGCAGAGTG GGGTGGGATG GCCCAATTGG GAGCTTTAGC TTCCGGCAAA GAGCTGAGCA 421 CAGTACATCT TCATTGTGTA AGATTCTCCT GGGAGACCAG GGCCCAGCTG GTGGTGAGCT
481 GGGGGAAGTG GGTGATACTG CCGTGGGAGG AGCCACCTGG CCCTCTGGGG AAGTGCACTC
541 GCTGTCTGCA GCGCCCAGGC CTGGGTAGCT GGGTGGGGGC TGGGGGGCCA TCTGTGCTCA 601 GGGTGCCTGC ACCTGGGCCT TCTCTGCCCT GGGCCAAGCC TGCCCGAGCC TCTCTGTCCT 661 CTGCCTGCCC AGCTGGACAT CTCTGGGCCT CTCTGGAGAC CAGTGGGGTG GGCTGTGGGG 721 GCGTCATATT GCCCTGGCTT GGCATCCCTC TTGTGGCTGT ACCCCTCCCA GCAGCCCCAG 781 GACTAGCAAG TCCCCGAGAT GGGGGTGGGG ACAGTGGTTG ATGCCAAAGG TTGTGGGGGC 841 AGGGGCGGGG CAGGAGCAGG AAGGTCCCCT GAGTTCCCTC ACCTTGGGCA GAGATAAAAG 901 GAGCACAGTT CCAGGCGGGG CTGAGCTAGG GCGTAGCTGT GATTTCAGGG GCACCTCTGG 961 CGGCTGCCGT GATTTGAGAA TCTCGGGTCT CTTGGCTGAC TGATCCTGGG AGACTGTGGA 1021 TGAATAATGC TGGGCACGGC CCCACCCGGA GGCTGCGAGG CTTGGGGGTC CTGGCCGGGG 1081 TGGCTCTGCT CGCTGCCCTC TGGCTCCTGT GGCTGCTGGG GTCAGCCCCT CGGGGTACCC 1141 CGGCACCCCA GCCCACGATC ACCATCCTTG TCTGGCACTG GCCCTTCACT GACCAGCCCC 1201 CAGAGCTGCC CAGCGACACC TGCACCCGCT ACGGCATCGC CCGCTGCCAC CTGAGTGCCA
1261 ACCGAAGCCT GCTGGCCAGC GCCGACGCCG TGGTCTTCCA CCACCGCGAG CTGCAGACCC
1321 GGCGGTCCCA CCTGCCCCTG GCCCAGCGGC CGCGAGGGCA GCCCTGGGTG TGGGCCTCCA
1381 TGGAGTCTCC TAGCCACACC CACGGCCTCA GCCACCTCCG AGGCATCTTC AACTGGGTGC
1441 TGAGCTACCG GCGCGACTCG GACATCTTTG TGCCCTATGG CCGCCTGGAG CCCCACTGGG 1501 GGCCCTCGCC ACCGCTGCCA GCCAAGAGCA GGGTGGCCGC CTGGGTGGTC AGCAACTTCC 1561 AGGAGCGGCA GCTGCGTGCC AGGCTGTACC GGCAGCTGGC GCCTCATCTG CGGGTGGATG 1621 TCTTTGGCCG TGCCAATGGA CGGCCACTGT GCGCCAGCTG CCTGGTGCCC ACCGTGGCCC 1681 AGTACCGCTT CTACCTGTCC TTTGAGAACT CTCAGCACCG CGACTACATT ACGGAGAAAT
1741 TCTGGCGCAA CGCACTGGTG GCTGGCACTG TGCCAGTGGT GCTGGGGCCC CCACGGGCCA
1801 CCTATGAGGC CTTCGTGCCG GCTGACGCCT TCGTGCATGT GGATGACTTT GGCTCAGCCC 1861 GAGAGCTGGC GGCTTTCCTC ACTGGCATGA ATGAGAGCCG ATACCAACGC TTCTTTGCCT 1921 GGCGTGACAG GCTCCGCGTG CGACTGTTCA CCGACTGGCG GGAACGTTTC TGTGCCATCT 1981 GTGACCGCTA CCCACACCTA CCCCGCAGCC AAGTCTATGA GGACCTTGAG GGTTGGTTTC 2041 AGGCCTGAGA TCCGCTGGCC GGGGGAGGTG GGTGTGGGTG GAAGGGCTGG GTGTCGAAAT 2101 CAAACCACCA GGCATCCGGC CCTTACCGGC AAGCAGCGGG CTAACGGGAG GCTGGGCACA 2161 GAGGTCAGGA AGCAGGGGTG GGGGGTGCAG GTGGGCACTG GAGCATGCAG AGGAGGTGAG 2221 AGTGGGAGGG AGGTAACGGG TGCCTGCTGC GGCAGACGGG AGGGGAAAGG CTGCCGAGGA
2281 CCCTCCCCAC CCTGAACAAA TCTTGGGTGG GTGAAGGCCT GGCTGGAAGA GGGTGAAAGG
2341 CAGGGCCCTT GGGGCTGGGG GGCACCCCAG CCTGAAGTTT GTGGGGGCCA AACCTGGGAC
2401 CCCGAGCTTC CTCGGTAGCA GAGGCCCTGT GGTCCCCGAG ACACAGGCAC GGGTCCCTGC
2461 CACGTCCATA GTTCTGAGGT CCCTGTGTGT AGGCTGGGGC GGGGCCCAGG AGACCACGGG
2521 GAGCAAACCA GCTTGTTCTG GGCTCAGGGA GGGAGGGCGG TGGACAATAA ACGTCTGAGC 2581 AGTGAAAAAA AAAAAAAA
SEQ ID NO: 13 (RICS Rho)
1 AGGAGTCATC CATCAACACT CCTGCTGTCG GTGCTGCCCA TGTTATCAAG AGGTACACTG
61 CTCGGGCCCC TGACGAACTG ACCTTAGAGG TGGGAGACAT TGTTTCTGTT ATTGACATGC
121 CCCCGAAAGT GTTAAGCACA TGGTGGAGAG GCAAGCACGG ATTCCAGGTG GGACTCTTCC
181 CTGGACACTG TGTTGAGTTA ATTAACCAAA AAGTTCCCCA GTCAGTGACC AACTCAGTGC
241 CAAAACCAGT GTCTAAAAAG CACGGCAAGC TCATTACGTT CTTACGAACA TTCATGAAGT
301 CTCGTCCAAC AAAACAGAAG CTGAAGCAGC GGGGAATCTT GAAAGAGAGG GTGTTTGGTT
361 GTGACCTGGG GGAGCACCTT CTAAATTCTG GTTTTGAAGT GCCGCAGGTT CTTCAAAGCT
421 GCACAGCATT CATTGAGAGA TATGGCATCG TGGATGGAAT CTATCGCCTT TCTGGTGTTG
481 CCTCCAATAT CCAGAGACTA CGCCATGAAT TTGACTCTGA GCACGTCCCC GACCTGACGA
541 AAGAACCGTA TGTTCAGGAC ATCCATTCTG TGGGTTCCCT ATGTAAGCTG TACTTCCGGG
601 AACTCCCAAA CCCTCTGCTT ACCTACCAGC TGTATGAGAA ATTTTCTGAT GCAGTTTCAG
661 CAGCAACAGA TGAAGAAAGG CTGATAAAAA TCCACGATGT CATCCAGCAG CTCCCCCCAC
721 CACACTACAG AACACTGGAG TTCCTGATGA GACACTTGTC TCTTCTAGCT GACTATTGTT
781 CCATCACAAA TATGCATGCA AAAAATCTAG CAATTGTTTG GGCTCCAAAC CTGTTAAGAT
841 CAAAACAGAT AGAATCTGCC TGCTTCAGTG GAACAGCAGC TTTCATGGAA GTGAGGATTC
901 AGTCTGTGGT TGTTGAGTTC ATCCTGAATC ACGTTGATGT GCTGCTGCCA CACTTCAGCG
961 CACGCACAGA ACTAATCGTC CCCTTCCCCC TCCGCCTTCT CAGAAAACAG TTTACTCCTC
1021 CTTTGCTAGG CCCGATGTCA CCACTGAACC CTTTGGTCCA GATAACTGTT TGCATTTCAA
1081 TATGACTCCA AACTGCCAGT ACCGTCCCCA GAGTGTACCT CCCCATCACA ATAAATTGGA
1141 GCAGCACCAA GTGTATGGTG CCAGGTCAGA GCCACCAGCC TCCATGGGTC TTCGTTATAA
1201 CACATATGTG GCCCCAGGAA GAAACGCATC TGGACACCAC TCCAAGCCAT GCAGCCGGGT
1261 CGAGTATGTG TCTTCTTTGA GCTCCTCTGT CAGGAATACC TGTTACCCCG AAGACATTCC 1321 ACCGTACCCT ACCATCCGGA GAGTGCAGTC TCTCCATGCT CCGCCGTCTT CCATGATTCG
1381 CTCTGTTCCC ATTTCACGGA CAGAAGTTCC CCCAGATGAT GAGCCAGCCT ACTGCCCAAG 1441 ACCTCTGTAC CAATATAAGC CATATCAGTC CTCCCAGGCC CGCTCAGATT ATCATGTCAC 1501 TCAGCTTCAG CCTTACTTTG AGAATGGCCG GGTCCACTAC AGGTATAGCC CATATTCCAG 1561 TTCTTCTAGT TCCTATTACA GTCCAGATGG GGCCCTGTGT GATGTGGATG CCTATGGCAC 1621 AGTCCAGTTG AGACCCCTTC ACCGCCTTCC CAATCGAGAC TTTGCTTTCT ACAATCCTAG 1681 GCTGCAAGGA AAGAGCTTGT ACAGTTATGC TGGTTTGGCT CCACGTCCCC GGGCCAACGT 1741 GACTGGCTAT TTCTCTCCCA ACGACCATAA TGTAGTCAGC ATGCCTCCGG CTGCTGATGT 1801 GAAGCACACC TACACCTCAT GGGATCTTGA GGACATGGAA AAATACCGCA TGCAGTCCAT 1861 CCGGAGAGAG AGCCGTGCTC GGCAGAAGGT GAAAGGGCCT GTCATGTCCC AATATGATAA 1921 CATGACCCCG GCGGTGCAGG ACGACTTGGG TGGGATCTAT GTCATCCATC TGCGTAGTAA 1981 ATCAGATCCT GGGAAAACTG GACTTCTCTC AGTGGCAGAA GGAAAGGAGA GCCGCCATGC 2041 AGCCAAGGCC ATCAGTCCCG AGGGAGAGGA CCGCTTCTAT AGGAGGCATC CCGAGGCAGA 2101 GATGGACAGA GCCCACCATC ACGGAGGCCA TGGTAGCACG CAGCCGGAGA AGCCATCCCT 2161 GCCTCAGAAG CAGAGCAGCC TGAGGAGCAG GAAGCTTCCT GACATGGGCT GCAGTCTTCC 2221 TGAGCACAGG GCACACCAAG AAGCAAGCCA TAGGCAGTTC TGTGAGTCAA AGAATGGGCC 2281 CCCTTATCCC CAGGGAGCTG GCCAGTTAGA TTATGGGTCC AAAGGGATTC CAGACACTTC 2341 TGAGCCAGTC AGCTACCACA ACTCTGGAGT AAAATATGCT GCATCCGGGC AAGAATCTTT 2401 AAGACTGAAC CACAAAGAGG TAAGGCTCTC CAAAGAGATG GAGCGACCCT GGGTTAGGCA 2461 GCCTTCTGCC CCAGAGAAAC ACTCCAGAGA CTGCTACAAG GAGGAAGAAC ACCTCACTCA 2521 GTCAATCGTC CCACCCCCTA AACCAGAGAG GAGTCATAGC CTCAAACTCC ATCATACCCA 2581 GAACGTGGAG AGGGACCCCA GTGTGCTGTA CCAGTACCAA CCACACGGCA AGCGCCAGAG 2641 CAGTGTGACT GTTGTGTCCC AGTATGATAA CCTGGAAGAT TACCACTCCC TGCCTCAGCA 2701 CCAGCGAGGA GTCTTTGGAG GGGGCGGCAT GGGGACGTAT GTGCCCCCTG GCTTTCCCCA 2761 TCCACAGAGC AGGACCTATG CTACAGCGTT GGGTCAAGGG GCCTTCCTGC CCGCAGAGTT 2821 GTCCTTGCAG CATCCTGAAA CACAGATCCA TGCAGAATGA GCCCTGCGAG CAATAGAGTT 2881 GAAGCAGCCT CTGCTGGACA GTGGACTGTT CTATTTTTTT CAATAACCAA AAAAAAAAAA 2941 AAAAAAA
SEQ ID NO: 14 (MAP4K4)
1 AATTCGAGGA TCCGGGTACC ATGGCACAGA GCGACAGAGA CATTTATTGT TATTTGTTTT
61 TTGGTGGCAA AAAGGGAAAA TGGCGAACGA CTCCCCTGCA AAAAGTCTGG TGGACATCGA
121 CCTCTCCTCC CTGCGGGATC CTGCTGGGAT TTTTGAGCTG GTGGAAGTGG TTGGAAATGG
181 CACCTATGGA CAAGTCTATA AGGGTCGACA TGTTAAAACG GGTCAGTTGG CAGCCATCAA
241 [sequence row garbled in source]
301 GAAATACTCT CATCACAGAA ACATTGCAAC ATATTATGGT GCTTTCATCA AAAAGAGCCC
361 TCCAGGACAT GATGACCAAC TCTGGCTTGT TATGGAGTTC TGTGGGGCTG GGTCCATTAC
421 AGACCTTGTG AAGAACACCA AAGGGAACAC ACTCAAAGAA GACTGGATCG CTTACATCTC
481 CAGAGAAATC CTGAGGGGAC TGGCACATCT TCACATTCAT CATGTGATTC ACCGGGATAT
541 CAAGGGCCAG AATGTGTTGC TGACTGAGAA TGCAGAGGTG AAACTTGTTG ACTTTGGTGT
601 GAGTGCTCAG CTGGACAGGA CTGTGGGGCG GAGAAATACG TTCATAGGCA CTCCCTACTG
661 GATGGCTCCT GAGGTCATCG CCTGTGATGA GAACCCAGAT GCCACCTATG ATTACAGAAG
721 TGATCTTTGG TCTTGTGGCA TTACAGCCAT TGAGATGGCA GAAGGTGCTC CCCCTCTCTG
781 TGACATGCAT CCAATGAGAG CACTGTTTCT CATTCCCAGA AACCCTCCTC CCCGGCTGAA
841 GTCAAAAAAA TGGTCGAAGA AGTTTTTTAG TTTTATAGAA GGGTGCCTGG TGAAGAATTA
901 CATGCAGCGG CCCTCTACAG AGCAGCTTTT GAAACATCCT TTTATAAGGG ATCAGCCAAA
961 TGAAAGGCAA GTTAGAATCC AGCTTAAGGA TCATATAGAT CGTACCAGGA AGAAGAGAGG
1021 CGAGAAAGAT GAAACTGAGT ATGAGTACAG TGGGAGTGAG GAAGAAGAGG AGGAAGTGCC
1081 TGAACAGGAA GGAGAGCCAA GTTCCATTGT GAACGTGCCT GGTGAGTCTA CTCTTCGCCG
1141 AGATTTCCTG AGACTGCAGC AGGAGAACAA GGAACGTTCC GAGGCTCTTC GGAGACAACA
1201 GTTACTACAG GAGCAACAGC TCCGGGAGCA GGAAGAATAT AAAAGGCAAC TGCTGGCAGA
1261 GAGACAGAAG CGGATTGAGC AGCAGAAAGA ACAGAGGCGA CGGCTAGAAG AGCAACAAAG
1321 GAGAGAGCGG GAGGCTAGAA GGCAGCAGGA ACGTGAACAG CGAAGGAGAG AACAAGAAGA 1381 AAAGAGGCGT CTAGAGGAGT TGGAGAGAAG GCGCAAAGAA GAAGAGGAGA GGAGACGGGC
1441 AGAAGAAGAA AAGAGGAGAG TTGAAAGAGA ACAGGAGTAT ATCAGGCGAC AGCTAGAAGA
1501 GGAGCAGCGG CACTTGGAAG TCCTTCAGCA GCAGCTGCTC CAGGAGCAGG CCATGTTACT
1561 GCATGACCAT AGGAGGCCGC ACCCGCAGCA CTCGCAGCAG CCGCCACCAC CGCAGCAGGA 1621 AAGGAGCAAG CCAAGCTTCC ATGCTCCCGA GCCCAAAGCC CACTACGAGC CTGCTGACCG
1681 AGCGCGAGAG GTTCCTGTGA GAACAACATC TCGCTCCCCT GTTCTGTCCC GTCGAGATTC
1741 CCCACTGCAG GGCAGTGGGC AGCAGAATAG CCAGGCAGGA CAGAGAAACT CCACCAGTAT 1801 TGAGCCCAGG CTTCTGTGGG AGAGAGTGGA GAAGCTGGTG CCCAGACCTG GCAGTGGCAG
1861 CTCCTCAGGG TCCAGCAACT CAGGATCCCA GCCCGGGTCT CACCCTGGGT CTCAGAGTGG 1921 CTCCGGGGAA CGCTTCAGAG TGAGATCATC ATCCAAGTCT GAAGGCTCTC CATCTCAGCG 1981 CCTGGAAAAT GCAGTGAAAA AACCTGAAGA TAAAAAGGAA GTTTTCAGAC CCCTCAAGCC 2041 TGCTGGCGAA GTGGATCTGA CCGCACTGGC CAAAGAGCTT CGAGCAGTGG AAGATGTACG 2101 GCCACCTCAC AAAGTAACGG ACTACTCCTC ATCCAGTGAG GAGTCGGGGA CGACGGATGA 2161 GGAGGACGAC GATGTGGAGC AGGAAGGGGC TGACGAGTCC ACCTCAGGAC CAGAGGACAC 2221 CAGAGCAGCG TCATCTCTGA ATTTGAGCAA TGGTGAAACG GAATCTGTGA AAACCATGAT
2281 TGTCCATGAT GATGTAGAAA GTGAGCCGGC CATGACCCCA TCCAAGGAGG GCACTCTAAT
2341 CGTCCGCCAG ACTCAGTCCG CTAGTAGCAC ACTCCAGAAA CACAAATCTT CCTCCTCCTT 2401 TACACCTTTT ATAGACCCCA GATTACTACA GATTTCTCCA TCTAGCGGAA CAACAGTGAC 2461 ATCTGTGGTG GGATTTTCCT GTGATGGGAT GAGACCAGAA GCCATAAGGC AAGATCCTAC 2521 CCGGAAAGGC TCAGTGGTCA ATGTGAATCC TACCAACACT AGGCCACAGA GTGACACCCC 2581 GGAGATTCGT AAATACAAGA AGAGGTTTAA CTCTGAGATT CTGTGTGCTG CCTTATGGGG 2641 AGTGAATTTG CTAGTGGGTA CAGAGAGTGG CCTGATGCTG CTGGACAGAA GTGGCCAAGG 2701 GAAGGTCTAT CCTCTTATCA ACCGAAGACG ATTTCAACAA ATGGACGTAC TTGAGGGCTT 2761 GAATGTCTTG GTGACAATAT CTGGCAAAAA GGATAAGTTA CGTGTCTACT ATTTGTCCTG 2821 GTTAAGAAAT AAAATACTTC ACAATGATCC AGAAGTTGAG AAGAAGCAGG GATGGACAAC 2881 CGTAGGGGAT TTGGAAGGAT GTGTACATTA TAAAGTTGTA AAATATGAAA GAATCAAATT
2941 TCTGGTGATT GCTTTGAAGA GTTCTGTGGA AGTCTATGCG TGGGCACCAA AGCCATATCA
3001 CAAATTTATG GCCTTTAAGT CATTTGGAGA ATTGGTACAT AAGCCATTAC TGGTGGATCT 3061 CACTGTTGAG GAAGGCCAGA GGTTGAAAGT GATCTATGGA TCCTGTGCTG GATTCCATGC 3121 TGTTGATGTG GATTCAGGAT CAGTCTATGA CATTTATCTA CCAACACATG TAAGAAAGAA 3181 CCCACACTCT ATGATCCAGT GTAGCATCAA ACCCCATGCA ATCATCATCC TCCCCAATAC 3241 AGATGGAATG GAGCTTCTGG TGTGCTATGA AGATGAGGGG GTTTATGTAA ACACATATGG 3301 AAGGATCACC AAGGATGTAG TTCTACAGTG GGGAGAGATG CCTACATCAG TAGCATATAT 3361 TCGATCCAAT CAGACAATGG GCTGGGGAGA GAAGGCCATA GAGATCCGAT CTGTGGAAAC 3421 TGGTCACTTG GATGGTGTGT TCATGCACAA AAGGGCTCAA AGACTAAAAT TCTTGTGTGA 3481 ACGCAATGAC AAGGTGTTCT TTGCCTCTGT TCGGTCTGGT GGCAGCAGTC AGGTTTATTT 3541 CATGACCTTA GGCAGGACTT CTCTTCTGAG CTGGTAGAAG CAGTGTGATC CAGGGATTAC 3601 TGGCCTCCAG AGTCTTCAAG ATCCTGAGAA CTTGGAATTC CTTGTAACTG GAGCTCGGAG 3661 CTGCACCGAG GGCAACCAGG ACAGCTGTGT GTGCAGACCT CATGTGTTGG GTTCTCTCCC 3721 CTCCTTCCTG TTCCTCTTAT ATACCAGTTT ATCCCCATTC TTTTTTTTTT TCTTACTCCA 3781 AAATAAATCA AGGCTGCAAT GCAGCTGGTG CTGTTCAGAT TCCAAAAAAA AAAAAAAACC 3841 ATGGTACCCG GATCCTCGAA TTCC
SEQ ID NO: 15 (CUTL1)
1 CGTCTCAATA TGTCTCAAGA TGGCGGCCAA TGTGGGATCG ATGTTTCAAT ATTGGAAGCG
61 CTTTGATTTA CAGCAGCTGC AGAGAGAACT CGATGCCACC GCAACGGTAT TGGCGAACCG
121 GCAGGATGAA AGTGAGCAGT CCAGAAAGCG GCTTATCGAA CAGAGCCGGG AGTTCAAGAA
181 GAACACTCCA GAGGATTTGC GCAAGCAGGT AGCGCCGCTG CTGAAGAGTT TCCAAGGAGA
241 GATTGATGCA CTGAGTAAAA GAAGCAAGGA AGCTGAAGCA GCTTTCTTGA ATGTCTACAA
301 AAGATTGATT GACGTCCCAG ATCCCGTACC AGCTTTGGAT CTCGGACAGC AACTCCAGCT
361 CAAAGTGCAG CGCCTGCACG ATATTGAAAC AGAGAACCAG AAACTTAGGG AAACTCTGGA
421 AGAATACAAC AAGGAATTTG CTGAAGTGAA AAATCAAGAG GTTACGATAA AAGCACTTAA
481 AGAGAAAATC CGAGAATATG AACAGACACT GAAGAACCAA GCCGAAACCA TAGCTCTTGA
541 GAAGGAACAG AAGTTACAGA ATGACTTTGC AGAAAAGGAG AGAAAGCTGC AGGAGACACA
601 GATGTCCACC ACCTCAAAGC TGGAGGAAGC TGAGCATAAG GTTCAGAGCC TACAAACAGC
661 CCTGGAAAAA ACTCGAACAG AATTATTTGA CCTGAAAACC AAATACGATG AAGAAACTAC
721 TGCAAAGGCC GACGAGATTG AAATGATCAT GACGGACCTT GAAAGGGCAA ACCAGAGGGC
781 AGAGGTGGCT CAGAGAGAGG CGGAGACCTT AAGGGAACAG CTCTCATCGG CCAATCACTC
841 CCTCCAGCTG GCCTCACAGA TCCAGAAGGC ACCAGACGTG GAGCAGGCCA TAGAGGTGCT
901 GACCCGCTCC AGCCTAGAAG TTGAGTTGGC CGCCAAGGAG CGGGAGATCG CACAGCTGGT
961 GGAGGACGTG CAGAGACTCC AGGCCAGCCT CACCAAGCTG CGGGAGAATT CGGCCAGCCA
1021 GATCTCACAG CTTGAGCAGC AGCTGAGCGC CAAAAACAGC ACACTCAAAC AACTGGAAGA
1081 AAAACTCAAA GGCCAGGCTG ACTATGAAGA GGTGAAGAAA GAGCTGAACA TTCTGAAGTC
1141 CATGGAGTTT GCACCGTCCG AGGGCGCTGG GACACAGGAT GCGGCCAAGC CCCTGGAGGT
1201 GCTGTTGCTG GAGAAGAACC GCTCGCTGCA GTCCGAGAAC GCCGCGCTGC GCATCTCCAA
1261 CAGCGACCTG AGCGGACGCT GTGCAGAGCT GCAAGTCCGT ATCACTGAGG CTGTGGCCAC 1321 AGCCACTGAG CAGAGAGAGC TGATCGCCCG CCTGGAGCAG GACCTGAGCA TCATTCAGTC
1381 CATCCAGCGG CCCGATGCCG AGGGTGCCGC TGACCACCGC CTGGAGAAGA TCCCAGAGCC 1441 CATCAAAGAG GCCACTGCCC TATTCTACGG ACCTGCAGCA CCAGCCAGCG GTGCCCTCCC
1501 AGAGGGCCAG GTGGATTCAC TGCTTTCCAT CATCTCCAGC CAGAGGGAGC GCTTCCGTGC 1561 CCGGAACCAG GAGCTTGAGG CCGAGAACCG CCTGGCCCAG CACACCCTCC AGGCCCTGCA 1621 GAGTGAGCTG GACAGCCTGC GCGCCGACAA CATCAAGCTC TTTGAGAAGA TCAAGTTCCT
1681 GCAGAGCTAC CCTGGCCGGG GCAGCGGCAG TGATGACACG GAGCTGCGGT ACTCGTCCCA 1741 GTACGAGGAG CGCCTGGACC CCTTCTCCTC CTTCAGCAAG CGGGAGCGGC AGAGGAAGTA 1801 CCTGAGCTTG AGTCCCTGGG ACAAGGCCAC CCTCAGCATG GGGCGTCTGG TTCTCTCCAA 1861 CAAGATGGCG CGCACCATCG GCTTCTTCTA CACACTGTTC CTGCACTGCC TGGTCTTCCT 1921 GGTGCTCTAC AAGCTGGCAT GGAGCGAGAG CATGGAGAGG GACTGTGCCA CCTTCTGCGC 1981 CAAGAAGTTC GCTGACCACC TGCACAAGTT CCACGAGAAT GACAACGGGG CTGCGGCTGG 2041 TGACTTGTGG CAGTGATACC CCGGGGCCTC CCCCGTGACA GTGACGGCTG CGCCTCCACC 2101 CCGACTGCTC AGTGCATCTA ATCACTTAGA CTCCCCTGAA GAATCCCCCA TGGAAACTGC 2161 CCTTATCCGC TGTCCAGCAG CTGCCAGAGG CCCCAGGTCA CCTCGGGTCC CCTTGAAAGA 2221 ATGTCTCGGT CACATCAGGC CCGCTAGGTC CAGAGAGCGA GCCCCCAATG CCCGGCCAGG 2281 CTAAGCCGCA GAGACCCTCT CAGCCCCCAC CTCAGGTTAG GGCTCTGCCC GCAGCCTGAC 2341 CTCTAGCCCT GGTGGCAGAG GTCCCTCAGC TGCGAGGCTA ATTGGGTGAC CACCGATTCC 2401 AGCTGCGGTT AATCCAGCTT GGGCCTGTCT GCACTGCGAT CCTCTTGGGC TCTCCTAGGA
2461 TCCCCCCATG CCCCGTAAGA GGTGGAAGAC GCTTCCTTCC AGGACAGCAG GCTTTGAGTC 2521 CAGCACCCCC AGCCTGCCTT TGCCACCAGC CCCACCCTGC AGAGTATATG AGGCTTGACA
2581 GAGTCTGCCC CCTCCCCCAC TGCACCCCAA GAGAGAGAGC CCCAGCCAGC GGAACAGTTT 2641 CTATTACCCC CTCCCTGCCC CCAGACCCAT GTGATTTCTG CTTTCTTCTT TAGCAAGATA 2701 TTCTGGTTTC TAGATAAGGA AGAGTCTCTA ATGAGCCCCC GAGCCCCAGT CTCTTCAGAC 2761 TCATGGATTG GTCTGAGGGG TCTGAACGTC TCCTAGCCAA TCAGAACTGG CTGTGGACCA 2821 CCCTAGCACG GCCACCTCTC AGGGCCACTG GCAGGCCTTC CTGAGTTAGA TTTGTAGTTG 2881 CATATTTAGC TTTGCACATT TGAAATAAAC CACGGTTGCA GCCAAAAAAA AAAAAAAAAA 2941 AA
SEQ ID NO: 16 (ZNF324B)
1 ACTTCCGGCG TTGGGACTGT CACTTGGCTG CTCGCGTCAG GCCACACGGG TGGTCTGGGC
61 TGTGGCGCGC GGGTCGGGGC CCGAGGCGGG CGGCCAGGAA GGACCTGATG ACCTTCGAGG
121 ATGTGGCCGT GTACTTCTCC CAGGAGGAGT GGGGGCTCCT GGACACAGCG CAGAGGGCCC
181 TGTACCGCCA CGTGATGCTG GAAAACTTCA CACTTGTGAC CTCACTTGGA CTCTCTACCT
241 CCCGACCTCG TGTGGTCATT CAACTTGAGC GTGGCGAGGA GCCCTGGGTT CCCAGCGGAA
301 AGGACATGAC CCTGGCCAGG AACACCTACG GGAGGCTCAA CTCTGGTTCC TGGAGTTTGA
361 CAGAGGATAG AGATGTTTCT GGAGAATGGC CACGAGCTTT CCCAGATACC CCACCTGGGA
421 TGACTACTAG CGTCTTCCCT GTTGCCGATG CCTGCCACAG TGTAAAAAGC CTGCAGCGAC
481 AACCGGGTGC CTCCCCATCT CAGGAGAGAA AACCCACGGG GGTGTCGGTG ATCTACTGGG
541 AGAGGCTCCT GCTAGGCTCG CGCAGTGACC AGGCCAGCAT CAGCCTGCGA CTGACCTCCC
601 CACTCAGGCC CCCCAAGAGC AGCCGGCCCA GGGAAAAGAC CTTCACAGAG TACCGGGTGC
661 CTGGGAGGCA GCCCAGGACG CCTGAGCGGC AGAAGCCATG TGCACAGGAG GTCCCTGGGA
721 GAGCCTTCGG GAATGCCTCG GACCTGAAGG CCGCCAGTGG TGGCAGGGAT CGCAGAATGG
781 GCGCAGCTTG GCAGGAGCCT CATAGACTCC TCGGTGGCCA GGAGCCCTCG ACCTGGGACG
841 AGCTGGGCGA GGCTCTTCAC GCTGGGGAGA AGTCCTTCGA ATGCAGGGCG TGCAGCAAAG
901 TGTTCGTGAA GAGCTCCGAC CTCCTCAAGC ACCTACGCAC CCACACCGGG GAGCGGCCCT
961 ACGAGTGCAC CCAGTGCGGC AAGGCCTTCA GCCAGACGTC GCACTTGACG CAGCACCAGC
1021 GCATCCACAG CGGCGAGACG CCCTACGCGT GCCCCGTGTG CGGCAAGGCC TTCCGGCATA
1081 GCTCCTCGCT GGTGCGGCAC CAGCGCATCC ACACGGCCGA GAAGTCCTTC CGCTGCTCCG
1141 AGTGCGGCAA GGCCTTCAGC CACGGCTCCA ACCTCAGCCA GCACCGCAAG ATCCACGCGG
1201 GTGGGCGTCC TTATGCTTGC GCACAGTGTG GCCGCCGCTT CTGCCGCAAC TCGCACCTGA
1261 TCCAGCACGA GCGTACGCAC ACAGGCGAGA AGCCCTTCGT ATGCGCGCTC TGCGGTGCTG
1321 CCTTCAGCCA GGGCTCCTCG CTCTTTTTGC ACCAGCGCGT GCACACAGGC GAGAAGCCCT 1381 TCGCCTGCGC CCAGTGCGGC CGCTCCTTTA GCCGCAGCTC CAACCTCACC CAGCACCAGC 1441 TCCTGCACAC GGGCGAGCGG CCCTTCCGCT GCGTGGGCTG TGGCAAGGGT TTCGCCAAGG 1501 GCGCCGTGCT GCTCAGCCAC CGGCGCATTC ACACGGGCGA GAAGCCCTTC GTGTGCACGC 1561 AGTGTGGCCG CGCCTTCCGT GAGCGCCCTG CCCTCTTGCA CCACCAGAGG ATCCACACCA 1621 CAGAGAAGAC CAATGCCGCA GCACCAGACT GCACCCCGGG GCCAGGTTTC CTTCAGGGAC 1681 ATCATCGGAA GGTGCGCCGG GGAGGGAAGC CAAGCCCAGT CCTGAAGCCA GCGAAGGTCT 1741 GAGGTCACAG GTCGCAGCCC AACCCTTTCT TGGCCTTCTG TGAATCCCTT CCACAGCTAA
1801 AGGGTCCGAG TGCTCTTCAG ATCCACGATG GGGAAAAGCT CTGTGCCTGA GAGTCAGGGA
1861 CGAGGGAGAC CCTTTGGCTG TGGTTCCATT TGCAGGTGGG GACAGGATTT GCCAGTTTAG
1921 TCATAGCTCA CACCTCCATC CTCAAAGAGG TAACACTGCA GAAACATCAG AGGGAGGACA
1981 TGTCAGCTGG AACTCTGGTG GGGCTGAGGC TGTAGTTGGG GCCATAGGAC GCCGACAAAG
2041 GCAGCGCTGC ATGGTGGTGC TACTTCATGT GTTATGAGAG TGGATGCTGA GGTGAGGGGG
2101 ATGCGGACAT GGGGTAGGAT GACCTAGAGA AACTTATGAT GTCTGCACAC AAACTGGCCG
2161 CTAGACGGAC GCTGAGGACA TTTTCCCCCT GAGGCCTCTA TTCAAGGCTT CCTGGGGGCC
2221 ATCTCAGCAA ACAGGAGACT ACAGGGGACT GGGGATCAGG GTGTGGCCTG TGAGTGTCAG
2281 CCTCCTCCTC GGAAAAAGAA AAGCTTTGGG TCAACTCAGC ATCATGTTTG CAGATGCTGA
2341 CAGACGGGAT CCTAATGAGA GTCAATGTGT GCTCACTGCC AGCTCCTGGG CTGTGCTCTG
2401 GTCAGCCAGG TGTGAGGGCC TGGCCTGGGG TCACACAGCT GACTCAGGAG AGGAATGCCC
2461 ATGGTTCTCA GCATTGGAAG GACAAACCTA GGATGATGGC TTTCCAGTGG CACTCGTTCA
2521 GGTTTTCGTC CAAGTCTCAG CTTGGCCAAG GCCTGTCGCT CACTCATTTA CAAAAGTCGA 2581 TGTGAGGAGG AGCCTTTACA CCTGTGGAGA CAGTGATAGC TTTGGAGCAG ATAAGGTGGA
2641 GCTGCTCATT TTTGCTGGAT TTGGTGGCCG CTCCCCGCCC CCACCCCCAC CCCCTCCATC
2701 TCACCTTTCC CTTGTTATGC CTCCTCAATT GGAGGCTGGA CAGAGAGCTG AATAGGAAGG
2761 ACTTGCCATT ACCTAAGGCC ATGTGTGACA GCCTCCTGAG GACCTCCCCA ACCCAGTGTG
2821 ATGGGCCTGC ATGGCAGAGA CAAAAGGGTA GACTGGGGGT CATTTGCTTC CTGTGGCCTT
2881 AAGCCTACTA GGCCCCATCC TTACCTGAGA CCTCACCTCC AAGAAATTAA TGGTCTTTTC
2941 AATGGAGAAA AAAAAAGACT AGTATTTGCA ACTTAAAATA GATGTAGTTT CCTTT
Nucleotide Combinations
The table lists every combination of two or more of the gene indices 1 through 11, taken two at a time through eleven at a time: all 55 pairs (1,2; 1,3; ...; 10,11), all 165 combinations of three (1,2,3; ...; 9,10,11), all 330 combinations of four, all 462 combinations of five, all 462 combinations of six, all 330 combinations of seven, all 165 combinations of eight, all 55 combinations of nine, all 11 combinations of ten, and the single combination of all eleven (1,2,3,4,5,6,7,8,9,10,11), for a total of 2,036 combinations.

Claims

What is claimed is:
1. A computer-implemented method for feature selection, comprising: obtaining a dataset comprising data having at least two features and at least two classes of information; computing a margin for each sample in the dataset through a probabilistic model; estimating feature weights globally within a large margin framework through iteration using the computed margin until convergence; and outputting the estimated feature weights.
2. The method according to claim 1, wherein computing the margin for each sample in the dataset through the probabilistic model comprises: defining a margin for a training dataset of the obtained dataset; and computing the margin for the training dataset with respect to a feature weight vector.
3. The method according to claim 2, wherein defining the margin for the training dataset comprises determining a first nearest neighbor and a second nearest neighbor of each sample, wherein the first nearest neighbor is from the same class as the sample and the second nearest neighbor is from a different class as the sample.
4. The method according to claim 3, wherein the training dataset D is defined as D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, where N represents the number of samples taken from the dataset and each sample is represented by J features, where x_n is the n-th data sample and y_n is its corresponding class label; wherein defining the margin for the training dataset comprises: determining the margin ρ_n of each data sample x_n using a distance function, where
ρ_n = d(x_n, NM(x_n)) - d(x_n, NH(x_n)),
where NM is the second nearest neighbor from the different class, NH is the first nearest neighbor, and d(·) is the distance function.
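By way of non-limiting illustration, the margin of claim 4 can be computed as in the following sketch; the use of NumPy and of the block (l1) distance as d(·) are assumptions of the example, not requirements of the claim.

```python
import numpy as np

def hypothesis_margin(X, y, n):
    """Margin of sample n: distance to nearest miss minus distance to nearest hit."""
    d = np.abs(X - X[n]).sum(axis=1)                  # l1 distance from x_n to every sample
    d[n] = np.inf                                     # a sample is not its own neighbor
    nh = np.argmin(np.where(y == y[n], d, np.inf))    # nearest hit NH (same class)
    nm = np.argmin(np.where(y != y[n], d, np.inf))    # nearest miss NM (different class)
    return d[nm] - d[nh]
```

A positive margin indicates that x_n would be correctly classified by a one-nearest-neighbor rule.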
5. The method according to claim 4, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the margin for the training set using a nonnegative vector w, where ρ_n(w) represents the margin of x_n computed with respect to w and is given by ρ_n(w) = w^T z_n, where z_n = |x_n - NM(x_n)| - |x_n - NH(x_n)|.
6. The method according to claim 4, wherein the data of the dataset comprise at least three classes of information, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the margin for the training set using a nonnegative vector w, where ρ_n(w) represents the margin of x_n computed with respect to w and is given by
ρ_n(w) = min_{c ∈ Y, c ≠ y_n} d(x_n, NM^(c)(x_n) | w) - d(x_n, NH(x_n) | w)
       = min_{x_i ∈ D\D_{y_n}} d(x_n, x_i | w) - d(x_n, NH(x_n) | w),
where Y is the set of class labels, NM^(c)(x_n) is the nearest neighbor of x_n from class c, and D_c is a subset of D containing only samples from class c.
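A corresponding non-limiting sketch for the multiclass margin of claim 6, in which the nearest miss is the closest sample drawn from any class other than that of x_n (again assuming an l1 distance for the example):

```python
import numpy as np

def multiclass_margin(X, y, n):
    """Multiclass margin: distance to the nearest sample of any other class minus distance to the nearest hit."""
    d = np.abs(X - X[n]).sum(axis=1)
    d[n] = np.inf                                 # exclude the sample itself
    nh = np.min(np.where(y == y[n], d, np.inf))   # d(x_n, NH(x_n))
    nm = np.min(np.where(y != y[n], d, np.inf))   # min over x_i in D \ D_{y_n}
    return nm - nh
```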
7. The method according to claim 2, wherein defining the margin for the training dataset comprises treating nearest neighbors of each sample as latent variables, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: taking the expectation of the defined margin by averaging out any latent variables.
8. The method according to claim 7, wherein the training dataset D is defined as D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, where N represents the number of samples taken from the dataset and each sample is represented by J features, where x_n is the n-th data sample and y_n is its corresponding class label; wherein taking the expectation of the defined margin comprises: averaging out the latent variables in ρ̄_n(w), which represents the margin of x_n computed with respect to a nonnegative vector w, wherein the averaging out of the latent variables of ρ̄_n(w) is given by ρ̄_n(w) = w^T z̄_n, where
z̄_n = Σ_{i ∈ M_n} P(x_i = NM(x_n) | w) |x_n - x_i| - Σ_{i ∈ H_n} P(x_i = NH(x_n) | w) |x_n - x_i|,
M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, wherein P(x_i = NM(x_n) | w) and P(x_i = NH(x_n) | w) are the probabilities that sample x_i is the nearest miss (NM) or nearest hit (NH) of x_n, respectively.
9. The method according to claim 8, wherein the probabilities are estimated through a standard kernel density estimation method.
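By way of non-limiting illustration of claims 7-9, the latent nearest-neighbor assignments can be averaged out with kernel-estimated probabilities as sketched below; the exponential kernel exp(-d/sigma) and the bandwidth sigma are assumed choices standing in for whatever standard kernel density estimate is used.

```python
import numpy as np

def expected_margin_vector(X, y, w, n, sigma=1.0):
    """z_n of claim 8: probability-weighted difference of |x_n - x_i| vectors."""
    diffs = np.abs(X - X[n])             # |x_n - x_i| for every i, per feature
    d_w = diffs @ w                      # weighted distances d(x_n, x_i | w)
    k = np.exp(-d_w / sigma)             # kernel scores (assumed exponential kernel)
    k[n] = 0.0                           # a sample is not its own neighbor
    miss, hit = (y != y[n]), (y == y[n])
    hit[n] = False
    p_m = k * miss / (k * miss).sum()    # P(x_i = NM(x_n) | w) over i in M_n
    p_h = k * hit / (k * hit).sum()      # P(x_i = NH(x_n) | w) over i in H_n
    return p_m @ diffs - p_h @ diffs     # z_n; the expected margin is w . z_n
```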
10. The method according to claim 2, wherein defining the margin for the training dataset provides a scaled feature space, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: linking the scaled feature space to a feature weight vector.
11. The method according to claim 2, wherein defining the margin for the training dataset provides a scaled feature space, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the scaled feature space using a nonnegative vector to construct a weighted feature space, where the magnitude of each element of the nonnegative vector reflects relevance of a corresponding feature in a learning process.
12. The method according to claim 1, wherein estimating the feature weights globally within the large margin framework through iteration using the computed margin until convergence comprises:
(a) setting an initial value for a feature weight of the computed margin;
(b) solving the computed margin using the initial value for the feature weight; (c) updating the computed margin by solving an optimization function from the computed margin using the initial value;
(d) solving the computed margin using a next value for the feature weight;
(e) updating the computed margin by solving the optimization function from the computed margin using the next value; and
(f) repeating steps (d) and (e) until convergence.
13. The method according to claim 12, wherein solving the optimization function comprises determining a solution of a vector v from
v ← v - η (λ1 - Σ_{n=1}^N (1 / (1 + exp(w^T z̄_n))) z̄_n) ⊙ v,
where ⊙ is a Hadamard operator and η is a learning rate, for w_j = v_j^2, 1 ≤ j ≤ J, where w_j is a j-th feature weight of a feature weight vector of the computed margin, where J is a total number of features of the data of the dataset, and where
z̄_n = Σ_{i ∈ M_n} P(x_i = NM(x_n) | w) |x_n - x_i| - Σ_{i ∈ H_n} P(x_i = NH(x_n) | w) |x_n - x_i|,
where M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, P(x_i = NM(x_n) | w) is a probability that sample x_i is the nearest miss (NM) of x_n, and P(x_i = NH(x_n) | w) is a probability that sample x_i is the nearest hit (NH) of x_n, where w is the feature weight vector for the computed margin, where x_n is the n-th data sample of the dataset having N samples, and y_n is its corresponding class label.
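A non-limiting sketch of the iteration of claims 12 and 13 follows, reusing the expected_margin_vector sketch above. The logistic-loss gradient term, the fixed learning rate eta, the regularization constant lam, and the convergence test are assumptions of the example, not a definitive statement of the claimed optimization.

```python
import numpy as np

def estimate_weights(X, y, eta=0.1, lam=1.0, sigma=1.0, tol=1e-4, max_iter=100):
    """Iterate the v-update of claim 13 until the feature weights stop changing."""
    N, J = X.shape
    v = np.ones(J)                                    # step (a): initial value
    for _ in range(max_iter):
        w = v ** 2                                    # w_j = v_j^2 keeps weights nonnegative
        Z = np.array([expected_margin_vector(X, y, w, n, sigma) for n in range(N)])
        grad = Z.T @ (1.0 / (1.0 + np.exp(Z @ w)))    # assumed logistic-loss gradient term
        v_new = v - eta * (lam * np.ones(J) - grad) * v   # Hadamard product with v
        if np.abs(v_new ** 2 - w).max() < tol:        # step (f): convergence in the weights
            return v_new ** 2
        v = v_new                                     # steps (d)-(e): next value, update
    return v ** 2
```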
14. The method according to claim 1, wherein obtaining the dataset comprises obtaining gene expression data from a microarray, the method further comprising identifying a group of genes or profile for detecting or prognosticating a disease or condition using the estimated feature weights.
15. "I he method according to claim 1. wherein obtaining the dataset comprises obtaining data from an image or sound recording of a physical object, the method further comprising identifying the image or sound recording for a pattern or object recognition program using the estimated feature weights.
16. A computer-readable medium having computer-executable instructions for performing the method of any one of claims 1-15.
17. A computer-readable medium having computer-executable components executing instructions for performing the method of any one of claims 1-15.
18. A computer-implemented method for identifying relevant information within a dataset comprising data having at least two features and at least two classes of information, the method comprising: selecting a feature subset from the at least two features of the dataset for processing in a learning machine by executing a feature selection algorithm on the dataset to identify features belonging to the subset, wherein executing the feature selection algorithm comprises: determining a margin for each sample in the dataset by local learning through a probabilistic model; estimating feature weights globally within a large margin framework through iteration using the determined margin until convergence; and applying the converged estimated feature weights to the dataset; processing the feature subset to identify the relevant information; and outputting the relevant information for display.
19. The method according to claim 18, wherein the learning machine is a support vector machine.
20. The method according to claim 18, wherein the learning machine utilizes a logistic regression formulation.
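By way of non-limiting illustration of claims 18-20, the converged weights can be used to select a feature subset that is then processed by a learning machine; the scikit-learn support vector machine and the cutoff k below are assumed choices for the example.

```python
import numpy as np
from sklearn.svm import SVC

def train_on_selected(X, y, weights, k=50):
    """Keep the k features with the largest estimated weights, then fit an SVM on them."""
    top = np.argsort(weights)[::-1][:k]           # indices of the k highest-weight features
    clf = SVC(kernel="linear").fit(X[:, top], y)  # the learning machine of claim 19
    return clf, top
```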
21. The method according to claim 18, wherein the dataset comprises gene expression data obtained from a microarray and the relevant information comprises identities of a group of genes or profile for detecting or prognosticating a disease or condition.
22. The method according to claim 18, wherein the dataset comprises data obtained from an image or sound recording of a physical object, and the relevant information comprises an identity of the image or sound recording for a pattern or object recognition program.
23. The method according to claim 18, wherein determining the margin for each sample in the dataset by local learning through the probabilistic model comprises: defining a margin for a training dataset of the dataset; and computing the margin for the training dataset with respect to a feature weight vector.
24. The method according to claim 23, wherein the training dataset D is defined as D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, where N represents the number of samples taken from the dataset and each sample is represented by J features, where x_n is the n-th data sample and y_n is its corresponding class label; wherein defining the margin for the training dataset comprises: determining the margin ρ_n of each data sample x_n using a distance function for a first nearest neighbor and a second nearest neighbor of each sample, wherein the first nearest neighbor is from the same class as the sample and the second nearest neighbor is from a different class as the sample, where ρ_n = d(x_n, NM(x_n)) - d(x_n, NH(x_n)), where NM is the second nearest neighbor from the different class, NH is the first nearest neighbor, and d(·) is the distance function.
25. The method according to claim 24, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the margin for the training set using a nonnegative vector w, where ρ_n(w) represents the margin of x_n computed with respect to w and is given by ρ_n(w) = w^T z_n, where z_n = |x_n - NM(x_n)| - |x_n - NH(x_n)|.
26. The method according to claim 24, wherein the data of the dataset have at least three classes of information, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the margin for the training set using a nonnegative vector w, where ρ_n(w) represents the margin of x_n computed with respect to w and is given by
ρ_n(w) = min_{c ∈ Y, c ≠ y_n} d(x_n, NM^(c)(x_n) | w) - d(x_n, NH(x_n) | w)
       = min_{x_i ∈ D\D_{y_n}} d(x_n, x_i | w) - d(x_n, NH(x_n) | w),
where Y is the set of class labels, NM^(c)(x_n) is the nearest neighbor of x_n from class c, and D_c is a subset of D containing only samples from class c.
27. The method according to claim 23, wherein the training dataset D is defined as D = {(x_n, y_n)}_{n=1}^N ⊂ R^J × {±1}, where N represents the number of samples taken from the dataset and each sample is represented by J features, where x_n is the n-th data sample and y_n is its corresponding class label; and wherein defining the margin for the training dataset comprises treating nearest neighbors of each sample as latent variables, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: averaging out the latent variables in ρ̄_n(w), which represents the margin of x_n computed with respect to a nonnegative vector w, wherein the averaging out of the latent variables of ρ̄_n(w) is given by ρ̄_n(w) = w^T z̄_n, where
z̄_n = Σ_{i ∈ M_n} P(x_i = NM(x_n) | w) |x_n - x_i| - Σ_{i ∈ H_n} P(x_i = NH(x_n) | w) |x_n - x_i|,
M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, wherein P(x_i = NM(x_n) | w) and P(x_i = NH(x_n) | w) are the probabilities that sample x_i is the nearest miss (NM) or nearest hit (NH) of x_n, respectively.
28. The method according to claim 27, wherein the probabilities are estimated through a standard kernel density estimation method.
29. The method according to claim 23, wherein defining the margin for the training dataset provides a scaled feature space, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: linking the scaled feature space to a feature weight vector.
30. The method according to claim 23, wherein defining the margin for the training dataset provides a scaled feature space, wherein computing the margin for the training dataset with respect to the feature weight vector comprises: parameterizing the scaled feature space using a nonnegative vector to construct a weighted feature space, where the magnitude of each element of the nonnegative vector reflects relevance of a corresponding feature in a learning process.
31. The method according to claim 18, wherein estimating the feature weights globally within the large margin framework through iteration using the computed margin until convergence comprises:
(a) setting an initial value for a feature weight of the computed margin;
(b) solving the computed margin using the initial value for the feature weight;
(c) updating the computed margin by solving an optimization function from the computed margin using the initial value;
(d) solving the computed margin using a next value for the feature weight;
(e) updating the computed margin by solving the optimization function from the computed margin using the next value; and
(f) repeating steps (d) and (e) until convergence.
32. The method according to claim 31, wherein solving the optimization function comprises determining a solution of a vector v from
v ← v - η (λ1 - Σ_{n=1}^N (1 / (1 + exp(w^T z̄_n))) z̄_n) ⊙ v,
where ⊙ is a Hadamard operator and η is a learning rate, for w_j = v_j^2, 1 ≤ j ≤ J, where w_j is a j-th feature weight of a feature weight vector of the computed margin, where J is a total number of features of the data of the dataset, and where
z̄_n = Σ_{i ∈ M_n} P(x_i = NM(x_n) | w) |x_n - x_i| - Σ_{i ∈ H_n} P(x_i = NH(x_n) | w) |x_n - x_i|,
where M_n = {i : 1 ≤ i ≤ N, y_i ≠ y_n}, H_n = {i : 1 ≤ i ≤ N, y_i = y_n, i ≠ n}, P(x_i = NM(x_n) | w) is a probability that sample x_i is the nearest miss (NM) of x_n, and P(x_i = NH(x_n) | w) is a probability that sample x_i is the nearest hit (NH) of x_n, where w is the feature weight vector for the computed margin, where x_n is the n-th data sample of the dataset having N samples, and y_n is its corresponding class label.
33. A computer-readable medium having computer-executable instructions for performing the method of any one of claims 18-32.
34. A computer-readable medium having computer-executable components executing instructions for performing the method of any one of claims 18-32.
35. An article of manufacture comprising a plurality of individual gene-product detectors wherein: each individual gene-product detector is directed specifically to detection of product(s) of a particular gene; and together the plurality of gene-product detectors are directed to detection of product(s) of a set of genes, said set of genes consisting of (a) any combination of two or more genes selected from the group consisting of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, and (b) up to 2000 additional genes.
36. The article of manufacture according to claim 35, wherein the combination of two or more genes selected from the group consisting of LOC58509, CEGP1, AL080059, ATP5E, and PRAME is:
LOC58509 and CEGP1;
LOC58509 and AL080059;
LOC58509 and ATP5E;
LOC58509 and PRAME;
CEGP1 and AL080059;
CEGP1 and ATP5E;
CEGP1 and PRAME;
AL080059 and ATP5E;
AL080059 and PRAME;
ATP5E and PRAME;
LOC58509, CEGP1 and AL080059;
LOC58509, CEGP1 and ATP5E;
LOC58509, CEGP1 and PRAME;
LOC58509, AL080059 and ATP5E;
LOC58509, AL080059 and PRAME;
LOC58509, ATP5E and PRAME;
CEGP1, AL080059 and ATP5E;
CEGP1, AL080059 and PRAME;
CEGP1, ATP5E and PRAME;
AL080059, ATP5E and PRAME;
LOC58509, CEGP1, AL080059 and ATP5E;
LOC58509, CEGP1, AL080059 and PRAME;
LOC58509, CEGP1, ATP5E and PRAME;
LOC58509, AL080059, ATP5E and PRAME;
CEGP1, AL080059, ATP5E and PRAME; or
LOC58509, CEGP1, AL080059, ATP5E and PRAME.
37. The article of manufacture according to any of claims 35-36 wherein the number of additional genes is 2000 or less, 1500 or less, 1000 or less, 500 or less, 250 or less, 100 or less, 50 or less, 25 or less, 10 or less, 5 or less, or none.
38. The article of manufacture according to any of claims 35-37 wherein the combination of two or more genes selected from the group consisting of LOC58509, CEGP1, AL080059, ATP5E, and PRAME is:
LOC58509, CEGP1, AL080059 and ATP5E; or LOC58509, CEGP1, AL080059, ATP5E and PRAME; and the number of additional genes is 500 or less.
39. An article of manufacture comprising a plurality of individual gene-product detectors wherein: each individual gene-product detector is directed specifically to detection of product(s) of a particular gene; and together the plurality of gene-product detectors are directed to detection of product(s) of a set of genes, said set of genes consisting of (a) any combination of two or more genes selected from the group consisting of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, and (b) up to 2000 additional genes.
40. The article of manufacture according to claim 39 wherein the combination of two or more genes selected from the group consisting of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B is: any combination of two genes represented in the Nucleotide Combinations table, any combination of three genes represented in the Nucleotide Combinations table, any combination of four genes represented in the Nucleotide Combinations table, any combination of five genes represented in the Nucleotide Combinations table, any combination of six genes represented in the Nucleotide Combinations table, any combination of seven genes represented in the Nucleotide Combinations table, any combination of eight genes represented in the Nucleotide Combinations table, any combination of nine genes represented in the Nucleotide Combinations table, any combination of ten genes represented in the Nucleotide Combinations table, or the combination of eleven genes represented in the Nucleotide Combinations table.
41. The article of manufacture according to any of claims 39-40 wherein the number of additional genes is 2000 or less, 1500 or less, 1000 or less, 500 or less, 250 or less, 100 or less, 50 or less, 25 or less, 10 or less, 5 or less, or none.
42. The article of manufacture according to any of claims 39-41 wherein the combination of two or more genes selected from the group consisting of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B is:
RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, MAP4K4, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, CUTL1, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, and ZNF324B;
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, and CUTL1; or
PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B;
and the number of additional genes is 1000 or less.
43. The article of manufacture according to any of claims 39-42 wherein the combination of two or more genes selected from the group consisting of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B includes any of the genes except PAK3; any of the genes except RPL23; any of the genes except EI24; any of the genes except TGFB3; any of the genes except RBM34; any of the genes except PCOLN3; any of the genes except FUT7; any of the genes except RICS Rho; any of the genes except MAP4K4; any of the genes except CUTL1; or any of the genes except ZNF324B.
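For illustration only (not part of the claims): the ten-gene combinations recited in claim 42 are the eleven "leave-one-out" subsets of the full eleven-gene prostate panel, which is also the pattern claim 43 describes. A minimal sketch that enumerates them; the panel names come from the claims, everything else is illustrative:

```python
PANEL = ["PAK3", "RPL23", "EI24", "TGFB3", "RBM34", "PCOLN3",
         "FUT7", "RICS Rho", "MAP4K4", "CUTL1", "ZNF324B"]

# Each ten-gene combination omits exactly one gene from the eleven-gene panel.
for omitted in PANEL:
    subset = [g for g in PANEL if g != omitted]
    print(f"except {omitted}: " + ", ".join(subset))
```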
44. An article of manufacture comprising a plurality of individual gene-product detectors wherein: each individual gene-product detector is directed specifically to detection of product(s) of a particular gene; and together the plurality of gene-product detectors are directed to detection of product(s) of a set of genes, said set of genes consisting of (a) any combination of two or more genes selected from the group consisting of TGFB3, PAK3, RBM34, RPL23, and EI24, and (b) up to 2000 additional genes.
45. The article of manufacture according to claim 44 wherein the combination of two or more genes selected from the group consisting of TGFB3, PAK3, RBM34, RPL23, and EI24 is:
TGFB3 and PAK3;
TGFB3 and RBM34;
TGFB3 and RPL23;
TGFB3 and EI24;
PAK3 and RBM34;
PAK3 and RPL23;
PAK3 and EI24;
RBM34 and RPL23;
RBM34 and EI24;
RPL23 and EI24;
TGFB3, PAK3 and RBM34;
TGFB3, PAK3 and RPL23;
TGFB3, PAK3 and EI24;
TGFB3, RBM34 and RPL23;
TGFB3, RBM34 and EI24;
TGFB3, RPL23 and EI24;
PAK3, RBM34 and RPL23;
PAK3, RBM34 and EI24;
PAK3, RPL23 and EI24;
RBM34, RPL23 and EI24;
TGFB3, PAK3, RBM34 and RPL23;
TGFB3, PAK3, RBM34 and EI24;
TGFB3, PAK3, RPL23 and EI24;
TGFB3, RBM34, RPL23 and EI24;
PAK3, RBM34, RPL23 and EI24; or
TGFB3, PAK3, RBM34, RPL23 and EI24.
46. The article of manufacture according to any of claims 44-45 wherein the number of additional genes is 2000 or less, 1500 or less, 1000 or less, 500 or less, 250 or less, 100 or less, 50 or less, 25 or less, 10 or less, 5 or less, or none.
47. The article of manufacture according to any of claims 44-46 wherein the combination of two or more genes selected from the group consisting of TGFB3, PAK3, RBM34, RPL23, and EI24 is:
TGFB3, PAK3, RBM34 and RPL23; TGFB3, PAK3, RBM34 and EI24; TGFB3, PAK3, RPL23 and EI24; TGFB3, RBM34, RPL23 and EI24; PAK3, RBM34, RPL23 and EI24; or TGFB3, PAK3, RBM34, RPL23 and EI24;
and the number of additional genes is 500 or less.
48. The article of manufacture according to any of claims 35-47 wherein the gene-product detectors are antibodies or polynucleotide probes.
49. The article of manufacture according to any of claims 35-47 wherein the gene-product detectors are polynucleotide probes that hybridize with the targeted gene product(s).
50. A method of assigning a prognosis class to a patient having breast cancer comprising: a) obtaining data relating to said patient, wherein the data includes a gene expression profile for a plurality of genes (i) comprising LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or (ii) consisting of any two, three, four, or five of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and b) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile, wherein the subset (i) comprises gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or (ii) consists of gene expression levels for any two, three, four, or five of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and wherein the prognosis class relates to risk of recurrence of breast cancer.
51. The method of claim 50, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for (i) LOC58509, CEGP1, AL080059, ATP5E, and PRAME or (ii) LOC58509, CEGP1, AL080059, and ATP5E.
52. A method of providing treatment to a breast cancer patient comprising: a) obtaining a biological sample from the breast cancer patient; b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes (i) comprising LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or (ii) consisting of any two, three, four, or five of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile, wherein the subset (i) comprises gene expression levels for LOC58509, CEGP1, AL080059, ATP5E, and PRAME, or (ii) consists of gene expression levels for any two, three, four, or five of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and d) providing treatment to the breast cancer patient based wholly or in part on the patient's prognosis class.
53. The method of claim 52, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for (i) LOC58509, CEGP1, AL080059, ATP5E, and PRAME or (ii) LOC58509, CEGP1, AL080059, and ATP5E.
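For illustration only (not part of the claims): claims 50-53 recite classifying a patient from a five-gene expression subset but do not prescribe a particular classifier. The sketch below assumes a simple nearest-centroid rule; the reference centroids, normalization, and all numeric values are hypothetical placeholders, not values taken from this document.

```python
import numpy as np

GENES = ["LOC58509", "CEGP1", "AL080059", "ATP5E", "PRAME"]

def classify_prognosis(profile, good_centroid, poor_centroid):
    """Assign a prognosis class by Euclidean distance to per-class mean profiles.

    profile: dict mapping gene symbol -> normalized expression level.
    good_centroid / poor_centroid: length-5 arrays of mean expression for each
    reference class (hypothetical output of training on labeled samples).
    """
    x = np.array([profile[g] for g in GENES])
    d_good = np.linalg.norm(x - good_centroid)
    d_poor = np.linalg.norm(x - poor_centroid)
    return "good prognosis" if d_good < d_poor else "poor prognosis"

# Hypothetical usage with placeholder expression values and centroids:
patient = {"LOC58509": 0.8, "CEGP1": 1.3, "AL080059": 0.4, "ATP5E": 1.1, "PRAME": 0.2}
print(classify_prognosis(patient, np.full(5, 1.0), np.full(5, 0.3)))
```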
54. An article of manufacture comprising a plurality of individual means for detecting product(s) of a gene wherein: a) each individual means for detecting is directed specifically to detection of product(s) of a particular gene, and b) the plurality of individual means for detecting the product(s) of a gene are directed specifically to detection of gene product(s) of (i) any two, three, four, or five of LOC58509, CEGP1, AL080059, ATP5E, and PRAME, and (ii) up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes.
55. The article of manufacture of claim 54, wherein the plurality of individual means for detecting the product(s) of a gene are directed specifically to detection of gene product(s) of LOC58509, CEGP1, AL080059, ATP5E, and PRAME and up to 500 additional genes.
56. The article of manufacture or an apparatus according to any of claims 54-55, wherein each individual means for detecting is (a) a set of one or more components that facilitates a specific method of analysis, or (b) a polynucleotide probe or a plurality of polynucleotide probes, optionally immobilized on a solid support, said probes comprising LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME or fragments thereof.
57. A kit comprising from one to five containers, each container comprising a buffer and at least one polynucleotide encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME or fragments of said polynucleotide that hybridize with gene products of LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME.
58. The kit according to claim 57, wherein said kit contains fewer than five containers and each container comprises a buffer and a combination of any two, three, four, or five polynucleotides: a) encoding LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME; or b) that hybridize with LOC58509, CEGP1, AL080059, ATP5E and, optionally, PRAME.
59. The kit according to claim 58, wherein said combination of polynucleotides is:
LOC58509 and CEGP1;
LOC58509 and AL080059;
LOC58509 and ATP5E;
LOC58509 and PRAME;
CEGP1 and AL080059;
CEGP1 and ATP5E;
CEGP1 and PRAME;
AL080059 and ATP5E;
AL080059 and PRAME;
ATP5E and PRAME;
LOC58509, CEGP1 and AL080059;
LOC58509, CEGP1 and ATP5E;
LOC58509, CEGP1 and PRAME;
LOC58509, AL080059 and ATP5E;
LOC58509, AL080059 and PRAME;
LOC58509, ATP5E and PRAME;
CEGP1, AL080059 and ATP5E;
CEGP1, AL080059 and PRAME;
CEGP1, ATP5E and PRAME;
AL080059, ATP5E and PRAME;
LOC58509, CEGP1, AL080059 and ATP5E;
LOC58509, CEGP1, AL080059 and PRAME;
LOC58509, CEGP1, ATP5E and PRAME;
LOC58509, AL080059, ATP5E and PRAME;
CEGP1, AL080059, ATP5E and PRAME; or
LOC58509, CEGP1, AL080059, ATP5E and PRAME.
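For illustration only (not part of the claims): the 26 polynucleotide combinations enumerated in claim 59 are exactly the subsets of size two through five of the five marker genes, since C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26. A minimal sketch that reproduces the enumeration:

```python
from itertools import combinations

GENES = ["LOC58509", "CEGP1", "AL080059", "ATP5E", "PRAME"]

# Generate every combination of two, three, four, or five marker genes.
subsets = [c for k in range(2, 6) for c in combinations(GENES, k)]
assert len(subsets) == 26  # matches the 26 combinations listed in claim 59

for subset in subsets:
    print(", ".join(subset))
```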
60. A method of providing a prognosis for a patient comprising determining the expression level(s) for one or more genes and assigning a prognosis to said patient on the basis of the expression levels of said genes, wherein said genes and expression levels are:
[Table of gene symbols and expression levels; reproduced as image imgf000095_0001 in the original publication]
or
[Table of gene symbols and expression levels; reproduced as image imgf000095_0002 in the original publication]
wherein an expression profile as set forth in either of the above tables indicates that the patient has an increased likelihood for a recurrence of cancer.
61. The method according to claim 60, wherein a patient identified as having a bad prognosis is subjected to increased surveillance for the recurrence of cancer.
62. The method according to claim 61, wherein said increased surveillance for the recurrence of cancer comprises increasing the frequency of testing for prostate specific antigen (PSA), body scans, or digital rectal examination.
63. The method according to claim 60, 61, or 62, wherein the following gene expression profile is used to determine the patient's prognosis:
[Table of gene symbols and expression levels; reproduced as image imgf000096_0001 in the original publication]
64. The method according to claim 60, 61, or 62, wherein the following gene expression profile is used to determine the patient's prognosis:
[Table of gene symbols and expression levels; reproduced as image imgf000096_0002 in the original publication]
65. The method according to claim 60, 61, or 62, further comprising the use of a post-operative nomogram evaluation of the patient for the assessment of the patient's prognosis in combination with the following expression profile:
[Table of gene symbols and expression levels; reproduced as images imgf000096_0003 and imgf000097_0001 in the original publication]
or
[Table of gene symbols and expression levels; reproduced as image imgf000097_0002 in the original publication]
66. A method of assigning treatment to a prostate cancer patient comprising: a) assigning a prognosis class to the patient in accordance with any one of claims 48-53; and b) providing treatment to the prostate cancer patient.
67. A kit comprising from one to eleven containers, each container comprising a buffer and at least one polynucleotide selected from SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16 or polynucleotides that hybridize with SEQ ID NO: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16.
68. The kit according to claim 67, wherein said kit contains fewer than eleven containers and each container comprises a buffer and a combination of any two, three, four, five, six, seven, eight, nine, ten or eleven polynucleotides as set forth in the Nucleotide Combinations Table.
69. A kit comprising from one to five containers, wherein each container comprises a buffer and any combination of two, three, four or five polynucleotides or polynucleotides that hybridize thereto, wherein said combination of polynucleotides is:
TGFB3 and PAK3;
TGFB3 and RBM34;
TGFB3 and RPL23;
TGFB3 and EI24;
PAK3 and RBM34;
PAK3 and RPL23;
PAK3 and EI24;
RBM34 and RPL23;
RBM34 and EI24;
RPL23 and EI24;
TGFB3, PAK3 and RBM34;
TGFB3, PAK3 and RPL23;
TGFB3, PAK3 and EI24;
TGFB3, RBM34 and RPL23;
TGFB3, RBM34 and EI24;
TGFB3, RPL23 and EI24;
PAK3, RBM34 and RPL23;
PAK3, RBM34 and EI24;
PAK3, RPL23 and EI24;
RBM34, RPL23 and EI24;
TGFB3, PAK3, RBM34 and RPL23;
TGFB3, PAK3, RBM34 and EI24;
TGFB3, PAK3, RPL23 and EI24;
TGFB3, RBM34, RPL23 and EI24;
PAK3, RBM34, RPL23 and EI24; or
TGFB3, PAK3, RBM34, RPL23 and EI24.
70. A method of assigning a prognosis class to a patient having prostate cancer comprising: a) obtaining data relating to said patient, wherein the data includes a gene expression profile for a plurality of genes (i) comprising PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, or (ii) consisting of any two, three, four, five, six, seven, eight, nine, ten, or eleven of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and b) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile and optionally based also on a post-operative nomogram, wherein the subset (i) comprises gene expression levels for PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, or (ii) consists of gene expression levels for any two, three, four, five, six, seven, eight, nine, ten, or eleven of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and wherein the prognosis class relates to risk of recurrence of prostate cancer.
71. The method of claim 70, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B.
72. A method of providing treatment to a prostate cancer patient comprising: a) obtaining a biological sample from the prostate cancer patient; b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes (i) comprising PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, or (ii) consisting of any two, three, four, five, six, seven, eight, nine, ten, or eleven of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile and optionally based also on a post-operative nomogram, wherein the subset (i) comprises gene expression levels for PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, or (ii) consists of gene expression levels for any two, three, four, five, six, seven, eight, nine, ten, or eleven of PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and d) providing treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
73. The method of claim 72, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for PAK3, RPL23, EI24, TGFB3, RBM34, PCOLN3, FUT7, RICS Rho, MAP4K4, CUTL1, and ZNF324B.
74. A method of assigning a prognosis class to a patient having prostate cancer comprising: a) obtaining data relating to said patient, wherein the data includes a gene expression profile for a plurality of genes (i) comprising TGFB3, PAK3, RBM34, RPL23, and EI24, or (ii) consisting of any two, three, four, or five of TGFB3, PAK3, RBM34, RPL23, and EI24, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and b) classifying the patient as belonging to a particular prognosis class based upon a subset of the gene expression profile and optionally based also on a post-operative nomogram, wherein the subset (i) comprises gene expression levels for TGFB3, PAK3, RBM34, RPL23, and EI24, or (ii) consists of gene expression levels for any two, three, four, or five of TGFB3, PAK3, RBM34, RPL23, and EI24, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and wherein the prognosis class relates to risk of recurrence of prostate cancer.
75. The method of claim 74, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for TGFB3, PAK3, RBM34, RPL23, and EI24.
76. A method of providing treatment to a prostate cancer patient comprising: a) obtaining a biological sample from the prostate cancer patient; b) analyzing the biological sample to obtain a gene expression profile for a plurality of genes (i) comprising TGFB3, PAK3, RBM34, RPL23, and EI24, or (ii) consisting of any two, three, four, or five of TGFB3, PAK3, RBM34, RPL23, and EI24, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; c) classifying the patient as belonging to a prognosis class based upon a subset of the gene expression profile and optionally based also on a post-operative nomogram, wherein the subset (i) comprises gene expression levels for TGFB3, PAK3, RBM34, RPL23, and EI24, or (ii) consists of gene expression levels for any two, three, four, or five of TGFB3, PAK3, RBM34, RPL23, and EI24, along with up to 0, 5, 10, 25, 50, 100, 250, 500, 1000, 1500, or 2000 additional genes; and d) providing treatment to the prostate cancer patient based wholly or in part on the patient's prognosis class.
77. The method of claim 76, wherein said classifying is based on a subset of the gene expression profile consisting of gene expression levels for TGFB3, PAK3, RBM34, RPL23, and EI24.
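For illustration only (not part of the claims): claims 74-77 recite classification based on the five prostate marker genes, optionally combined with a post-operative nomogram, but do not specify how the two inputs are combined. The sketch below assumes a hypothetical linear blend; the weighting, normalization scales, and decision threshold are placeholders, not values from this document.

```python
GENES = ["TGFB3", "PAK3", "RBM34", "RPL23", "EI24"]

def prognosis_class(expression, nomogram_risk=None, threshold=0.5):
    """Return 'high risk' or 'low risk' of prostate cancer recurrence.

    expression: dict mapping each marker gene to a normalized expression
    level in [0, 1]; nomogram_risk: optional post-operative nomogram
    probability in [0, 1]. Both scales are assumptions of this sketch.
    """
    # Hypothetical panel score: mean expression across the five markers.
    score = sum(expression[g] for g in GENES) / len(GENES)
    if nomogram_risk is not None:
        # Assumed 50/50 blend of expression score and nomogram output.
        score = 0.5 * score + 0.5 * nomogram_risk
    return "high risk" if score > threshold else "low risk"

# Hypothetical usage with placeholder values:
print(prognosis_class({g: 0.6 for g in GENES}, nomogram_risk=0.7))
```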
PCT/US2008/084325 2007-11-21 2008-11-21 Methods of feature selection through local learning; breast and prostate cancer prognostic markers WO2009067655A2 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US98959207P 2007-11-21 2007-11-21
US60/989,592 2007-11-21
US4023708P 2008-03-28 2008-03-28
US4023208P 2008-03-28 2008-03-28
US61/040,232 2008-03-28
US61/040,237 2008-03-28

Publications (2)

Publication Number Publication Date
WO2009067655A2 true WO2009067655A2 (en) 2009-05-28
WO2009067655A3 WO2009067655A3 (en) 2009-09-03

Family

ID=40668094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/084325 WO2009067655A2 (en) 2007-11-21 2008-11-21 Methods of feature selection through local learning; breast and prostate cancer prognostic markers

Country Status (1)

Country Link
WO (1) WO2009067655A2 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065535A1 (en) * 2001-05-01 2003-04-03 Structural Bioinformatics, Inc. Diagnosing inapparent diseases from common clinical tests using bayesian analysis
US20040265830A1 (en) * 2001-10-17 2004-12-30 Aniko Szabo Methods for identifying differentially expressed genes by multivariate analysis of microaaray data
US20030233197A1 (en) * 2002-03-19 2003-12-18 Padilla Carlos E. Discrete bayesian analysis of data
US20040209290A1 (en) * 2003-01-15 2004-10-21 Cobleigh Melody A. Gene expression markers for breast cancer prognosis
US20050048542A1 (en) * 2003-07-10 2005-03-03 Baker Joffre B. Expression profile algorithm and test for cancer prognosis

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940383B2 (en) 2010-03-12 2018-04-10 Medisapiens Oy Method, an arrangement and a computer program product for analysing a biological or medical sample
US9020934B2 (en) 2010-03-12 2015-04-28 Medisapiens Oy Method, an arrangement and a computer program product for analysing a biological or medical sample
WO2011110751A1 (en) * 2010-03-12 2011-09-15 Medisapiens Oy A method, an arrangement and a computer program product for analysing a biological or medical sample
WO2012107786A1 (en) 2011-02-09 2012-08-16 Rudjer Boskovic Institute System and method for blind extraction of features from measurement data
WO2015188395A1 (en) * 2014-06-13 2015-12-17 周家锐 Big data oriented metabolome feature data analysis method and system thereof
EA037995B1 (en) * 2015-02-24 2021-06-21 Рупрехт-Карлс-Университет Гейдельберг Biomarker panel for the detection of cancer
JP2021052777A (en) * 2015-02-24 2021-04-08 ルプレクト−カールズ−ウニベルシタット ハイデルベルク Biomarker panel for detection of cancer
JP2018509934A (en) * 2015-02-24 2018-04-12 ルプレクト−カールズ−ウニベルシタット ハイデルベルク Biomarker panel for cancer detection
AU2016223532B2 (en) * 2015-02-24 2021-07-01 Ruprecht-Karls-Universitat Heidelberg Biomarker panel for the detection of cancer
WO2016135168A1 (en) * 2015-02-24 2016-09-01 Ruprecht-Karls-Universität Heidelberg Biomarker panel for the detection of cancer
US10941451B2 (en) 2015-02-24 2021-03-09 Ruprecht-Karls-Universitat Heidelberg Biomarker panel for the detection of cancer
US11062229B1 (en) * 2016-02-18 2021-07-13 Deepmind Technologies Limited Training latent variable machine learning models using multi-sample objectives
US11746380B2 (en) 2016-10-05 2023-09-05 University Of East Anglia Classification and prognosis of cancer
CN107391433A (en) * 2017-06-30 2017-11-24 天津大学 A kind of feature selection approach based on composite character KDE conditional entropies
CN109248542A (en) * 2017-07-12 2019-01-22 中国石油化工股份有限公司 The optimal adsorption time of pressure-swing absorption apparatus determines method and system
CN109248542B (en) * 2017-07-12 2021-03-05 中国石油化工股份有限公司 Method and system for determining optimal adsorption time of pressure swing adsorption device
CN108763873A (en) * 2018-05-28 2018-11-06 苏州大学 A kind of gene sorting method and relevant device
CN109599175A (en) * 2018-12-04 2019-04-09 中山大学孙逸仙纪念医院 A kind of analysis of joint destroys the device and method of progress probability
CN109735622A (en) * 2019-03-07 2019-05-10 天津市第三中心医院 LncRNA relevant to colorectal cancer and its application
CN110070916A (en) * 2019-04-29 2019-07-30 安徽大学 A kind of Cancerous disease gene expression characteristics selection method based on historical data
CN110070916B (en) * 2019-04-29 2023-04-18 安徽大学 Historical data-based cancer disease gene characteristic selection method
CN110489660B (en) * 2019-07-22 2020-12-18 武汉大学 User economic condition portrait method of social media public data
CN110489660A (en) * 2019-07-22 2019-11-22 武汉大学 A kind of user's economic situation portrait method of social media public data
CN113177604A (en) * 2021-05-14 2021-07-27 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering
CN113177604B (en) * 2021-05-14 2024-04-16 东北大学 High-dimensional data feature selection method based on improved L1 regularization and clustering

Also Published As

Publication number Publication date
WO2009067655A3 (en) 2009-09-03

Similar Documents

Publication Publication Date Title
WO2009067655A2 (en) Methods of feature selection through local learning; breast and prostate cancer prognostic markers
US11636288B2 (en) Platform, device and process for annotation and classification of tissue specimens using convolutional neural network
CA2388595C (en) Methods and devices for identifying patterns in biological systems and methods for uses thereof
US6789069B1 (en) Method for enhancing knowledge discovered from biological data using a learning machine
US6760715B1 (en) Enhancing biological knowledge discovery using multiples support vector machines
US6714925B1 (en) System for identifying patterns in biological data using a distributed network
Henderson et al. A molecular map of mesenchymal tumors
US7117188B2 (en) Methods of identifying patterns in biological systems and uses thereof
McDermott et al. Challenges in biomarker discovery: combining expert insights with statistical analysis of complex omics data
Feng et al. Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective
Simon Development and validation of biomarker classifiers for treatment selection
AU2003214724B2 (en) Medical applications of adaptive learning systems using gene expression data
CA2435254C (en) Methods of identifying patterns in biological systems and uses thereof
AU2002253879A1 (en) Methods of identifying patterns in biological systems and uses thereof
EP2406729A1 (en) A method for the systematic evaluation of the prognostic properties of gene pairs for medical conditions, and certain gene pairs identified
US20090319450A1 (en) Protein search method and device
Zhang et al. LSCDFS-MKL: a multiple kernel based method for lung squamous cell carcinomas disease-free survival prediction with pathological and genomic data
Sakellariou et al. Investigating the minimum required number of genes for the classification of neuromuscular disease microarray data
Reyes et al. A supervised methodology for analyzing dysregulation in splicing machinery: an application in cancer diagnosis
Grate et al. Integrated analysis of transcript profiling and protein sequence data
Maalej et al. Risk Factors of Breast Cancer Determination: a Comparative Study on Different Feature Selection Techniques
WO2023230617A2 (en) Bladder cancer biomarkers and methods of use
Ram et al. Ensembling Model Approach for Prediction of Pancreatic Cancer Using a Biomarker Panel and Multi-Model Classifiers
WO2024079279A1 (en) Disease characterisation
Olman et al. Gene expression data analysis in subtypes of ovarian cancer using covariance analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08851210

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08851210

Country of ref document: EP

Kind code of ref document: A2