WO2015066564A1 - Methods of identifying and diagnosing lung diseases using classification systems and kits thereof - Google Patents

Methods of identifying and diagnosing lung diseases using classification systems and kits thereof

Publication number
WO2015066564A1
Authority
WO
WIPO (PCT)
Prior art keywords
biomarkers
biomarker
subject
classification
human
Prior art date
Application number
PCT/US2014/063594
Other languages
English (en)
Inventor
Christopher P. SZUSTKIEWICZ
Cherylle GOEBEL
Thomas C. Long
Chris LOUDEN
Joel MICHALEK
Original Assignee
Cancer Prevention And Cure, Ltd.
Priority date
Filing date
Publication date
Application filed by Cancer Prevention And Cure, Ltd.
Publication of WO2015066564A1

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48: Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50: Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53: Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574: Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407: Specifically defined cancers
    • G01N33/57423: Specifically defined cancers of lung
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention relates to the detection, identification, and diagnosis of lung disease using biomarkers and kits thereof, as well as systems that assist in determining the likelihood of the presence or absence of a disease based on the biomarkers. More specifically, the invention relates to the diagnosis of non-small cell lung cancers (NSCLC) and reactive airway diseases by measuring expression levels of specific biomarkers and inputting these measurements into a classification system such as a support vector machine.
  • NSCLC non-small cell lung cancers
  • TRISS Trauma Revised Injury Severity Score
  • a logistic discrimination model is a logistic regression model that transforms the predicted probabilities to group labels.
  • the logistic regression model is based on the assumption that the effect of each covariate is linear with respect to the log-odds of the event (Harrell, Frank. Regression Modeling Strategies. New York : Springer, 2001, p. 217). From the point of view of classification, linearity of each covariate with respect to the log odds of the event may be sufficient to achieve a high accuracy, even in the test set; a violation of this assumption, however, could cause the model to grossly misestimate the effect and therefore result in poor performance.
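The two preceding statements can be illustrated with a minimal sketch, assuming hypothetical coefficients and covariate values that do not come from the patent: the log-odds are a linear function of the covariates, the logistic function turns them into a predicted probability, and a threshold (0.5 here, also an assumption) turns that probability into a group label.

```python
import math


def predicted_probability(intercept, coefficients, covariates):
    """Logistic regression: the log-odds are linear in each covariate."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-log_odds))


def discriminate(probability, threshold=0.5):
    """Logistic discrimination: map a predicted probability to a group label."""
    return "event" if probability >= threshold else "no event"


# Hypothetical intercept, coefficients and covariates, for illustration only.
p = predicted_probability(-1.0, [0.8, -0.3], [2.0, 1.0])
label = discriminate(p)
print(round(p, 4), label)
```

If the true effect of a covariate is not linear on the log-odds scale, the linear term above is the part of the model that would be misspecified.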
  • kernels for determining the similarity of a pair of patterns.
  • These kernels are usually defined for patterns that can be represented as a vector of real numbers.
  • the linear kernel, radial basis kernel and polynomial kernel all measure the similarity of a pair of real vectors.
  • Such kernels are appropriate when the data can best be represented in this way, as a sequence of real numbers.
  • the choice of kernel corresponds to the choice of representation of the data in the feature space.
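As an illustration of the three kernels named above, the following sketch computes each one for a pair of real vectors using their standard textbook definitions. The parameter names and default values (`degree`, `coef0`, `gamma`) are assumptions chosen for the example, not values taken from the patent.

```python
import math


def linear_kernel(x, y):
    """Inner product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(<x, y> + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); equals 1.0 when x == y."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

Each function returns a single similarity score, so swapping one kernel for another changes the implied feature space without changing the rest of a kernel-based learner.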
  • the patterns have a greater degree of structure. These structures can be exploited to improve the performance of the learning algorithm.
  • Examples of the types of structured data that commonly occur in machine learning applications are strings, documents, trees, graphs, such as websites or chemical molecules, signals, such as microarray expression profiles, spectra, images, spatio-temporal data, relational data and biochemical concentrations, amongst others.
  • the present invention addresses these needs by providing a classification system that uses robust methods of evaluating certain biomarkers in a subject using various classifiers such as support vector machines, AdaBoost, penalized logistic regression, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, and/or any combination of the above.
  • the inventors have developed a method of physiological characterization, based in part on a classification according to this invention, in a subject comprising first obtaining a physiological sample of the subject; then determining biomarker measures of a plurality of biomarkers in that sample; and finally classifying the sample based on the biomarker measures using a classification system, where the classification of the sample correlates to a physiologic state or condition, or changes in a disease state in the subject.
  • the classification system includes a machine learning system, such as a kernel-based or classification and regression tree-based classification system.
  • a machine learning system may include classifiers including, but not limited to, a support vector machine (SVM), AdaBoost, penalized logistic regression, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, and/or any combination of the above.
  • the inventors have also provided a method of classifying test data which comprises a plurality of biomarker measures of each of a set of biomarkers, the method comprising steps of receiving test data comprising a plurality of biomarker measures for the set of biomarkers in a mammalian test subject; then evaluating the test data using an electronic representation of a support vector machine that has been trained using an electronically stored set of training data vectors, each training data vector representing an individual mammal and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective mammal, each training data vector further comprising a classification with respect to a disease state of the respective mammal; and finally outputting a classification of the mammalian test subject based on the evaluating step.
  • the mammalian test subject is human.
  • the step of evaluating comprises accessing the electronically stored set of training data vectors.
  • the inventors have provided a method of training a support vector machine to produce a model for classification of test data comprising a plurality of biomarker measures of each of a set of biomarkers, the method comprising steps of accessing an electronically stored set of training data vectors, each training data vector representing an individual human and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to a disease state of the respective human and using the electronically stored set of training data vectors to train an electronic representation of a classification system including the classifiers, alone or in combination, where each classifier is trained using the same set of training data vectors.
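The receive, evaluate and output steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text describes a support vector machine, whereas this sketch substitutes a k-nearest neighbor classifier (another of the classifiers the document lists) so that it runs with the Python standard library alone. The training vectors, labels and the value of `k` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stored training data vectors:
# (biomarker measures, classification with respect to a disease state).
TRAINING_VECTORS = [
    ([1.0, 0.9], "negative"),
    ([1.2, 1.1], "negative"),
    ([0.8, 1.0], "negative"),
    ([3.0, 3.2], "positive"),
    ([3.1, 2.9], "positive"),
    ([2.8, 3.0], "positive"),
]


def classify(test_vector, training=TRAINING_VECTORS, k=3):
    """Label a test vector by majority vote of its k nearest training vectors."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], test_vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify([2.9, 3.1]))  # test data for one hypothetical subject
```

A k-nearest neighbor classifier consults the stored training vectors at evaluation time, which mirrors the statement that the evaluating step may comprise accessing the electronically stored set of training data vectors.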
  • the inventors' methods classify test subjects with respect to the presence or absence of a disease state, which preferably is a lung disease, more preferably is either non-small cell lung cancer or a reactive airway disease, such as asthma.
  • the biomarker measures may comprise plasma concentration measures of at least one protein selected from the biomarkers described in the Examples.
  • the biomarker measures comprise plasma concentrations of at least four distinct biomarkers or alternatively the biomarker measures may comprise plasma concentrations of at least six distinct biomarkers, at least ten distinct biomarkers, at least twelve distinct biomarkers, at least fifteen distinct biomarkers, at least eighteen distinct biomarkers, at least twenty distinct biomarkers, or at least twenty-five distinct biomarkers.
  • the biomarker measures are typically arranged in a vector for each subject for whom the biomarker measures are obtained.
  • each vector may include other information associated with the subject, including sex, age, smoking history, measures for additional biomarkers, other features of the subject's health history, and the like.
  • the set of training vectors may comprise at least 30 vectors, at least 50 vectors, or at least 100 vectors.
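One way to picture the per-subject vector described above is the sketch below. The class name, field names and the `is_usable_training_set` helper are hypothetical; the thresholds of 30 vectors and four biomarkers are the smallest of the minimums quoted in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubjectVector:
    """Biomarker measures for one subject plus optional associated information."""
    biomarker_measures: List[float]
    sex: Optional[str] = None
    age: Optional[int] = None
    smoking_history: Optional[str] = None
    disease_state: Optional[str] = None  # set for training vectors only


def is_usable_training_set(vectors, min_vectors=30, min_biomarkers=4):
    """Check set size, vector width, and that every vector carries a label."""
    if len(vectors) < min_vectors:
        return False
    width = len(vectors[0].biomarker_measures)
    if width < min_biomarkers:
        return False
    return all(
        len(v.biomarker_measures) == width and v.disease_state is not None
        for v in vectors
    )


# Thirty hypothetical training vectors with four biomarker measures each.
training_set = [
    SubjectVector([1.0, 2.0, 3.0, 4.0], sex="F", age=60, smoking_history="former",
                  disease_state="positive" if i % 2 else "negative")
    for i in range(30)
]
print(is_usable_training_set(training_set))
```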
  • the inventors have also provided a system for classifying test data comprising a plurality of biomarker measures of each of a set of biomarkers, where the system comprises a computer, the computer comprising an electronic representation of a support vector machine, AdaBoost, penalized logistic regression, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, and/or any combination thereof, which may be trained using an electronically stored set of training data vectors, each training data vector representing an individual human and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to a disease state of the respective human, the electronically stored set of training data vectors being operatively coupled to the computer, and the computer also being configured to receive test data comprising a plurality of biomarker measures for the set of biomarkers in a human test subject, and the computer further being configured to evaluate the test data using electronic representation of a support vector machine, AdaBoost
  • the inventors have also provided a system for classifying test data comprising a biomarker measure of each of a set of biomarkers, where the system comprises a computer which in turn comprises an electronic representation of a support vector machine, AdaBoost, penalized logistic regression, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, and/or any combination thereof, trained to classify test data with respect to a disease state of the test subject, the training based on an electronically stored set of training data vectors, each training data vector representing an individual human and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to a disease state of the respective human; the computer configured to receive test data comprising a plurality of biomarker measures for the set of biomarkers in a human test subject, the computer further configured to evaluate the test data using the trained electronic representation of a support vector machine, AdaBoost, penalized logistic regression,
  • the computer in any embodiment of the system may be further configured to select the set of biomarkers from a superset of biomarkers using logic configured to (a) for each biomarker in the superset of biomarkers, calculate a distance between a marginal distribution of two groups of concentration measures for each biomarker, whereby a plurality of distances are generated; (b) order the biomarkers in the superset of biomarkers according to the distances, whereby an ordered set of biomarkers is generated; (c) for each of a plurality of initial segments of the ordered set of biomarkers, calculate a measure of model fit based on the training data; (d) select an initial segment of the ordered set of biomarkers according to a maximum measure of model fit, such that a preferred initial segment of the ordered set of biomarkers is selected; (e) starting with the null set of biomarkers, recursively add additional biomarkers from the preferred initial segment of the ordered set of biomarkers to generate the subset of biomarkers, where each additional biomarker is
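Steps (a) and (b) of the selection logic above can be sketched as follows. The text does not say which distance between marginal distributions is used, so the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) is an assumption here, and the marker names and concentration values are hypothetical. Steps (c) onward depend on a measure of model fit and are not covered.

```python
def ks_distance(group_a, group_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)

    points = sorted(set(group_a) | set(group_b))
    return max(abs(ecdf(group_a, t) - ecdf(group_b, t)) for t in points)


def order_biomarkers(measures_by_group):
    """Step (a): one distance per biomarker. Step (b): order by distance."""
    distances = {
        name: ks_distance(a, b) for name, (a, b) in measures_by_group.items()
    }
    ordered = sorted(distances, key=distances.get, reverse=True)
    return ordered, distances


# Hypothetical concentration measures for two groups of subjects.
superset = {
    "marker_A": ([1, 2, 3, 4], [5, 6, 7, 8]),
    "marker_B": ([1, 2, 3, 4], [2, 3, 4, 5]),
    "marker_C": ([1, 2, 3, 4], [3, 4, 5, 6]),
}
ordered, distances = order_biomarkers(superset)
print(ordered, distances)
```

The initial segments referred to in step (c) would then be the prefixes of `ordered`, for example `ordered[:1]`, `ordered[:2]`, and so on.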
  • the inventors' improvements which make up the present invention include a method of physiological characterization, based in part on a classification according to this invention, in a subject comprising (a) receiving, on at least one processor, biomarker measures of a plurality of biomarkers in a physiological sample of the subject; and (b) classifying the sample based on the biomarker measures, using a classification system and the at least one processor, where the sample is classified to indicate the likelihood of presence or development of non-small cell lung cancer (NSCLC) in the subject.
  • the classification system is a machine learning system that includes the classifiers, alone or in combination.
  • a classification system may include support vector machine(s), AdaBoost, penalized logistic regression, naive Bayes classifier(s), neural net(s), k-nearest neighbor classifier(s), random forests, and/or any combination thereof.
  • a support vector machine may be included in a kernel-based classification system.
  • AdaBoost may be included in a classification and regression tree system.
  • the present invention provides a method of characterizing, based in part on classifying test data, the test data comprising a plurality of biomarker measures of each of a set of biomarkers, the method comprising: (a) accessing, using at least one processor, an electronically stored set of training data vectors, each training data vector representing an individual human and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to the presence or absence of diagnosed NSCLC in the respective human; (b) training an electronic representation of support vector machine(s), AdaBoost classifier(s), penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination thereof, using the electronically stored set of training data vectors; (c) receiving, at the at least one processor, test data comprising a plurality of biomarker measures for the set of
  • the set of biomarkers used in the invention includes at least five biomarkers selected from the preferred biomarker group consisting of MCSF, Eotaxin-3, CTACK, TNF-b, FGF-basic, Survivin, TNFR1, BDNF, MIG, MIF, IL-23p19, IL12-p40, Leptin, IL-2ra and IL-8, more preferably at least 7 biomarkers from the preferred group, even more preferably at least 10, and even more preferably all of the preferred group of biomarkers.
  • the set of biomarkers comprises no more than 20 biomarkers.
  • test data and each training data vector further comprises at least one additional characteristic selected from the group consisting of the sex, age and smoking status of the individual human.
  • classifiers, including for example support vector machines, AdaBoost, penalized logistic regression, naive Bayes classifiers, neural nets, k-nearest neighbor classifiers, random forests, and/or any combination thereof, of the present invention comprise one or more kernel functions selected from linear kernels, radial basis kernels, polynomial kernels, uniform kernels, triangle kernels, Epanechnikov kernels, quartic (biweight) kernels, tricube (triweight) kernels, and cosine kernels, and comprise at least 5, 10, 15, 20, 25, 30, 40, 50, 75 or 100 iterations.
  • the present invention includes obtaining replicate samples from a human subject, where the preferred number of replicates is at least two or at least three, determining biomarker measures for each of a plurality of biomarkers in each sample, and determining a classification for each replicate sample using the classifiers, such as support vector machine(s), AdaBoost classifier(s), penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination of the above, wherein the human subject is considered positive for NSCLC if any one of the replicate samples is classified positive for NSCLC by a classifier, such as the support vector machine(s), AdaBoost classifier(s), penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination of the above.
  • a human subject is considered positive for NSCLC if any of the replicate samples from the subject is classified positive by any one, any two, any three, any four, any five, any six, any seven, or any eight classifiers (up to all classifiers).
  • a subject may be considered positive if multiple replicates for a single classifier (e.g., all replicates for each classifier, two or more replicates for a single classifier, three replicates for a single classifier, etc.) or if multiple replicates across all classifiers used (e.g., two replicates across the number of classifiers used in an ensemble of classifiers, three replicates across the number of classifiers used in an ensemble of classifiers, four replicates across the number of classifiers used in an ensemble of classifiers, etc.) are classified as positive.
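The first decision rule above (a subject is positive if any replicate sample is classified positive by at least a chosen number of classifiers) can be sketched as follows. The function name, the `min_classifiers` parameter and the example calls are hypothetical; the rules based on multiple replicates are not covered by this sketch.

```python
def subject_positive(calls, min_classifiers=1):
    """Return True if any replicate is called positive by enough classifiers.

    calls maps each classifier name to a list of booleans, one per replicate,
    with replicates in the same order for every classifier.
    """
    n_replicates = len(next(iter(calls.values())))
    for replicate in range(n_replicates):
        positives = sum(
            1 for per_replicate in calls.values() if per_replicate[replicate]
        )
        if positives >= min_classifiers:
            return True
    return False


# Hypothetical calls for three replicate samples from one subject.
calls = {
    "svm": [False, True, False],
    "adaboost": [False, False, False],
    "random_forest": [False, True, False],
}
print(subject_positive(calls, min_classifiers=2))
```

Raising `min_classifiers` trades sensitivity for specificity: with the calls above the subject is positive when one or two classifiers are required, and negative when all three must agree.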
  • At least one, more preferably two or more of, accuracy, sensitivity, positive predictive value and negative predictive value is above 0.9. In preferred modes of any embodiment(s) described herein, at least one of, more preferably two or more of, accuracy, sensitivity, specificity, positive predictive value and negative predictive value is above 0.95. In preferred modes of any embodiment(s) described herein, at least one of, more preferably two or more of, accuracy, sensitivity, specificity, positive predictive value and negative predictive value is above 0.98.
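For reference, the five performance measures named above can be computed from the counts of true and false positives and negatives using their standard definitions. The counts in the example are hypothetical and are not results from the patent.

```python
def performance(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }


# Hypothetical counts for 100 classified subjects.
measures = performance(tp=48, fp=2, tn=47, fn=3)
above_threshold = [name for name, value in measures.items() if value > 0.9]
print(measures, above_threshold)
```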
  • the human subject which is the source of the sample in any embodiment is "at-risk" for NSCLC.
  • the invention also contemplates that a human subject diagnosed with NSCLC based, in part, on the output according to any embodiment of this invention, will be treated for NSCLC.
  • the present invention also includes a system for use in carrying out the methods of this invention, the system providing for characterizing, based in part on classifying test data, the test data comprising a plurality of biomarker measures of each of a set of biomarkers, the system comprising at least one processor coupled to electronic storage means comprising an electronic representation of a support vector machine(s), AdaBoost classifier(s), penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination thereof, where the electronic representation has been trained using an electronically stored set of training data vectors, each training data vector representing an individual human and comprising a biomarker measure of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to a disease state of the respective human, where the disease state is preferably the presence or the likelihood of development of NSCLC, and the at least one processor configured (a) to receive test
  • the present invention also contemplates a non-transitory computer-readable storage medium with an executable program stored thereon, where the program instructs a processor to perform the following steps: (a) receiving biomarker measures of a plurality of biomarkers in a physiological sample of the subject; and (b) classifying the sample based on the biomarker measures, using a classification system and the at least one processor, so that the classification of the sample is indicative of the likelihood of presence or development of non-small cell lung cancer (NSCLC) in the subject.
  • NSCLC non-small cell lung cancer
  • the embodiments of the present invention can be used in an enhanced method for screening a human subject to determine whether or not the human is likely to suffer from NSCLC, the enhancement comprising classifying test data from the human subject using the method according to any one of the embodiments of the invention, where the human subject is one who exhibits at least one lung nodule detectable by computerized tomography scan.
  • An alternative use for the embodiments of the present invention provides another enhanced method for screening a human subject to determine whether or not the human is likely to suffer from NSCLC, where a human subject classified positive for NSCLC using the method of this invention is further tested for lung nodules by low-dose computerized tomography.
  • the methods and systems provided herein are capable of diagnosing and predicting lung pathologies (e.g., cancerous, asthmatic) typically with over 90% accuracy (e.g., total correct over total tested). These results provide a significant advancement over currently available methods for diagnosing and predicting lung pathologies such as non-small cell lung cancer.
  • FIGURE 1A shows the average fluorescence intensity level of the biomarkers in the normal (NO) population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 1B shows the average fluorescence intensity level of the biomarkers in the non-small cell lung cancer (LC) population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 1C shows the average fluorescence intensity level of the biomarkers in the asthma (AST) population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 1D shows the percent change in the mean of fluorescence intensity for each of the biomarkers in the AST population v. NO population, LC population v. NO population, and the AST population v. LC population from Example 1.
  • FIGURE 2A shows the average fluorescence intensity level of the biomarkers in the normal (NO) female population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 2B shows the average fluorescence intensity level of the biomarkers in the non-small cell lung cancer (LC) female population from Example 1, as well as the standard deviation and relative standard deviation.
  • LC non-small cell lung cancer
  • FIGURE 2C shows the average fluorescence intensity level of the biomarkers in the asthma (AST) female population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 3B shows the average fluorescence intensity level of the biomarkers in the non-small cell lung cancer (LC) male population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 3C shows the average fluorescence intensity level of the biomarkers in the asthma (AST) male population from Example 1, as well as the standard deviation and relative standard deviation.
  • FIGURE 4 shows the percent change in the mean of fluorescence intensity for each of the biomarkers in the AST male population compared to the AST female population, the LC male population compared to the LC female population, and the NO male population compared to the NO female population from Example 1.
  • FIGURE 5 shows the relationship of various molecules to HGF (Hepatocyte Growth Factor).
  • This figure was generated by the ARIADNE PATHWAY STUDIO®.
  • FIGURE 6 shows ROC Curve for AdaBoost.
  • FIGURE 9 shows ROC Curve for AdaBoost with restriction to females.
  • FIGURE 10 shows a variable selection plot based on the AdaBoost model.
  • FIGURE 11 shows a variable selection plot based on the AdaBoost model for males.
  • FIGURE 12 shows a variable selection plot based on the AdaBoost model for females.
  • FIGURE 13 shows distribution of accuracy of the AdaBoost Model.
  • FIGURE 15 shows distribution of specificity of the AdaBoost Model.
  • the invention relates to various methods of detection, identification, and diagnosis of lung disease using biomarkers. These methods involve determining biomarker measures of specific biomarkers and using these biomarker measures in a classification system to determine the likelihood that an individual has non-small cell lung cancer and/or reactive airway disease (e.g., asthma, chronic obstructive pulmonary disease, etc.).
  • the invention also provides for kits comprising detection agents for detecting these biomarkers, or means for determining the biomarker measures of these biomarkers, as components of systems for assisting in determining the likelihood of lung disease.
  • biomarkers were identified by measuring the expression levels of fifty-nine selected biomarkers in the plasma of patients from populations who had been diagnosed with non-small cell lung cancers or asthma, as well as patients who had not been diagnosed with non-small cell lung cancers and/or asthma, as confirmed by a physician. This method is detailed in Example 1. Biomarkers were identified by measuring the expression levels of eighty-six selected biomarkers in the plasma of patients from populations who had been diagnosed with non-small cell lung cancers or asthma, as well as patients who had not been diagnosed with non-small cell lung cancers and/or asthma, as confirmed by a physician. This method is detailed in Example 10.
  • biomarkers were identified by measuring the expression levels of 104 selected biomarkers in the plasma of patients from populations who had been diagnosed with non-small cell lung cancers or asthma, as well as patients who had not been diagnosed with non-small cell lung cancers and/or asthma, as confirmed by a physician. This method is detailed in Example 17.
  • a “biomarker” or “marker” is a biological molecule that can be objectively measured as a characteristic indicator of the physiological status of a biological system.
  • biological molecules include ions, small molecules, peptides, proteins, peptides and proteins bearing post-translational modifications, nucleosides, nucleotides and polynucleotides including RNA and DNA, glycoproteins, lipoproteins, as well as various covalent and non-covalent modifications of these types of molecules.
  • Biological molecules include any of these entities native to, characteristic of, and/or essential to the function of a biological system.
  • the majority of biomarkers are polypeptides, although they may also be mRNA or modified mRNA which represents the pre-translation form of a gene product expressed as the polypeptide, or they may include post-translational modifications of the polypeptide.
  • a “biomarker measure” is information relating to a biomarker that is useful for characterizing the presence or absence of a disease. Such information may include measured values which are, or are proportional to, concentration, or that otherwise provide qualitative or quantitative indications of expression of the biomarker in tissues or biologic fluids.
  • Each biomarker can be represented as a dimension in a vector space, where each vector is a multidimensional vector in the vector space and includes a plurality of biomarker measures associated with a particular subject.
  • classifier is a machine learning algorithm such as support vector machine(s), AdaBoost classifier(s), penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), and random forests.
  • This invention contemplates methods using any of the listed classifiers, as well as use of more than one of the classifiers in combination.
  • classification system is a machine learning system executing at least one classifier.
  • subset is a proper subset
  • superset is a proper superset
  • a “subject” means any animal, but is preferably a mammal, such as, for example, a human. In many embodiments, the subject will be a human patient having, or at-risk of having, a lung disease.
  • a “physiological sample” includes samples from biological fluids and tissues.
  • Biological fluids include whole blood, blood plasma, blood serum, sputum, urine, sweat, lymph, and alveolar lavage.
  • Tissue samples include biopsies from solid lung tissue or other solid tissues, lymph node biopsy tissues, and biopsies of metastatic foci. Methods of obtaining physiological samples are well known.
  • One type of interacting molecule is commonly known as a receptor. Such receptors bind ligands, which are also interacting molecules. Another type of direct intermolecular interaction is the binding of a co-factor or an allosteric effector to an enzyme. These intermolecular interactions form networks of signaling molecules that work together to carry out and control the essential life functions of cells and organisms.
  • Each of these interacting molecules is a biomarker within the terminology of this invention.
  • the particular biomarkers of this invention are linked physiologically to other biomarkers whose level increases or decreases in a fashion coordinated with the level of particular biomarkers. These other linked biomarkers are called "first order interactors" with respect to the particular biomarkers of the invention.
  • First order interactors are those molecular entities that interact directly with a particular biological molecule. For instance, the drug morphine interacts directly with opiate receptors, resulting ultimately in the diminishment of the sensation of pain. Thus, the opiate receptors are first order interactors under the definition of “first order interactor.”
  • First order interactors include both upstream and downstream direct neighbors for said biomarkers in the communication pathways through which they interact. These entities encompass proteins, nucleic acids and small molecules which may be connected by relationships that include but are not limited to direct (or indirect) regulation, expression, chemical reaction, molecular synthesis, binding, promoter binding, protein modification and molecular transport.
  • biomarkers whose levels are coordinated are well known to those skilled in the art and those knowledgeable in physiology and cellular biology. Indeed, first order interactors for a particular biomarker are known in the art and can be found using various databases and available bioinformatics software such as ARIADNE PATHWAY STUDIO®, ExPASY Proteomics Server, Qlucore Omics Explorer, Protein Prospector, PQuad, ChEMBL, and others (see, e.g., ARIADNE PATHWAY STUDIO®, Ariadne, Inc., <www.ariadne.genomics.com> or ChEMBL Database, European Bioinformatics Institute, European Molecular Biology Laboratory, <www.ebi.ac.uk>).
  • First order interactor biomarkers are those whose expression level is coordinated with another biomarker. Therefore, information regarding levels of a particular biomarker (a "biomarker measure") may be derived from measuring the level of a first order interactor coordinated with that particular biomarker. The skilled person will of course confirm that the level of a first order interactor which is used in lieu of, or in addition to, a particular biomarker will vary in a defined and reproducible way consistent with the behavior of the particular biomarker.
  • a biomarker measure is information that generally relates to a quantitative measurement of an expression product, which is typically a protein or polypeptide.
  • the invention contemplates determining the biomarker measure at the RNA (pre-translational) or protein level (which may include post-translational modification).
  • the invention contemplates determining changes in biomarker concentrations reflected in an increase or decrease in the level of transcription, translation, post-transcriptional modification, or the extent or degree of degradation of protein, where these changes are associated with a particular disease state or disease progression.
  • disease may be characterized by a pattern of expression of a plurality of markers.
  • the determination of expression levels for a plurality of biomarkers facilitates the observation of a pattern of expression, and such patterns provide for more sensitive and more accurate diagnoses than detection of individual biomarkers.
  • a pattern may comprise abnormal elevation of some particular biomarkers simultaneously with abnormal reduction in other particular biomarkers.
  • physiological samples are collected from subjects in a manner which ensures that the biomarker measure in the sample is proportional to the concentration of that biomarker in the subject from which the sample is collected. Measurements are made so that the measured value is proportional to the concentration of the biomarker in the sample. Selecting sampling techniques and measurement techniques which meet these requirements is within ordinary skill of the art.
  • biomarker measures are known in the art for individual biomarkers. See Instrumental Methods of Analysis, Seventh Edition, 1988. Such determination may be performed in a multiplex or matrix-based format such as a multiplexed immunoassay.
  • Means for such determination include, but are not limited to, radioimmunoassay, enzyme-linked immunosorbent assay (ELISA), Q-Plex™ Multiplex Assays, liquid chromatography-mass spectrometry (LCMS), flow cytometry multiplex immunoassay, high pressure liquid chromatography with radiometric or spectrometric detection via absorbance of visible or ultraviolet light, mass spectrometric qualitative and quantitative analysis, western blotting, 1 or 2 dimensional gel electrophoresis with quantitative visualization by means of detection of radioactive, fluorescent or chemiluminescent probes or nuclei, antibody-based detection with absorptive or fluorescent photometry, quantitation by luminescence of any of a number of chemiluminescent reporter systems, enzymatic assays, immunoprecipitation or immuno-capture assays, solid and liquid phase immunoassays, protein arrays or chips, DNA arrays or chips, plate assays,
  • the step of determining biomarker measures may be performed by any means known in the art, especially those means discussed herein.
  • the step of determining biomarker measures comprises performing immunoassays with antibodies.
  • the antibody chosen is preferably selective for an antigen of interest (i.e., selective for the particular biomarker), possesses a high binding specificity for said antigen, and has minimal cross-reactivity with other antigens.
  • the ability of an antibody to bind to an antigen of interest may be determined, for example, by known methods such as enzyme-linked immunosorbent assay (ELISA), flow cytometry, and immunohistochemistry.
  • ELISA enzyme-linked immunosorbent assay
  • the antibody should have a relatively high binding specificity for the antigen of interest.
  • the binding specificity of the antibody may be determined by known methods such as immunoprecipitation or by an in vitro binding assay, such as radioimmunoassay (RIA) or ELISA. Methods for selecting antibodies capable of binding antigens of interest with high binding specificity and minimal cross-reactivity are disclosed, for example, in U.S. Pat. No. 7,288,249, which is hereby incorporated by reference in its entirety.
  • a single molecule array format may be used.
  • single protein molecules are captured and labeled on beads using standard immunosorbent assay reagents.
  • Thousands of beads are mixed with enzyme substrate and loaded into individual femtoliter-sized wells, and sealed with oil.
  • the fluorophore concentration of each bead is digitally counted to determine whether or not it is bound to the target analyte. Such methods are disclosed, for example, in U.S. Pat. No. 8,236,574, which is hereby incorporated by reference in its entirety.
  • Exemplary methods include using classifiers such as support vector machines, AdaBoost, penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination thereof.
  • the invention relates to, among other things, predicting lung pathologies as cancerous or asthmatic based on multiple, continuously distributed biomarkers.
  • classifiers (e.g., support vector machines, AdaBoost, penalized logistic regression, regression tree system(s), naive Bayes classifier(s), neural nets, k-nearest neighbor classifier(s), random forests, and/or any combination thereof).
  • prediction may be a multi-step process (e.g., a two-step process, a three-step process, etc.).
  • the classification systems described may include computer executable software, firmware, hardware, or various combinations thereof.
  • the classification systems may include reference to a processor and supporting data storage.
  • the classification systems may be implemented across multiple devices or other components local or remote to one another.
  • the classification systems may be implemented in a centralized system, or as a distributed system for additional scalability.
  • any reference to software may include non-transitory computer readable media that, when executed on a computer, cause the computer to perform a series of steps.
  • the classification systems described herein may include data storage such as network accessible storage, local storage, remote storage, or a combination thereof.
  • Data storage may utilize a redundant array of inexpensive disks (“RAID”), tape, disk, a storage area network (“SAN"), an internet small computer systems interface (“iSCSI”) SAN, a Fibre Channel SAN, a common Internet File System (“CIFS”), network attached storage (“NAS”), a network file system (“NFS”), or other computer accessible storage.
  • data storage may be a database, such as an Oracle database, a Microsoft SQL Server database, a DB2 database, a MySQL database, a Sybase database, an object oriented database, a hierarchical database, or other database.
  • Data storage may utilize flat file structures for storage of data.
  • a classifier is used to describe a pre-determined set of data. This is the
  • the training database is a computer-implemented store of data reflecting a plurality of biomarker measures for a plurality of humans in association with a classification with respect to a disease state of each respective human.
  • the format of the stored data may be as a flat file, database, table, or any other retrievable data storage format known in the art.
  • the test data is stored as a plurality of vectors, each vector corresponding to an individual human, each vector including a plurality of biomarker measures for a plurality of biomarkers together with a classification with respect to a disease state of the human.
  • each vector contains an entry for each biomarker measure in the plurality of biomarker measures.
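One way such vectors might be held in memory is sketched below. This is only an illustration: the class name `SubjectVector`, its field names, the helper `to_matrix`, and all numeric values are assumptions introduced here, not part of the source. The biomarker names are taken from the list given elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SubjectVector:
    """One human subject: biomarker measures plus a disease-state label."""
    subject_id: str
    measures: Dict[str, float]  # biomarker name -> biomarker measure
    disease_state: str          # e.g. "NSCLC" or "normal"


def to_matrix(vectors: List[SubjectVector],
              biomarkers: List[str]) -> Tuple[List[List[float]], List[str]]:
    """Return (rows, labels) with one entry per biomarker, in a fixed order."""
    rows = [[v.measures[b] for b in biomarkers] for v in vectors]
    labels = [v.disease_state for v in vectors]
    return rows, labels


# Hypothetical values, for illustration only.
training = [
    SubjectVector("s1", {"CTACK": 1.2, "MCSF": 0.4}, "NSCLC"),
    SubjectVector("s2", {"CTACK": 0.3, "MCSF": 0.9}, "normal"),
]
X, y = to_matrix(training, ["CTACK", "MCSF"])
```

Fixing the biomarker order in `to_matrix` keeps every row aligned, so the same column always refers to the same biomarker in training and test data.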
  • the training database may be linked to a network, such as the internet, such that its contents may be retrieved remotely by authorized entities (e.g., human users or computer programs). Alternatively, the training database may be located in a network-isolated computer.
  • classifiers such as support vector machines, AdaBoost, decision trees, Bayesian classifiers, Bayesian belief networks, naive Bayes classifiers, k-nearest neighbor classifiers, case-based reasoning, penalized logistic regression, neural nets, random forests, and/or any combination thereof (See, e.g., Han J & Kamber M, 2006, Chapter 6, Data Mining Concepts and Techniques, 2nd Ed. Elsevier: Amsterdam.). As described herein, any classifier and/or combination of classifiers may be used in a classification system.
  • classifiers such as support vector machines, penalized logistic regression, naive Bayes classifiers, classification trees, k-nearest neighbor classifiers, neural nets, random forests, AdaBoost, etc. may be used to classify the data.
  • the data may be used to train a classifier.
  • Support vector machines are known in the art. For example, methods of diagnosing and predicting the occurrence of a medical condition have been proposed using support vector machines. See, e.g., U.S. Patent Nos. 7,505,948; 7,617,163; and 7,676,442, which are hereby incorporated by reference in their entirety.
  • a hyperplane is then selected by known SVM techniques such that the distance between the support vectors and the hyperplane is maximal within the bounds of a cost function that penalizes incorrect predictions.
  • This hyperplane is the one which optimally separates the data in terms of prediction (Vapnik, 1998, Statistical Learning Theory. New York: Wiley). Any new observation is then classified as belonging to one of the categories of interest, based on where the observation lies in relation to the hyperplane. When more than two categories are considered, the process is carried out pairwise for all of the categories and those results are combined to create a rule to discriminate between all the categories.
  • the model is fit pairwise between the groups (a series of sub-models) with each sub-model casting a vote for a particular group. The observation is determined to belong to the group with the most votes.
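The pairwise voting described above can be sketched as follows. The function name `one_vs_one_predict`, the group labels, and the threshold rules standing in for trained pairwise sub-models are all hypothetical; a real system would plug in trained support vector machines and would need an explicit tie-breaking rule.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(pairwise_models, groups, x):
    """Let each pairwise sub-model vote; return the group with the most votes.

    pairwise_models maps a pair (a, b) to a function that returns a or b for x.
    """
    votes = Counter()
    for a, b in combinations(groups, 2):
        votes[pairwise_models[(a, b)](x)] += 1
    # Ties fall to the group that received a vote first (an arbitrary choice).
    return votes.most_common(1)[0][0]


# Stand-in sub-models on a single hypothetical measurement.
groups = ["normal", "asthma", "NSCLC"]
models = {
    ("normal", "asthma"): lambda x: "asthma" if x > 1.0 else "normal",
    ("normal", "NSCLC"): lambda x: "NSCLC" if x > 2.0 else "normal",
    ("asthma", "NSCLC"): lambda x: "NSCLC" if x > 3.0 else "asthma",
}
```

With three groups there are three sub-models, so each observation receives three votes.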
  • V is a predetermined constant (the degrees of freedom).
  • Kernels polynomial Kernels, uniform Kernels, triangle Kernels, Epanechnikov Kernels, quartic (biweight) Kernels, tricube (triweight) Kernels, and cosine Kernels.
  • LASSO least absolute shrinkage and selection operator
  • the set of Bayes Classifiers are a set of classifiers based on Bayes' Theorem that
  • the L1-norm of a real-valued vector is defined as the sum of the absolute values of the elements of the vector.
  • the L2-norm of a real-valued vector is defined as the square root of the sum of the squares of the elements of the vector.
  • All classifiers of this type seek to find the probability that an observation belongs to a class given the data for that observation. The class with the highest probability is the one to which each new observation is assigned.
  • In theory, Bayes classifiers have the lowest error rates amongst the set of classifiers. In practice, this does not always occur due to violations of the assumptions made about the data when applying a Bayes classifier.
  • the naive Bayes classifier is one example of a Bayes classifier. It simplifies the calculations of the probabilities used in classification by assuming that the measurements (features) are independent of one another given the class.
  • Naive Bayes classifiers are used in many prominent anti-spam filters due to their ease of implementation and speed of classification, but have the drawback that the assumptions required are rarely met in practice.
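A minimal sketch of a naive Bayes classifier for continuous measurements follows. It assumes Gaussian class-conditional densities and treats the measurements as independent within each class; both are modelling assumptions made for this illustration, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance with a small floor to avoid division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Assign x to the class with the highest log posterior."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda c: log_posterior(*model[c]))


# Hypothetical two-biomarker training data.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = ["normal", "normal", "NSCLC", "NSCLC"]
nb_model = fit_gaussian_nb(X, y)
```

Working in log space avoids underflow when many per-feature probabilities are multiplied together.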
  • the distance can be calculated using any valid metric, though Euclidean and Mahalanobis distances are often used.
  • the group that has the highest count is the group to which the new observation is assigned.
  • Nearest neighbor algorithms have problems dealing with categorical data due to the requirement that a distance be calculated between two points but that can be overcome by defining a distance arbitrarily between any two groups. This class of algorithm is also sensitive to changes in scale and metric. With these issues in mind, nearest neighbor algorithms can be very powerful, especially in large data sets.
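The nearest neighbor rule described above can be sketched in a few lines. The function name `knn_predict`, the choice of k = 3, the Euclidean metric, and the sample points are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Assign x to the group with the highest count among its k nearest neighbors."""
    # math.dist computes the Euclidean distance between two points.
    neighbors = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training points in two dimensions.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["normal", "normal", "normal", "NSCLC", "NSCLC", "NSCLC"]
```

Because the result depends directly on distances, rescaling one measurement (for example, changing its units) can change which neighbors are nearest, which is the scale sensitivity noted above.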
  • the process of recursively splitting the data space creates a binary tree with a condition that is tested at each vertex.
  • a new observation is classified by following the branches of the tree until a leaf is reached.
  • a probability is assigned to the observation that it belongs to a given class.
  • the class with the highest probability is the one to which the new observation is classified.
  • Classification trees are essentially a decision tree whose attributes are framed in the language of statistics. They are highly flexible but very noisy (the variance of the error is large compared to other methods).
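Classifying a new observation with an already-built tree can be sketched as below. The `Node` layout, the thresholds, and the leaf probabilities are hypothetical; the sketch covers only traversal, not the recursive splitting that builds the tree.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    feature: Optional[int] = None   # index of the measurement tested at this vertex
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None  # branch taken otherwise
    class_probs: Optional[Dict[str, float]] = None  # set only on leaves


def classify(node, x):
    """Follow the branches until a leaf is reached; return (class, probability)."""
    while node.class_probs is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    label = max(node.class_probs, key=node.class_probs.get)
    return label, node.class_probs[label]


# A hand-built two-level tree with made-up numbers.
tree = Node(
    feature=0, threshold=1.0,
    left=Node(class_probs={"normal": 0.9, "NSCLC": 0.1}),
    right=Node(
        feature=1, threshold=2.0,
        left=Node(class_probs={"normal": 0.6, "NSCLC": 0.4}),
        right=Node(class_probs={"normal": 0.2, "NSCLC": 0.8}),
    ),
)
```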
  • the Mahalanobis distance is a metric that takes into account the covariance between variables in the observations.
  • Tools for implementing classification trees as discussed herein are available for the statistical software computing language and environment, R.
  • the R package "tree,” version 1.0-28 includes tools for creating, processing and utilizing classification trees.
  • AdaBoost adaptive boosting
  • AdaBoost provides a way to classify each of n subjects into two or more disease categories based on one k-dimensional vector (called a k-tuple) of measurements per subject.
  • AdaBoost takes a series of "weak" classifiers that have poor, though better than random, predictive performance and combines them to create a superior classifier.
  • a bootstrap sample is a sample drawn with replacement from the observed data with the same number of observations as the observed data.
  • AdaBoost technically works only when there are two categories to which the observation can belong. For g>2 categories, (g/2) models must be created that classify observations as belonging to a group or not. The results from these models can then be combined to predict the group membership of the particular observation.
  • the weak classifiers that AdaBoost uses are classification and regression trees (CARTs). CARTs recursively partition the dataspace into regions in which all new observations that lie within that region are assigned a certain category label.
  • AdaBoost builds a series of CARTs based on weighted versions of the dataset whose weights depend on the performance of the classifier at the previous iteration (Han J & Kamber M, (2006). Data Mining, Concepts and Techniques, 2nd Ed. Elsevier: Amsterdam).
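The reweighting idea can be sketched for the two-category case as follows. As a simplification, the weak classifiers here are one-split "stumps" rather than full CARTs, labels are coded as +1 and -1, and all function names and data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] > thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def adaboost_fit(X, y, rounds=10):
    """Fit a weighted series of stumps; y must contain +1 / -1 labels."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, sign))
        # Raise the weight of misclassified observations, then renormalize.
        w = [wi * math.exp(-alpha * t * (sign if row[j] > thr else -sign))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x[j] > thr else -s) for a, j, thr, s in ensemble)
    return 1 if score >= 0 else -1


# No single threshold separates these labels, but three stumps combined do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
boosted = adaboost_fit(X, y, rounds=3)
```

Each round concentrates weight on the observations the previous stump got wrong, which is how a series of weak rules can jointly fit a pattern that none of them fits alone.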
  • an ensemble method used on a classification system which combines multiple classifiers.
  • an ensemble method may include SVM, AdaBoost, penalized logistic regression, naive Bayes classifiers, classification trees, k-nearest neighbor classifiers, neural nets, random forests, and/or any combination thereof, in order to make a prediction regarding disease pathology (e.g., NSCLC or normal).
  • the ensemble method was developed to take advantage of the benefits provided by each of the classifiers, and replicate measurements of each plasma specimen.
  • the biomarker measures for each of the biomarkers in each subject's plasma are obtained for multiple samples.
  • replicate plasma samples are collected (e.g., duplicate, triplicate, etc.), and a full complement of biomarker measures is obtained for each sample.
  • Each subject may be predicted as having a disease state (e.g., as NSCLC or normal) based on each of the replicate measurements (e.g., duplicate, triplicate, etc.) using a classification system including at least one classifier, yielding multiple predictions (e.g., four predictions, six predictions, etc.).
  • the ensemble methodology may predict the subject to have NSCLC if at least one of the predictions
  • the test data may be any biomarker measures such as plasma concentration measurements of a plurality of biomarkers.
  • the invention provides a method of classifying test data, the test data comprising biomarker measures that are a plurality of plasma concentration measures of each of a set of biomarkers comprising: (a) accessing an electronically stored set of training data vectors, each training data vector or k-tuple representing an individual human and comprising biomarker measures, such as a plasma concentration measure of each of the set of biomarkers for the respective human for each replicate, the training data further comprising a classification with respect to a disease state of each respective human; (b) training an electronic representation of a classifier and/or an ensemble of classifiers as described herein using the electronically stored set of training data vectors; (c) receiving test data comprising a plurality of plasma concentration measures for a human test subject; (d) evaluating the test data using the electronic representation of the classifier and/or an ensemble of classifiers as described herein; and (e) outputting
  • the invention provides a method of classifying test data, the test data comprising biomarker measures that are a plurality of plasma concentration measures of each of a set of biomarkers comprising: (a) accessing an electronically stored set of training data vectors, each training data vector or k-tuple representing an individual human and comprising biomarker measures, such as a plasma concentration measure of each of the set of biomarkers for the respective human for each replicate, the training data further comprising a classification with respect to a disease state of each respective human; (b) using the electronically stored set of training data vectors to build a classifier and/or ensemble of classifiers; (c) receiving test data comprising a plurality of plasma concentration measures for a human test subject; (d) evaluating the test data using the classifier(s); and (e) outputting a classification of the human test subject based on the evaluating step.
  • all (or any combination of) the replicates may be averaged to produce a single value for each biomarker for each subject. Outputting in accord
  • the classification with respect to a disease state may be the presence or absence of the disease state.
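Two ways of handling replicate measurements are mentioned in this section: averaging replicates into a single value per biomarker, and combining the per-replicate predictions. Both are sketched below. The function names, the `min_positive` parameter, and the sample values are assumptions; in particular, the exact combination rule is not fully recoverable from the text, so the threshold is left configurable.

```python
from statistics import mean


def average_replicates(replicates):
    """Collapse replicate measurements into one value per biomarker.

    replicates: list of dicts (biomarker name -> measure) for a single subject.
    """
    biomarkers = replicates[0].keys()
    return {b: mean(r[b] for r in replicates) for b in biomarkers}


def combine_predictions(predictions, positive="NSCLC", negative="normal",
                        min_positive=1):
    """Call the subject positive if at least min_positive predictions are positive."""
    hits = sum(1 for p in predictions if p == positive)
    return positive if hits >= min_positive else negative


# Hypothetical duplicate measurements for one subject.
averaged = average_replicates([
    {"CTACK": 1.0, "MCSF": 2.0},
    {"CTACK": 3.0, "MCSF": 4.0},
])
```

A low `min_positive` favors sensitivity at the cost of specificity; the appropriate value would need to be chosen on validation data.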
  • the disease state according to this invention may be lung disease such as non-small cell lung cancer or reactive airway disease (e.g., asthma).
  • the set of training vectors may comprise at least 20, 25, 30, 35, 50, 75, 100, 125, 150, or more vectors.
  • the invention also provides for methods of classifying data (such as test data obtained from an individual) that involve reduced sets of biomarkers. That is, training data may be thinned to exclude all but a subset of biomarker measures for a selected subset of biomarkers. Likewise, test data may be restricted to a subset of biomarker measures from the same selected set of biomarkers.
  • the invention provides a method of classifying test data, the test data comprising biomarker measures that are a plurality of plasma concentration measures of each of a set of biomarkers comprising: (a) accessing an electronically stored set of training data vectors, each training data vector representing an individual human and comprising biomarker measures of each biomarker of the set of biomarkers for the respective human, each training data vector further comprising a classification with respect to a disease state of the respective human; (b) selecting a subset of biomarkers from the set of biomarkers; (c) training an electronic representation of a learning machine, such as a classifier or an ensemble of classifiers as described herein, using the data from the subset of biomarkers of the electronically stored set of training data vectors; (d) receiving test data comprising a plurality of plasma concentration measures for a human test subject related to the set of biomarkers in step (a); (e) evaluating the test data using the electronic representation of the learning machine; and (f) outputting a classification
  • the step of selecting a subset of biomarkers comprises: (i) for each biomarker in the set of biomarkers, calculating a distance between a marginal distribution of two groups of concentration measures of the biomarker, whereby a plurality of distances are generated; (ii) ordering the biomarkers in the set of biomarkers according to the distances, whereby an ordered set of biomarkers is generated; (iii) for each of a plurality of initial segments of the ordered set of biomarkers, calculating a measure of model fit for a learning machine, such as classifier or an ensemble of classifiers as described herein, based on the training data; (iv) selecting an initial segment of the ordered set of biomarkers according to a maximum measure of model fit, whereby a preferred initial segment of the ordered set of biomarkers is selected; (v) starting with the null set of biomarkers, recursively adding to the model additional biomarkers from the preferred initial segment of the ordered set of biomarkers to generate the
  • the methods, kits, and systems described herein may involve determining biomarker measures of a selected plurality of biomarkers.
  • the method comprises determining biomarker measures of a subset of any three particular biomarkers of the biomarkers described in the Examples.
  • the method comprises determining biomarker measures of a subset of at least four, five, six, or seven particular biomarkers of the biomarkers described in the Examples.
  • the method comprises determining biomarker measures of a subset of at least eight, nine, ten, eleven, twelve, or thirteen particular biomarkers of the biomarkers described in the Examples.
  • the method comprises determining biomarker measures of a subset of at least fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more (e.g., fifty-nine) particular biomarkers of the biomarkers described in the Examples.
  • the methods, kits, and systems described herein may use a specific subset of biomarkers (e.g., at least five biomarkers), and one or more biomarkers from another subset of biomarkers (e.g., at least fifteen biomarkers).
  • the methods, kits, and systems described herein provide for the determination of at least the following five biomarkers: CTACK, MCSF, FGF-basic, Survivin, and Eotaxin-3.
  • Eotaxin, sICAM, IP-10, sE-selectin, sVCAM.1, TNF-b, TNFR1, BDNF, MIG, MIF, IL23-p19, IL12-p40, Leptin, IL-2ra, and/or IL-8.
  • the skilled person will recognize that it is within the contemplation of this invention to contemporaneously determine biomarker measures of additional biomarkers whether or not associated with the disease of interest. Determination of these additional biomarker measures will not prevent the classification of a subject according to the present invention.
  • the maximum number of biomarkers whose measures are included in the training data and test data of any of the methods of this invention may be, for example, at least six distinct biomarkers, at least ten distinct biomarkers, at least twelve distinct biomarkers, at least fifteen distinct biomarkers, at least eighteen distinct biomarkers, at least twenty distinct biomarkers, or at least twenty-five distinct biomarkers.
  • a skilled person would understand that the number of biomarkers should be limited to avoid inaccurate predictions due to overfitting.
  • the subsets of biomarkers may be determined by using the methods of reduction described herein.
  • the invention provides various model selection algorithms (e.g., F_SSFS) for finding subsets of biomarkers that contribute the highest measure of model fit and thus retain a high accuracy of predictability. Examples 7-10 show a reduced model of particular subsets of biomarkers.
  • the biomarkers are chosen from a computed subset which contains the biomarkers contributing a highest measure of model fit. As long as those biomarkers are included, the invention does not preclude the inclusion of a few additional biomarkers that do not necessarily contribute. Nor will including such additional biomarker measures in a classifying model preclude classification of test data, so long as the model is devised as described herein. In other embodiments, biomarker measures of no more than 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 35, 40 or 50 biomarkers are determined for the subject, and the same number of biomarkers are used in the training phase.
  • the selected biomarkers are chosen from a computed subset from which biomarkers that contribute the least to a measure of model fit have been removed. As long as those selected biomarkers are included, the invention does not preclude the inclusion of a few additional biomarkers that do not necessarily contribute. Nor will including such additional biomarker measures in a classifying model preclude classification of test data, so long as the model is devised as described herein. In other embodiments, biomarker measures of no more than 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 35, 40 or 50 biomarkers are determined for the subject, and the same number of biomarkers are used in the training phase.
  • the various combinations of biomarkers described herein are also applicable to methods for designing kits and the kits and systems described herein.
  • the number of biomarkers used by the learning machine, such as a support vector machine, to classify observations or test data using a trained model is reduced using the F_SSFS method of Lee (Lee, 2009) extended to an arbitrary number of groups.
  • the F_SSFS method (i) determines a set of variables that are good candidates to be kept in the model; and (ii) selects candidates on the basis of their F-score, which quantifies the separation between the values of the variable between the groups. Forward model selection is applied to this variable set with variables added to the model on the basis of their improvement in the accuracy of the learning machine.
  • variables are biomarkers and groups are lung pathology categories.
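To make the idea of ranking by F-score concrete, the sketch below uses the conventional mean-and-variance F-score from the feature selection literature: the sum over groups of squared differences between group mean and overall mean, divided by the sum of within-group sample variances. This is an assumption made for illustration and is not claimed to be the exact score used in the source, which elsewhere refers to medians and quartiles. Function names and data are hypothetical.

```python
def f_score(values, labels):
    """Between-group separation divided by within-group spread for one variable.

    Requires at least two observations per group and nonzero within-group spread.
    """
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    overall = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for vals in groups.values():
        m = sum(vals) / len(vals)
        between += (m - overall) ** 2
        within += sum((v - m) ** 2 for v in vals) / (len(vals) - 1)
    return between / within


def rank_biomarkers(X, labels, names):
    """Order biomarker names by descending F-score."""
    scores = {name: f_score([row[j] for row in X], labels)
              for j, name in enumerate(names)}
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Hypothetical data: the first column separates the groups, the second does not.
X = [[1, 5], [2, 6], [3, 4], [7, 5], [8, 4], [9, 6]]
labels = ["normal"] * 3 + ["NSCLC"] * 3
ranked = rank_biomarkers(X, labels, ["CTACK", "MCSF"])
```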
  • Exemplary learning machines include classifiers and/or an ensemble of classifiers as described herein.
  • a different technique for selecting a subset of biomarkers is disclosed presently.
  • An exemplary algorithm for this technique is comprised of the following steps:
  • the term $m$ is the number of groups under consideration. In most instances of using a learning machine, such as a support vector machine, $m = 2$.
  • the term $\tilde{x}_j$ represents the median of the biomarker measures of $g_j$ in the set of training data vectors.
  • the term $\tilde{x}_{s,j}$ represents the median of the $s$-th group of the biomarker $g_j$, where each group is defined according to training data vector classification.
  • the F-score of the $j$-th variable is defined as where is the
  • $n_s$ is the number of observations from group $s$.
  • the terms $x_{s,j}(0.75)$ and $x_{s,j}(0.25)$ denote the top and bottom quartile of the $s$-th group's distribution, respectively (for biomarker $g_j$).
  • an alternative to using the empirical training data vector classifications to define the two groups indexed by s is to use an initial run of a support vector machine that implements all biomarkers to classify each training vector into discrete groups.
  • For each $K_t \in K$, keep the first $K_t$ biomarkers in the model associated with the trained learning machine, ordered (descending) according to score, and calculate a measure of model fit (such as sensitivity or accuracy). In other words, keep an initial segment of biomarkers, ordered according to (11), in the model, and calculate a measure of model fit such as percentage of correctly-classified test vectors. (Other measures of model fit include accuracy, sensitivity, specificity, positive predictive value, and negative predictive value; see, e.g., Table 2.) This is performed for each initial segment of biomarkers (i.e., the first biomarker up to the $K_t$-th biomarker, ordered according to (11)) for each $K_t$ in $K$.
  • a measure of model fit such as sensitivity or accuracy
  • Let $K'$ be the $K_t$ such that the model associated with that $K_t$ has the highest measure of model fit.
  • the threshold may be 0.0005, 0.0001, 0.005, 0.001, 0.05, 0.01, 0.5, or 0.1.
  • the first iteration of steps 6 and 7 adds a single biomarker to the model (unless no biomarker meets the threshold criteria), and each subsequent iteration adds an additional biomarker until the process stops according to the threshold criteria. Accordingly, steps 6-8 provide a recursive algorithm for selecting a reduced set of biomarkers.
  • steps 1 and 2 above are directed to ordering the biomarkers according to marginal distributions.
  • biomarkers may be ranked according to distance between central tendencies (e.g., medians) of marginal distributions of two groups of biomarker measures in a set of training vectors. (Alternate central tendencies, such as modes or means, may be used instead of medians.)
  • Each group corresponds to a classification, and these classifications may be obtained from the empirical classifications contained within the training data itself, or they may be obtained from an initial run of a learning machine or classification system that utilizes all biomarkers.
  • the biomarkers are ranked as a function of the discriminatory ability of the biomarker measures between the two groups, where the two groups correspond to classifications, whether empirical or generated by an initial run of a learning machine.
  • Steps 3, 4 and 5 above are directed to selecting an initial segment of the marginal-distribution-decreasingly-ordered biomarkers such that the selected initial segment has the best model fit for the set of training vectors from among the other initial segments.
  • This initial segment will serve as a universe of biomarkers from which the final, reduced set of biomarkers is selected according to steps 6, 7 and 8.
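The selection of an initial segment described above can be sketched as follows. The function `best_initial_segment`, the toy `toy_fit` scoring function, and the set of "informative" biomarkers are hypothetical stand-ins; in practice `model_fit` would train a learning machine on the given biomarkers and return a measure such as accuracy.

```python
def best_initial_segment(ranked, candidate_sizes, model_fit):
    """Return the prefix of `ranked` whose length maximizes the measure of model fit.

    ranked: biomarkers in descending order of score.
    candidate_sizes: the prefix lengths to try.
    model_fit: callable taking a list of biomarkers and returning a number.
    """
    best_k = max(candidate_sizes, key=lambda k: model_fit(ranked[:k]))
    return ranked[:best_k]


# Toy stand-in for training and scoring a learning machine.
informative = {"CTACK", "MCSF"}


def toy_fit(subset):
    # Reward informative biomarkers, lightly penalize model size.
    return len(informative & set(subset)) - 0.1 * len(subset)


ranked = ["CTACK", "MCSF", "Survivin", "IL-8"]
universe = best_initial_segment(ranked, [1, 2, 3, 4], toy_fit)
```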
  • Steps 6, 7 and 8 are directed to recursively adding biomarkers to the model, starting with the base case of no biomarkers.
  • the sequentially added biomarkers are selected according to their contribution to model fit, without respect to their marginal-distribution order.
  • the basis step is to consider the empty set of biomarkers to be in the model.
  • a learning machine is generated for each remaining biomarker together with the current set of biomarkers in the model.
  • the remaining biomarker that corresponds to the most accurate learning machine when added to the existing biomarkers is a candidate for sequential addition. As long as a candidate biomarker's contribution to model fit surpasses a threshold, it is added in sequence. This process of sequentially adding biomarkers continues until the best remaining biomarker fails to improve model fit beyond the predetermined threshold.
  • this process starts by selecting an initial universe of biomarkers in steps 1-5, then proceeds to select the ultimate reduced set of biomarkers from this universe according to steps 6, 7 and 8.
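  • The sequential addition of steps 6, 7 and 8 can be sketched as follows. This is an illustration under stated assumptions: a nearest-centroid classifier stands in for the learning machine, training accuracy stands in for the measure of model fit, and the names `centroid_accuracy` and `forward_select`, the toy data, and the threshold of 0.05 are all hypothetical.

```python
def centroid_accuracy(vectors, labels, subset):
    """Training accuracy of a nearest-centroid classifier restricted to `subset`."""
    classes = sorted(set(labels))
    if not subset:
        # Empty model: predict the most common class.
        return max(labels.count(c) for c in classes) / len(labels)
    centroids = {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for v, y in zip(vectors, labels):
        predicted = min(
            classes,
            key=lambda c: sum((v[j] - m) ** 2 for j, m in zip(subset, centroids[c])),
        )
        correct += predicted == y
    return correct / len(labels)


def forward_select(universe, model_fit, threshold):
    """Add biomarkers one at a time while the best gain in fit exceeds `threshold`."""
    selected = []  # basis step: the empty set of biomarkers
    current_fit = model_fit(selected)
    remaining = list(universe)
    while remaining:
        # Score each remaining biomarker together with the current model.
        best = max(remaining, key=lambda b: model_fit(selected + [b]))
        best_fit = model_fit(selected + [best])
        if best_fit - current_fit <= threshold:
            break  # the best candidate fails to surpass the threshold
        selected.append(best)
        remaining.remove(best)
        current_fit = best_fit
    return selected


# Hypothetical training vectors: three biomarker measures per subject.
vectors = [
    [0.0, 0.0, 1.0],
    [0.0, 5.0, 2.0],
    [3.0, 0.0, 1.0],
    [4.0, 5.0, 2.0],
    [4.0, 5.0, 1.0],
    [4.0, 0.0, 2.0],
]
labels = [0, 0, 0, 1, 1, 1]
reduced = forward_select(
    [0, 1, 2], lambda s: centroid_accuracy(vectors, labels, s), threshold=0.05
)
```

  • In this toy data the first two biomarkers are added in turn and the third is rejected because it cannot improve on a model that already fits the training vectors.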
  • the reduced set of biomarkers can be derived by changing the initial model defined in step 6 to be the superset defined in step 5 and, instead of adding each biomarker from the superset, remove each biomarker, one by one, and calculate a measure of model fit. Subsequently, change step 7 to remove the biomarker with the least diminishment of a measure of model fit such that the measure of model fit was not diminished by more than a predetermined threshold. Then, follow step 8 where the stopping condition becomes the lack of removal of a biomarker in step 7 as opposed to the lack of an addition of a biomarker in step 7.
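  • The removal-based variant can be sketched as follows. The function names, the biomarker labels and the fit function `toy_fit` are hypothetical, and the sketch assumes that the threshold is applied to the loss of fit at each individual removal, measured against the current model.

```python
def backward_eliminate(superset, model_fit, threshold):
    """Remove biomarkers one at a time while the smallest loss of fit stays within `threshold`."""
    selected = list(superset)  # initial model: the whole superset
    current_fit = model_fit(selected)
    while selected:
        def loss(b):
            # Diminishment of model fit if biomarker `b` were removed.
            return current_fit - model_fit([m for m in selected if m != b])

        candidate = min(selected, key=loss)
        if loss(candidate) > threshold:
            break  # no biomarker can be removed: stopping condition
        selected.remove(candidate)
        current_fit = model_fit(selected)
    return selected


def toy_fit(subset):
    """Hypothetical fit: only 'A' and 'B' carry signal; 'C' and 'D' are redundant."""
    score = 0.5
    if "A" in subset:
        score += 0.3
    if "B" in subset:
        score += 0.15
    return score


reduced = backward_eliminate(["A", "B", "C", "D"], toy_fit, threshold=0.01)
```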
  • the above biomarker subset selection algorithm can elucidate the connections and correlations of the biomarkers considered. To achieve this, remove the threshold in step 7 in the above algorithm and store the biomarkers added according to the rank of their marginal improvement in accuracy relative to the model suggested by the previous iteration at each iteration of the algorithm or the increase of the accuracy between each iteration and the iteration preceding it.
  • the methods of classifying data using reduced sets or subsets of biomarkers may be used in any of the methods described herein.
  • the methods of classifying data using reduced numbers of biomarkers described herein may be used in methods for physiological characterization, based in part on a classification according to this invention, and methods of diagnosing lung disease such as non-small cell lung cancer and reactive airway disease (e.g., asthma).
  • Biomarkers, other than the reduced number of biomarkers, may also be added. These additional biomarkers may or may not contribute to or enhance the diagnosis.
  • biomarkers for use in a diagnostic or prognostic assay may be facilitated using known relationships between particular biomarkers and their first order interactors. Many, if not all, of the biomarkers identified by the present inventors participate in various communications pathways of the cell or the organism. Deviation of one component of a communication pathway from normal is expected to be accompanied by related deviations in other members of the communication pathway.
  • the skilled worker can readily link members of a communication pathway using various databases and available bioinformatics software (see, e.g., ARIADNE PATHWAY STUDIO®, Ariadne, Inc., <www.ariadne.genomics.com> or ChEMBL Database, European Bioinformatics Institute, European Molecular Biology Laboratory, <www.ebi.ac.uk>).
  • a diagnostic method based on determining the levels of a plurality of biomarkers where the plurality of biomarkers includes some biomarkers which are not in the same communication pathway as others in the plurality is likely to maximize the information provided by measuring the biomarker levels.
  • any biomarker in a selected subset may be substituted by another biomarker from the same communications pathway (i.e., first order interactors of the biomarker).
  • substituting a first order interactor for a biomarker may involve re-training the classification system using the substituted biomarker.
  • the present invention is directed to methods for physiological characterization, based in part on a classification according to this invention, of individuals in various populations as described below.
  • methods of physiological characterization according to this invention include methods of diagnosing particular lung diseases, methods of predicting the likelihood that an individual will respond to therapeutic intervention, methods of determining whether an individual is at-risk for an individual lung disease, methods for categorizing a patient's degree of severity of disease, and methods for differentiating between diseases having some symptoms in common.
  • these methods rely on determining biomarker measures of particular biomarkers described herein and using these values in a classification system such as a classification system using one or more classifiers or an ensemble of classifiers as described herein to classify individuals according to one of these physiologic characterizations.
  • the invention provides for methods of physiological characterization, based in part on a classification according to this invention, in a subject comprising determining biomarker measures of a plurality of biomarkers in a physiological sample of the subject, where a pattern of expression of the plurality of markers correlates to a physiologic state or condition, or changes in a disease state (e.g., stages in non-small cell lung cancer) or condition.
  • a pattern of expression of a plurality of biomarkers is indicative of a lung disease such as non-small cell lung cancer or reactive airway disease, or assists in distinguishing between reactive airway disease and non-small cell lung cancer.
  • the plurality of biomarkers are selected based on the analysis of training data via a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • the training data will include a plurality of biomarkers for numerous subjects, as well as disease categorization information (e.g., j, of Equ. (1)) for the individual subjects, and optionally, other characteristics of the subjects, such as sex, race, age, smoking history, employment history, etc.
  • patterns of expression of biomarkers correlate to an increased likelihood that a subject has or may have a particular disease or condition.
  • methods of determining biomarker measures of a plurality of biomarkers in a subject detect an increase in the likelihood that a subject is developing, has or may have a lung disease such as non-small cell lung cancer or reactive airway disease (e.g., asthma).
  • Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any combination of the biomarkers described in Example 1, Example 10, and/or Example 16.
  • individuals with lung nodules greater than 20 mm in size are likely to have lung cancer, while those with lung nodules less than 6 mm have less, if any, risk.
  • lung nodules of size between 6-20 mm are generally considered "indeterminate."
  • the method of this invention may be especially useful for those with indeterminate nodules, because it can provide an additional degree of stratification for characterizing those individuals.
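  • The size ranges above can be expressed as a small helper for selecting the subjects with indeterminate nodules. This is an illustrative sketch only: the function name `nodule_category` and the category labels are hypothetical, and the boundary values of 6 mm and 20 mm are assumed to fall in the indeterminate range.

```python
def nodule_category(diameter_mm):
    """Map a lung nodule diameter in millimetres to the size ranges described above."""
    if diameter_mm > 20:
        return "likely cancer"      # greater than 20 mm
    if diameter_mm < 6:
        return "low risk"           # less than 6 mm
    return "indeterminate"          # 6-20 mm, boundaries assumed inclusive


# Hypothetical cohort: keep only the subjects for whom further stratification
# by a biomarker classification may be especially useful.
diameters = {"S1": 4.0, "S2": 12.5, "S3": 23.0, "S4": 6.0}
indeterminate = [s for s, d in diameters.items() if nodule_category(d) == "indeterminate"]
```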
  • the present invention provides for enhanced characterization of "at-risk" individuals by determining biomarker measures of relevant biomarkers and providing classification of such individuals according to the models described herein, where the classification may be used by the skilled clinician to supplement other physiological characterizations, based in part on classifications according to this invention, of the individual when developing a comprehensive diagnosis.
  • biomarkers described in the Examples are exemplified by a list of biomarkers described in the Examples. It will be appreciated that subsets of these biomarkers, such as those described in Examples 7-11 and 16, may be used in any of the described embodiments. Biomarker measures for other biomarkers may be included at the discretion of the skilled person, recognizing that including biomarkers in excess of the minimum number of biomarkers needed to provide optimum classifying accuracy may reduce the benefit, due to over-fitting.
  • the invention provides for methods of physiological characterization, based in part on a classification according to this invention, in a male subject comprising determining biomarker measures of a plurality of biomarkers in a physiological sample of the male subject, where a pattern of expression of the plurality of markers correlates to a physiologic state or condition, or changes in a disease state (e.g., stages in non-small cell lung cancer) or condition.
  • a pattern of expression of a plurality of biomarkers is indicative of a lung disease such as non-small cell lung cancer or reactive airway disease, or assists in distinguishing between reactive airway disease and non-small cell lung cancer.
  • the plurality of biomarkers are selected based on collection of training data comprising biomarker measures for a number of male subjects identified as having the disease state in question and a similar number which are known not to have the disease.
  • This training data is then analyzed by a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in the Examples such as Examples 1-5, 7-8, or 11.
  • the male subject is at-risk for the lung disease of non-small cell cancer or reactive airway disease (e.g., asthma, chronic obstructive pulmonary disease, etc.).
  • the invention also provides for a method of physiological characterization, based in part on a classification according to this invention, in a female subject.
  • the invention provides for methods of physiological characterization in a female subject comprising determining biomarker measures of a plurality of biomarkers in a physiological sample of the female subject, where a pattern of expression of the plurality of markers correlates to a physiologic state or condition, or changes in a disease state (e.g., stages in non-small cell lung cancer) or condition.
  • a pattern of expression of a plurality of biomarkers is indicative of a lung disease such as non-small cell lung cancer or reactive airway disease, or assists in distinguishing between reactive airway disease and non-small cell lung cancer. Methods for these embodiments are similar to those described above, except that the subjects in the training data set are female.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in the Examples such as Examples 1-4, 6-7, and 9-11.
  • the female subject is at-risk for the lung disease of non-small cell cancer or reactive airway disease (e.g., asthma, chronic obstructive pulmonary disease, etc.). "At-risk" subjects and individuals are discussed above.
  • the invention provides for various diagnostic and prognostic methods for lung disease.
  • the invention provides methods of diagnosing reactive airway disease and in particular diseases associated with over-reactive TH2 and TH17 cells.
  • Reactive airway diseases include asthma, chronic obstructive pulmonary disease, allergic rhinitis, cystic fibrosis, bronchitis, or other diseases manifesting hyper-reactivity to various physiological and/or environmental stimuli.
  • the invention provides for methods of diagnosing asthma and chronic obstructive pulmonary disease, more particularly diagnosing asthma.
  • the invention also provides methods of diagnosing non-small cell lung cancer. These methods include determining biomarker measures of a plurality of biomarkers described herein, wherein the biomarkers are indicative of the presence or development of non-small cell lung cancer. For example, biomarker measures of biomarkers described herein may be used to assist in determining the extent of progression of non-small cell lung cancer, the presence of pre-cancerous lesions, or staging of non-small cell lung cancer.
  • the subject is selected from those individuals who exhibit one or more symptoms of non-small cell lung cancer or reactive airway disease.
  • Symptoms may include cough, shortness of breath, wheezing, chest pain, and hemoptysis; shoulder pain that travels down the outside of the arm; or paralysis of the vocal cords leading to hoarseness. Invasion of the esophagus may lead to difficulty swallowing. If a large airway is obstructed, collapse of a portion of the lung may occur and cause infections leading to abscesses or pneumonia. Metastases to the bones may produce excruciating pain. Metastases to the brain may cause neurologic symptoms including blurred vision, headaches, seizures, or symptoms commonly associated with stroke such as weakness or loss of sensation in parts of the body.
  • Lung cancers often produce symptoms that result from production of hormone-like substances by the tumor cells.
  • a common paraneoplastic syndrome seen in NSCLC is the production of parathyroid hormone-like substances, which cause calcium in the bloodstream to be elevated.
  • Asthma typically produces symptoms such as coughing, especially at night, wheezing, shortness of breath and feelings of chest tightness, pain or pressure.
  • symptoms of asthma are common to NSCLC.
  • the present invention is directed to methods of diagnosing reactive airway disease in individuals in various populations as described below. In general, these methods rely on determining biomarker measures of particular biomarkers as described herein, and classifying the biomarker measures using a classification system that includes a classifier or an ensemble of classifiers as described herein.
  • the invention provides for a method of diagnosing reactive airway disease in a subject comprising, (a) obtaining a physiological sample of the subject; (b) determining biomarker measures of a plurality of biomarkers, as described herein, in said sample; and (c) classifying the sample based on the biomarker measures using a classification system, wherein the classification of the sample is indicative of reactive airway disease in the subject.
  • the invention provides for methods of diagnosing reactive airway disease in a subject comprising determining biomarker measures of a plurality of biomarkers in a physiological sample of the subject, wherein a pattern of expression of the plurality of markers is indicative of reactive airway disease or correlates to changes in a reactive airway disease state.
  • the plurality of the biomarkers are selected based on analysis of training data via a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • the training data will include a plurality of biomarker measures for numerous subjects, as well as disease categorization for the individual subjects, and optionally, other characteristics of the subjects, such as sex, race, age, smoking history, employment history, etc.
  • patterns of expression correlate to an increased likelihood that a subject has or may have reactive airway disease.
  • Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in Example 1, Example 10 or Example 11.
  • the invention provides for a method of diagnosing reactive airway disease in a male subject. Methods for these embodiments are similar to those described above, except that the subjects are male for both the training data and the sample.
  • the invention provides for a method of diagnosing reactive airway disease in a female subject. Methods for these embodiments are similar to those described above, except that the subjects are female for both the training data and the sample.
  • the present invention is directed to methods of diagnosing non-small cell lung cancer in individuals in various populations as described below. In general, these methods rely on determining biomarker measures of particular biomarkers as described herein, and classifying the biomarker measures using a classification system that includes a classifier or an ensemble of classifiers as described herein.
  • the invention provides for a method of diagnosing non-small cell lung cancer in a subject comprising, (a) obtaining a physiological sample of the subject; (b) determining biomarker measures of a plurality of biomarkers, as described herein, in said sample; and (c) classifying the sample based on the biomarker measures using a classification system, wherein the classification of the sample is indicative of the presence or development of non-small cell lung cancer in the subject.
  • the invention provides for methods of diagnosing non-small cell lung cancer in a subject comprising determining biomarker measures of a plurality of biomarkers in a physiological sample of the subject, wherein a pattern of expression of the plurality of markers is indicative of non-small cell lung cancer or correlates to changes in a non-small cell lung cancer disease state (i.e., clinical or diagnostic stages).
  • the plurality of the biomarkers are selected based on analysis of training data via a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • the training data will include a plurality of biomarker measures for numerous subjects, as well as disease categorization for the individual subjects, and optionally, other characteristics of the subjects, such as sex, race, age, smoking history, employment history, etc.
  • patterns of expression correlate to an increased likelihood that a subject has or may have non-small cell lung cancer. Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in Examples 1, 10, or 17.
  • the subject is at-risk for non-small cell lung cancer.
  • the subject is selected from those individuals who exhibit one or more symptoms of non-small cell lung cancer.
  • the invention provides for a method of diagnosing non-small cell lung cancer in a male subject. Methods for these embodiments are similar to those described above, except that the subjects are male for both the training data and the sample.
  • the invention provides for a method of diagnosing non-small cell lung cancer in a female subject. Methods for these embodiments are similar to those described above, except that the subjects are female for both the training data and the sample.
  • the classification methods of this invention may be used in conjunction with computerized tomography to provide an enhanced procedure for screening and early detection of NSCLC.
  • one of the classification methods described herein is applied to biomarker measures for a plurality of biomarkers in one or more physiological samples from a subject who has at least one lung nodule detected by CT scan.
  • the subject has at least one lung nodule with a diameter between six and twenty mm. Classification of the samples as NSCLC or Normal can assist in the ultimate diagnostic characterization of such patients.
  • the invention provides for methods of treatment based on the output of any of the classification methods described herein.
  • the invention provides for a method of treating a subject for NSCLC following a classification of "NSCLC" using any of the classification methods described herein.
  • the invention includes methods of treatment based on a diagnosis developed using the classification methods described herein in conjunction with additional analysis (e.g., CT scan).
  • the present invention is directed to methods of diagnosing lung disease in individuals in various populations as described below. In general, these methods rely on determining biomarker measures of particular biomarkers that discriminate between the indication of reactive airway disease and non-small cell lung cancer, and classifying the biomarker measures using a classification system including a classifier or an ensemble of classifiers as described herein.
  • the invention provides for a method of diagnosing a lung disease in a subject comprising determining biomarker measures in said subject of a plurality of biomarkers, wherein the biomarker measures of said plurality of biomarkers assist in discriminating between the indication of reactive airway disease and non-small cell lung cancer.
  • the subject has been diagnosed as having reactive airway disease and/or non-small cell lung cancer.
  • the diagnosis may have been determined by the biomarker measures of at least one biomarker in a physiological sample of the subject, where the biomarker measure of the at least one biomarker is indicative of reactive airway disease and/or non-small cell lung cancer.
  • the invention also provides for a method of diagnosing a lung disease in a subject comprising, (a) obtaining a physiological sample of the subject; (b) determining biomarker measures of a plurality of biomarkers that assist in discriminating between the indication of reactive airway disease and non-small cell lung cancer, a plurality of biomarkers indicative of reactive airway disease, and a plurality of biomarkers indicative of non-small cell lung cancer, as described herein, in said sample, wherein said plurality of biomarkers are not identical; (c) classifying the sample based on the biomarker measures using a classification system, wherein the classification of the sample assists in discriminating between the indication of (i) reactive airway disease and non-small cell lung cancer; (ii) presence or absence of reactive airway disease; and (iii) presence or absence of non-small cell lung cancer in the subject; and (d) determining the subject to have (1) reactive airway disease; (2) non-small cell lung cancer; or (3) absence of disease depending on which condition is found in two of the three classification
  • the plurality of the biomarkers are selected based on analysis of training data via a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • the training data will include a plurality of biomarker measures for numerous subjects, as well as disease categorization for the individual subjects, and optionally, other characteristics of the subjects, such as sex, race, age, smoking history, employment history, etc.
  • patterns of expression correlate to an increased likelihood that a subject has non-small cell lung cancer or reactive airway disease. Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in Examples 1, 10, or 17.
  • the subject is at-risk for non-small cell lung cancer and/or reactive airway disease.
  • the subject is selected from those individuals who exhibit one or more symptoms of non-small cell lung cancer and/or reactive airway disease.
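  • The rule of step (d), under which the condition found in two of the three classifications is reported, can be sketched as follows. The function name `two_of_three` and the labels "RAD", "NSCLC" and "NONE" are hypothetical, and returning `None` when all three classifications disagree is an assumption, since that case is not addressed in the text.

```python
from collections import Counter


def two_of_three(rad_vs_nsclc, rad_vs_none, nsclc_vs_none):
    """Combine three pairwise classifications into a single determination.

    rad_vs_nsclc:  "RAD" or "NSCLC"  (classification i)
    rad_vs_none:   "RAD" or "NONE"   (classification ii)
    nsclc_vs_none: "NSCLC" or "NONE" (classification iii)
    """
    votes = Counter([rad_vs_nsclc, rad_vs_none, nsclc_vs_none])
    label, count = votes.most_common(1)[0]
    # Assumption: with no condition found twice, no determination is made.
    return label if count >= 2 else None


determination = two_of_three("NSCLC", "NONE", "NSCLC")
```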
  • the invention also provides a diagnostic method to assist in differentiating the likelihood that a subject is at-risk of developing or suffering from non-small cell lung cancer or reactive airway disease comprising, (a) obtaining a physiological sample of the subject who is at-risk for non-small cell lung cancer or reactive airway disease; (b) determining the biomarker measures in said subject of a plurality of biomarkers that assist in differentiating the likelihood that said subject is at risk of non-small cell lung cancer or reactive airway disease, as described herein, in said sample; (c) classifying the sample based on the biomarker measures using a classification system, wherein the classification of the sample assists in discriminating between the indication of (i) reactive airway disease and non-small cell lung cancer; (ii) presence or absence of reactive airway disease; and (iii) presence or absence of non-small cell lung cancer in the subject; and (d) determining the subject to be at-risk of developing or suffering from (1) reactive airway disease; (2) non-small cell lung cancer; or (3) absence of
  • the plurality of the biomarkers are selected based on analysis of training data via a machine learning algorithm such as a classifier or an ensemble of classifiers as described herein.
  • the training data will include a plurality of biomarker measures for numerous subjects, as well as disease categorization for the individual subjects, and optionally, other characteristics of the subjects, such as sex, race, age, smoking history, employment history, etc.
  • patterns of expression correlate to an increased likelihood that a subject has non-small cell lung cancer or reactive airway disease. Patterns of expression may be characterized by any technique known in the art for pattern recognition, such as those described as classifiers and/or an ensemble of classifiers as described herein.
  • the plurality of biomarkers may comprise any of the combinations of biomarkers described in Examples 1, 10, or 17.
  • the subject is selected from those individuals who exhibit one or more symptoms of non-small cell lung cancer or reactive airway disease. Methods relating to "at-risk" subjects are described above and methods related thereto are contemplated herein.
  • the invention provides for a method of diagnosing a lung disease in a male subject. Methods for these embodiments are similar to those described above, except that the subjects are male for both the training data and the sample.
  • the invention provides for a method of diagnosing a lung disease in a female subject. Methods for these embodiments are similar to those described above, except that the subjects are female for both the training data and the sample.
  • the invention also provides a method for designing a system for diagnosing a lung disease in a subject comprising (a) selecting a plurality of biomarkers; (b) selecting a means for determining biomarker measures of said plurality of biomarkers; and (c) designing a system comprising said means for determining biomarker measures and means for analyzing the biomarker measures to determine the likelihood that a subject is suffering from lung disease.
  • the invention also provides a method for designing a system for diagnosing non-small cell lung cancer comprising (a) selecting a plurality of biomarkers; (b) selecting a means for determining the biomarker measures of said plurality of biomarkers; and (c) designing a system comprising said means for determining the biomarker measures and means for analyzing the biomarker measures to determine the likelihood that a subject is suffering from non-small cell lung cancer.
  • the invention also provides a method for designing a system for diagnosing reactive airway disease in a subject comprising (a) selecting a plurality of biomarkers; (b) selecting a means for determining the biomarker measures of said plurality of biomarkers; and (c) designing a system comprising said means for determining the biomarker measures and means for analyzing the biomarker measures to determine the likelihood that a subject is suffering from reactive airway disease.
  • the invention also provides a method for designing a system for diagnosing non-small cell lung cancer or reactive airway disease in a subject comprising (a) selecting a plurality of biomarkers; (b) selecting a means for determining biomarker measures of said plurality of biomarkers; and (c) designing a system comprising said means for determining the biomarker measures and means for analyzing the biomarker measures to determine the likelihood that a subject is suffering from reactive airway disease.
  • the plurality of biomarkers comprises biomarkers indicative of non-small cell lung cancer, biomarkers indicative of reactive airway disease, and biomarkers that assist in discriminating between non-small cell lung cancer and reactive airway disease.
  • steps (b) and (c) may alternatively be performed by (b) selecting detection agents for detecting said plurality of biomarkers, and (c) designing a system comprising said detection agents for detecting said plurality of biomarkers.
  • the invention also provides a method for designing a system for assisting in diagnosing a lung disease in a male subject. Methods for these embodiments are similar to those described above.
  • the invention also provides a method for designing a system for assisting in diagnosing a lung disease in a female subject. Methods for these embodiments are similar to those described above.
  • kits comprising means for determining the biomarker measures of a plurality of biomarkers described herein.
  • the invention also provides kits comprising detection agents for detecting a plurality of biomarkers described herein.
  • the plurality of biomarkers may comprise biomarkers indicative of non-small cell lung cancer, biomarkers indicative of reactive airway disease, and/or biomarkers that assist in discriminating between non-small cell lung cancer and reactive airway disease.
  • these biomarkers are reduced sets of biomarkers determined by the methods described herein.
  • the kit provides means for determining biomarker measures for at least the following five biomarkers: CTACK, MCSF, FGF-basic, Survivin, and Eotaxin-3.
  • the kit comprises means for determining at least the following five biomarkers: CTACK, MCSF, FGF-basic, Survivin, and Eotaxin-3, and one or more of the following fifteen biomarkers: Eotaxin, sICAM, IP-10, sE-selectin, sVCAM-1, TNF-b, TNFR1, BDNF, MIG, MIF, IL23-p19, IL12-p40, Leptin, IL-2ra, and/or IL-8.
  • the invention also provides a kit comprising, (a) first means for determining the biomarker measures of a plurality of biomarkers indicative of non-small cell lung cancer; and (b) second means for determining the biomarker measures of a plurality of biomarkers indicative of reactive airway disease, wherein said biomarkers in (a) and (b) are not identical.
  • the invention also provides a kit comprising, (a) detection agents for detecting a plurality of biomarkers indicative of non-small cell lung cancer; and (b) detection agents for detecting a plurality of biomarkers indicative of reactive airway disease, wherein said biomarkers in (a) and (b) are not identical.
  • the invention also provides a kit comprising, (a) first means for determining biomarker measures of a plurality of biomarkers indicative of non-small cell lung cancer; (b) second means for determining biomarker measures of a plurality of biomarkers indicative of reactive airway disease; and (c) third means for determining biomarker measures of a plurality of biomarkers that assist in discriminating between non-small cell lung cancer and reactive airway disease, wherein said biomarkers in (a), (b), and (c) are not identical.
  • the invention also provides a kit comprising, (a) detection agents for detecting a plurality of biomarkers indicative of non-small cell lung cancer; (b) detection agents for detecting a plurality of biomarkers indicative of reactive airway disease; and (c) detection agents for detecting a plurality of biomarkers that assist in discriminating between non-small cell lung cancer and reactive airway disease, wherein said biomarkers in (a), (b), and (c) are not identical.
  • kits comprising means for detecting any particular combination of biomarkers described above for any method requiring detection of a particular plurality of biomarkers.
  • the invention provides for systems that assist in performing the methods of the invention.
  • the exemplary classification system comprises a storage device for storing a training data set and/or a test data set and a computer for executing a learning machine, such as a classifier or an ensemble of classifiers as described herein.
  • the computer may also be operable for collecting the training data set from the database, pre-processing the training data set, training the learning machine using the pre-processed training data set and, in response to receiving the test output of the trained learning machine, post-processing the test output to determine if the test output is an optimal solution.
  • Such pre-processing may comprise, for example, visually inspecting the data to detect and remove obviously erroneous entries, normalizing the data by dividing by appropriate standard quantities, and ensuring that the data is in proper form for use in the respective algorithm.
  • the exemplary system may also comprise a communications device for receiving the test data set and the training data set from a remote source.
  • the computer may be operable to store the training data set in the storage device prior to the pre-processing of the training data set and to store the test data set in the storage device prior to the pre-processing of the test data set.
  • the exemplary system may also comprise a display device for displaying the post-processed test data.
  • the computer of the exemplary system may further be operable for performing each additional function described above.
  • the term "computer" is to be understood to include at least one hardware processor that uses at least one memory.
  • the at least one memory may store a set of instructions.
  • the instructions may be either permanently or temporarily stored in the memory or memories of the computer.
  • the processor executes the instructions that are stored in the memory or memories in order to process data.
  • the set of instructions may include various instructions that perform a particular task or tasks, such as those tasks described herein. Such a set of instructions for performing a particular task may be characterized as a program, software program, or simply software.
  • the computer executes the instructions that are stored in the memory or memories to process data.
  • This processing of data may be in response to commands by a user or users of the computer, in response to previous processing, in response to a request by another computer and/or any other input, for example.
  • the computer used to at least partially implement embodiments may be a general purpose computer.
  • the computer may also utilize any of a wide variety of other technologies including a special purpose computer, a computer system including a microcomputer, mini-computer or mainframe for example, a programmed microprocessor, a micro-controller, a peripheral integrated circuit element, a CSIC (Customer Specific Integrated Circuit) or ASIC (Application Specific Integrated Circuit) or other integrated circuit, a logic circuit, a digital signal processor, a programmable logic device such as a FPGA, PLD, PLA or PAL, or any other device or arrangement of devices that is capable of implementing at least some of the steps of the processes of the invention.
  • each of the processors and/or the memories of the computer may be located in geographically distinct locations and connected so as to communicate in any suitable manner.
  • each of the processor and/or the memory may be composed of different physical pieces of equipment. Accordingly, it is not necessary that the processor be one single piece of equipment in one location and that the memory be another single piece of equipment in another location. That is, it is contemplated, for example, that the processor may be two or more pieces of equipment in two different physical locations. The two or more distinct pieces of equipment may be connected in any suitable manner, such as a network. Additionally, the memory may include two or more portions of memory in two or more physical locations.
  • Various technologies may be used to provide communication between the various computers, processors and/or memories, as well as to allow the processors and/or the memories of the invention to communicate with any other entity; e.g., so as to obtain further instructions or to access and use remote memory stores, for example.
  • Such technologies used to provide such communication might include a network, the Internet, Intranet, Extranet, LAN, an Ethernet, or any client server system that provides communication, for example.
  • Such communications technologies may use any suitable protocol such as TCP/IP, UDP, or OSI, for example.
  • a user interface may be in the form of a dialogue screen.
  • a user interface may also include any of a mouse, touch screen, keyboard, voice reader, voice recognizer, dialogue screen, menu box, list, checkbox, toggle switch, a pushbutton or any other device that allows a user to receive information regarding the operation of the computer as it processes a set of instructions and/or provide the computer with information.
  • a user interface is any device that provides communication between a user and a computer. The information provided by the user to the computer through the user interface may be in the form of a command, a selection of data, or some other input, for example.
  • a user interface of the invention might interact, e.g., convey and receive information, with another computer, rather than a human user. Accordingly, the other computer might be characterized as a user. Further, it is contemplated that a user interface utilized in the system and method of the invention may interact partially with another computer or computers, while also interacting partially with a human user.
  • Example 1 Data Collection and Analysis Using a Support Vector Machine
  • lung cancer is meant to encompass those lung cancers which are known to be non-small celled lung cancers.
  • the research, methodology, and data obtained are described below and presented in WO/2010/105235, which is hereby incorporated by reference in its entirety.
  • CD40
  • HGF: Hepatocyte Growth Factor
  • I-TAC: CXCL11; chemokine (C-X-C motif) ligand 11; "interferon-inducible T-cell alpha chemoattractant"
  • Leptin
  • MMP: Matrix Metalloproteinase
  • EGF: Epidermal Growth Factor
  • Eotaxin: CCL11
  • Fractalkine
  • G-CSF: Granulocyte Colony Stimulating Factor
  • GM-CSF: Granulocyte Macrophage Colony Stimulating Factor
  • IFN ⁇: Interferon ⁇
  • IL: Interleukin
  • Plasma specimens for each of the normal, asthma and lung cancer populations were screened for each of the fifty-nine biomarkers by subjecting the plasma specimens to analysis using Luminex's xMAP technology, a quantitative multiplexed immunoassay using automated bead-based technologies.
  • a Panomics' Procarta Cytokine kit (Cat# PC1017) was also used. Antibodies for PAI-1 and Leptin were used from two different kits. Antibodies for PAI-1 A and Leptin 1 were produced by Millipore. Antibodies for PAI-1 B were produced by Panomics.
  • the fluorescence intensity levels resulting from the multiplexed immunoassay were recorded as biomarker measures for each of the fifty-nine biomarkers for each plasma specimen for each population.
  • the recorded fluorescence intensity is proportional to the concentration of the corresponding biomarker in the sample, and also proportional to the extent of its expression in the individual at the time that the sample was collected. Averages, standard deviations, and relative standard deviations for fluorescence intensity level associated with each biomarker for each population were calculated.
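  • As an illustration only, the per-population summary statistics and the percent change between population means could be computed as in the following sketch. The function names and the intensity values are hypothetical; the use of the sample (n - 1) standard deviation and the choice of the second population's mean as the reference for the percent change are assumptions, since the text does not specify either.

```python
from statistics import mean, stdev


def summarize(intensities):
    """Return (mean, standard deviation, relative standard deviation in %)."""
    m = mean(intensities)
    sd = stdev(intensities)  # assumption: sample SD with an n - 1 denominator
    rsd = 100.0 * sd / m
    return m, sd, rsd


def percent_change(mean_a, mean_b):
    """Percent change of population A's mean relative to population B's mean."""
    return 100.0 * (mean_a - mean_b) / mean_b


# Hypothetical fluorescence intensities for one biomarker (not study data).
normal = [100.0, 110.0, 90.0, 100.0]
cancer = [150.0, 160.0, 140.0, 150.0]

m_no, sd_no, rsd_no = summarize(normal)
m_lc, sd_lc, rsd_lc = summarize(cancer)
change_lc_vs_no = percent_change(m_lc, m_no)  # LC v. NO
```

  • The same two helpers could be applied per biomarker and per gender subgroup to produce the quantities plotted in the figures described here.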
  • Figures 1A through 1C show the mean, standard deviation and relative standard deviation for each biomarker measure in the normal (NO), non-small cell lung cancer (LC), and asthma (AST) populations, while Figure 1D shows the average change in the levels of particular biomarker measures between any two of these populations.
  • FIGS 2A-2C show the average fluorescence intensity level of the biomarkers in the normal (NO), non-small cell lung cancer (LC), and asthma (AST) female population.
  • FIGURE 2D shows the percent change in the mean of each of the biomarker measures in the AST v. NO female populations, LC v. NO female populations, and the AST v. LC female populations.
  • The same information with respect to the male population is shown in FIGURES 3A-3D.
  • FIGURE 4 shows the percent change in the mean of each of the biomarker measures in the AST male population compared to the AST female population, the LC male population compared to the LC female population, and the NO male population compared to the NO female population.
  • the data from the Luminex assays was stored electronically in a data storage device with fluorescence intensity data for each biomarker in a particular patient's sample identified with the empirical classification of that patient, based on the physician's diagnosis.
  • the data was pre-processed to make it suitable for use in the model selection algorithm and the support vector machine.
  • the data was randomly partitioned into two groups: a training set and a validation set.
  • the support vector machine algorithm was run for the training data set to generate a model.
  • the data from the validation data set was post-processed through the model generated in the previous step to calculate predicted classification.
  • Predicted classifications were compared to the empirical classifications of the test set samples to calculate measures of model fit such as accuracy, sensitivity, specificity, positive predictive value and negative predictive value, where sensitivity is the probability that a subject is predicted as diseased given that the subject is diseased, specificity is the probability that a subject is predicted as not diseased given that the subject is not diseased, positive predictive value is the probability that a subject is diseased given that the subject is predicted as diseased, negative predictive value is the probability that a subject is not diseased given that the subject is predicted as not diseased, and accuracy is the probability of a correct prediction.
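  • The model fit measures defined above follow directly from the four cells of a two-by-two comparison of predicted and empirical classifications. The sketch below is a minimal illustration with hypothetical labels and function names; it is not the code used in the study.

```python
def confusion_counts(empirical, predicted, positive="cancer"):
    """Count true/false positives and negatives for paired label lists."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(empirical, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def fit_measures(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV and accuracy as defined in the text."""
    return {
        "sensitivity": tp / (tp + fn),  # P(predicted diseased | diseased)
        "specificity": tn / (tn + fp),  # P(predicted not diseased | not diseased)
        "ppv": tp / (tp + fp),          # P(diseased | predicted diseased)
        "npv": tn / (tn + fn),          # P(not diseased | predicted not diseased)
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical validation-set labels (not study data).
empirical = ["cancer", "cancer", "cancer", "normal",
             "normal", "normal", "normal", "cancer"]
predicted = ["cancer", "cancer", "normal", "normal",
             "normal", "cancer", "normal", "cancer"]

counts = confusion_counts(empirical, predicted)
measures = fit_measures(*counts)
```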
  • the support vector machine was also fit on the data set from Example 1 with asthmatic subjects excluded. Steps 1-5 were carried out, as described in Example 1, for the data set consisting only of cancerous and non-diseased subjects.
  • the resulting support vector machine had a sensitivity of 0.92 (SE: 0.016) and a specificity of 0.92 (SE: 0.015) (see Tables 3, 4).
  • Negative Predictive Value 0.9 (0.017) (0.84, 0.94)
  • The data collected in Example 1 from the Luminex assays was again analyzed using steps 1-5 described in Example 1. Data from individual samples was randomly assigned to a new training set and test set. The training set had 398 subjects and the test set had 389 subjects.
  • Cancer 1 64 29 1 93 0.93 iO.OQS) 0.78 0.024) ormal
  • Tata 36S 1 30 236 All 59 biomarkers; females only
  • Examples 1-6 relate to models that include 59 biomarkers. As discussed herein, the number of biomarkers may be reduced without significantly reducing the accuracy of the prediction by using a selection algorithm. A biomarker selection algorithm was run to find the biomarkers to be used in the support vector machine.
  • a 4 biomarker model (EGF, sCD40 ligand, IL-8, and MMP-8) was selected to characterize two of the lung pathology categories (Cancer, Normal).
  • Data from Example 1 was processed according to the five step protocol, except that step 2 pre-processing included excluding all biomeasures other than the four biomarkers chosen by the selection algorithm.
  • the model fit measures showed an accuracy of 95%, sensitivity of 93%, and a specificity of 87%, as described below.
  • The process of limiting biomarkers as described in Example 7 was applied to a subset of data from Example 1 which only contained values for male patients.
  • a 5 biomarker model (EGF, IL-8, sFas, MMP-9, and PAI-1) was selected to characterize two of the lung pathology categories (Cancer, Normal) in males, with an accuracy of 100%, sensitivity of 100%, and a specificity of 100%, as shown below.
  • The process of limiting biomarkers as described in Example 7 was applied to a subset of data from Example 1 which only contained values for female patients.
  • a 3 biomarker model (EGF, sCD40 ligand, IL-8) was selected to characterize two of the lung pathology categories (Cancer, Normal) in females, with an accuracy of 100%, sensitivity of 100%, and a specificity of 100%, as shown below.
  • the data received were raw biomarker concentrations output from the Luminex as described for Example 1.
  • the data output from the Luminex contained fluorescence levels, numbers of events, aggregated fluorescence levels, trimmed fluorescence levels, normalized biomarker concentrations, aggregated normalized biomarker concentrations and trimmed biomarker concentrations.
  • normalized biomarker concentrations were used. Examination of the protein quantifications showed that samples were roughly matched in terms of the total amount of protein and therefore it was unnecessary to normalize the biomarker levels further.
  • Biomarker quantification data was collected for each of the following 86 biomarkers: Brain Derived Neurotrophic Factor ("BDNF”), B Lymphocyte Chemoattractant (“BLC”), Cutaneous T-cell Attracting Chemokine (“CTACK”), Eotaxin-2, Eotaxin-3, Granzyme-B, Hepatocyte Growth Factor (“HGF”), I-TAC ("CXCL11”; “chemokine (C-X-C motif) ligand 11," “interferon-inducible T-cell alpha chemoattractant”), Leptin (“LEP”), Leukemia Inhibiting Factor (“LIF”), Macrophage colony-stimulating factor (“MCSF”), Monokine induced by gamma interferon (“MIG”), Macrophage Inflammatory Protein-3 a (“MIP-3a”), Nerve Growth Factor p("NGF-p”),
  • Here, "normalized" means transformed from the observed fluorescence to a concentration by matching the observed fluorescence to a concentration on the standard curve.
  • Soluble Ligand ("CD40 Ligand"), Epidermal Growth Factor ("EGF"), Eotaxin ("CCL11"), Fractalkine, Fibroblast Growth Factor Basic ("FGF-basic"), Granulocyte Colony Stimulating Factor ("G-CSF"), Granulocyte Macrophage Colony Stimulating Factor ("GM-CSF"), Interferon ⁇ ("IFN ⁇ "), IFN- ⁇ , IFN-α2, IFN- ⁇ , Interleukin ("IL") 1a, IL- ⁇ , IL-1ra, IL-2, IL-2ra, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12(p40), IL-12(p70), IL-13, IL-15, IL-16, IL-17, IL-17a, IL-17F, IL-20, IL
  • Biomarker concentrations above the upper limit of detection were set equal to the upper limit of detection.
  • Biomarker concentrations below the lower limit of detection were set equal to the lower limit of detection and divided by the square root of two.
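  • The two substitution rules just described can be written as a single function. This is a minimal sketch: the function name and the limits in the example are hypothetical, and leaving values exactly at a limit unchanged is an assumption.

```python
import math


def apply_detection_limits(value, lower, upper):
    """Substitute out-of-range concentrations as described in the text.

    Above the upper limit: set to the upper limit.
    Below the lower limit: set to the lower limit divided by sqrt(2).
    """
    if value > upper:
        return upper
    if value < lower:
        return lower / math.sqrt(2)
    return value  # within limits (or exactly at a limit): unchanged


# Hypothetical limits of detection for one biomarker (not study data).
LOWER, UPPER = 10.0, 1000.0
raw = [5000.0, 3.0, 50.0]
cleaned = [apply_detection_limits(v, LOWER, UPPER) for v in raw]
```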
  • This solution is ad hoc and may not yield an unbiased estimate of the true biomarker distribution. It has the effect of creating a point mass in the distribution of values of the biomarker at the upper or lower limit of detection, as appropriate. Since SVMs are non-parametric and AdaBoost is based on a series of trees, the above-mentioned drawbacks of this ad hoc solution do not apply. Gender, age, and smoking were included in every classification model.
  • Results
  • a lung pathology category, y (NSCLC, Normal), and an 86-tuple of continuously distributed biomarkers, x, were available for each of 544 subjects (Cancer: 180, No Cancer: 364) run in triplicate (1634 samples total, Cancer: 546, No Cancer: 1088).
  • the data (y, x) for a sample is referred to as an observation.
  • Models
  • In the current study, Phase 3a used an SVM and AdaBoost. The results presented herein are for models that use all biomarkers and demographic information (544 subjects, 1634 samples with 3 samples per subject, and 86 biomarkers). Subsets and models containing only biomarkers or subsets of the entire panel of biomarkers were also considered. AdaBoost outperformed the SVM and was therefore explored more extensively.
  • Model performance can be determined by either examining the model's predictions for samples in the validation set or by aggregating sample predictions on the subject level. To aggregate the sample level predictions, a subject was predicted as having cancer if one sample from them was predicted as having cancer. There are other methods to aggregate the data, but in this Example, a method that maximized sensitivity (also known as the true positive rate) and specificity (1-the false positive rate) was chosen.
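  • The subject-level aggregation rule described above (a subject is predicted as having cancer if any one of that subject's samples is predicted as cancer) could be sketched as follows. The function name, subject identifiers and predictions are hypothetical.

```python
from collections import defaultdict


def aggregate_by_subject(sample_predictions):
    """Collapse (subject_id, predicted_cancer) pairs to one call per subject.

    A subject is called positive if at least one sample is positive.
    """
    by_subject = defaultdict(bool)  # every subject starts as False (no cancer)
    for subject_id, predicted_cancer in sample_predictions:
        by_subject[subject_id] = by_subject[subject_id] or predicted_cancer
    return dict(by_subject)


# Hypothetical triplicate predictions for two subjects (not study data).
sample_predictions = [
    ("S1", False), ("S1", True), ("S1", False),
    ("S2", False), ("S2", False), ("S2", False),
]
subject_calls = aggregate_by_subject(sample_predictions)
```

  • An any-positive rule of this kind favors sensitivity; a majority rule over the three replicates would be one of the other aggregation methods the text alludes to.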
  • All but one biomarker (IP-10) exhibited significant variation. Biomarker contrasts with regard to gender on a per sample basis showed that 22 biomarkers exhibited significant variation (Adiponectin, IL.27, IL.2ra, IL.31, LIF, MPO, PIGF, SCF, sE selectin, sFas.ligand, TNFR.II, ENA.78, Eotaxin, Fractalkine, GCSF, GM.CSF, IL.15, I.TAC, Leptin, MIP.1b, Resistin, IL.21).
  • Table 8 Model Performance Table - AdaBoost - All Subjects - By Sample
  • Table 12 Model Performance Table - AdaBoost - Females Only - By Sample
  • Table 13 Model Performance Table - AdaBoost - Females Only - By Sample
  • Table 14 Model Performance Table - SVM - By Subject
  • Table 16 Model Performance Table - AdaBoost - All Subjects - By Subject
  • Table 17 Model Performance Statistics - AdaBoost - All Subjects - By Subject
  • Table 20 Model Performance Table - AdaBoost - Females Only - By Subject
  • Table 21 Model Performance Table - AdaBoost - Females Only - By Subject
  • the Receiver Operating Characteristic (ROC) curve and the area under the curve (AUC) are shown in Figures 1 and 2 for AdaBoost and SVM; the AdaBoost AUC is 0.98 and the SVM AUC is 0.96.
  • the AdaBoost ROC curves for males and for females are shown in Figures 3 and 4.
  • the AUC for males is 0.98 and the AUC for females is 0.95.
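  • For readers unfamiliar with the AUC, it equals the probability that a randomly chosen diseased sample receives a higher classifier score than a randomly chosen non-diseased sample. The sketch below computes it by direct pairwise comparison; the function name, scores and labels are hypothetical, and this is not necessarily how the reported values were obtained.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores; label 1 = cancer, 0 = normal (not study data).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)  # 8 of the 9 positive/negative pairs are ordered correctly
```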
  • the AdaBoost variable importance plot is shown in Figure 5; the three most important variables in the AdaBoost model are CTACK, MCSF, and Eotaxin.3.
  • the AdaBoost variable importance plot with restriction to males is shown in Figure 6; the three most important variables were MCSF, CTACK, and Eotaxin.3.
  • the AdaBoost variable importance plot with restriction to females is shown in Figure 7; the three most important variables were MCSF, FGF.basic, and CTACK.
  • an AdaBoost model was constructed as described in Example 10 for both females and males in six configurations using normal smokers and normal previous smokers.
  • the 10 most "important" biomarkers were extracted from six Variable Importance Plots generated for AdaBoost models created for each configuration using algorithms of the statistical software programming language and environment of the R Project. These six lists of 10 biomarkers were then combined manually into a single list containing 15 unique biomarkers. To those 15 unique biomarkers, Smoking Status and Age were added to produce particularly preferred models. (Subjects with unknown smoking status were removed from the database prior to analysis.)
  • NSCLC were: MCSF, Eotaxin-3, CTACK, TNF-b, FGF-basic, Survivin, MIG, TNFR1, IL-23p19 and IL12-p40.
  • the top 10 biomarkers that differentiate female normal smokers from females with NSCLC were: MCSF, TNF-b, CTACK, Eotaxin-3, Survivin, BDNF, MIF, TNFR1, FGF-basic and Leptin.
  • the top 10 biomarkers that differentiate male normal previous smokers from males with NSCLC were: MCSF, FGF-basic, CTACK, Eotaxin-3, TNF-b, IL-2ra,
  • females with NSCLC were: MCSF, Eotaxin-3, TNF-b, CTACK, FGF-basic, Survivin, IL-2ra, Adiponectin, BDNF and BLC.
  • the top 10 biomarkers that differentiate male normal non-smokers from males with NSCLC were: CTACK, MCSF, Eotaxin-3, FGF-basic, TNF-b, IL-2ra, IL-8, MIP-1b, MIG and PAI-1.
  • the top 10 biomarkers that differentiate female normal non-smokers from females with NSCLC were: Eotaxin-3, CTACK, MCSF, TNF-b, FGF-basic, Survivin, IL-2ra, IL-8, BDNF and sICAM.
  • the six lists were combined to create a preferred classifier based on 15 unique biomarkers. Priority was given to the lists of biomarkers that differentiate between normal smokers and subjects with NSCLC.
  • the list of biomarkers to be used in the preferred classifier is: MCSF, Eotaxin-3, CTACK, TNF-b, FGF-basic, Survivin, TNFR1, BDNF, MIG, MIF, IL-23p19, IL12-p40, Leptin, IL-2ra and IL-8. Any of these sets of biomarkers may be used as all or part of a reduced biomarker set for any of the classifiers described herein.
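  • The text says the lists were combined manually, with priority given to the smoker comparisons. A priority-ordered union of ranked lists can be sketched as below; this is an illustration of the idea, not the procedure actually used. The function name is hypothetical, and treating the first list above as the male counterpart of the female normal-smoker list is an assumption.

```python
def ordered_union(*ranked_lists):
    """Merge ranked lists in priority order, keeping the first occurrence of each name."""
    seen = set()
    merged = []
    for ranked in ranked_lists:
        for name in ranked:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# The first two top-10 lists given above (highest priority in the text).
smokers_list_1 = ["MCSF", "Eotaxin-3", "CTACK", "TNF-b", "FGF-basic",
                  "Survivin", "MIG", "TNFR1", "IL-23p19", "IL12-p40"]
smokers_list_2 = ["MCSF", "TNF-b", "CTACK", "Eotaxin-3", "Survivin",
                  "BDNF", "MIF", "TNFR1", "FGF-basic", "Leptin"]

merged = ordered_union(smokers_list_1, smokers_list_2)  # 13 unique names
```

  • Merging only these two lists yields 13 unique biomarkers, all of which appear in the preferred 15-biomarker list; the remaining two entries of that list, IL-2ra and IL-8, appear in the other comparisons.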
  • Example 12 Diagnostic Test for Non-Small Cell Lung Cancer
  • a sample of a biological fluid is obtained from a patient for whom diagnostic information is desired.
  • the sample is preferably blood serum or plasma.
  • the concentration in the sample of each of the biomarkers from any one of Examples 1-11 is determined.
  • the measured concentration of each biomarker from the sample is inputted into an equation determined using training data in a support vector machine. If the value determined by the equation is positive, it is indicative of non-small cell lung cancer, and if the value is negative, it indicates an absence of non-small cell lung cancer.
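  • The text does not give the equation itself. For a support vector machine with a linear kernel it takes the form f(x) = w·x + b, and the sign of f(x) gives the class. The sketch below assumes a linear kernel; the weights, bias and measurements are hypothetical placeholders, and treating a value of exactly zero as negative is an assumption.

```python
def decision_value(weights, bias, measures):
    """Linear SVM decision function f(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, measures)) + bias


def interpret(value):
    """Map the sign of the decision value to the outcome described in the text."""
    if value > 0:
        return "indicative of NSCLC"
    return "absence of NSCLC"  # assumption: zero is treated as negative


# Hypothetical trained parameters and two hypothetical patient samples.
weights = [0.5, -0.25, 1.0]
bias = -1.0
sample_a = [2.0, 4.0, 1.5]
sample_b = [1.0, 2.0, 0.5]

value_a = decision_value(weights, bias, sample_a)
value_b = decision_value(weights, bias, sample_b)
```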
  • a sample of a biological fluid is obtained from a male patient for whom diagnostic information is desired.
  • the sample is preferably blood serum or plasma.
  • the concentration in the sample of each of the biomarkers from any one of Examples 1-5, 7-8, 10 or 11 is determined.
  • the measured concentration of each biomarker from the sample is inputted into an equation determined using training data in a support vector machine. If the value determined by the equation is positive, it is indicative of non-small cell lung cancer, and if the value is negative, it indicates an absence of non-small cell lung cancer.
  • Example 14 Alternative Test for Non-small Cell Lung Cancer in a Male Subject
  • biomarkers described herein participate in communications pathways of the sort described above. Some of the biomarkers are related to each other as first order interactors. Selection of markers for use in a diagnostic or prognostic assay may be facilitated using known relationships between particular biomarkers and their first order interactors. The known communication relationships between HGF (Hepatocyte Growth Factor) and other biomarkers can be seen in Figure 5, generated by the ARIADNE PATHWAY STUDIO®.
  • Figure 5 shows that first order interactors of HGF (Hepatocyte Growth Factor) include sFasL (soluble Fas ligand), PAI-1 (serpin Plasminogen Activator Inhibitor 1) (active/total), Ins (Insulin; which also includes C-peptide), EGF (Epidermal Growth Factor), MPO (Myeloperoxidase), and MIF (Macrophage Migration Inhibitory Factor).
  • a sample of a biological fluid is obtained from a patient for whom diagnostic information is desired.
  • the sample is preferably blood serum or plasma.
  • the concentration in the sample of only selected biomarkers is determined. Assuming HGF is one of the biomarkers selected for use in the support vector machine, then the concentration of any first order interactors of HGF
  • the support vector machine is re-run for training data with the first order interactor substituted for HGF. This model is then applied to the patient sample. If the value determined by the equation is positive, it is indicative of non-small cell lung cancer, and if the value is negative, it indicates an absence of non-small cell lung cancer.
  • a sample of a biological fluid is obtained from a patient for whom diagnostic information is desired.
  • the sample is preferably blood serum or plasma.
  • the concentration in the sample of biomarkers from any one of Examples 1-11 is determined.
  • the measured concentration of each biomarker from the sample is inputted into an equation determined using training data in a support vector machine. If the value determined by the equation is positive, it is indicative of non-small cell lung cancer, and if the value is negative, it indicates an absence of non-small cell lung cancer.
  • the concentration in the sample of biomarkers from any one of Examples 1-11 is then determined.
  • the measured concentration of each biomarker from the sample is inputted into an equation determined using training data in a support vector machine. If the value determined by the equation is positive, it is indicative of reactive airway disease, and if the value is negative, it indicates an absence of reactive airway disease.
  • the concentration in the sample of biomarkers from any one of Examples 1-10 is then determined.
  • the measured concentration of each biomarker from the sample is inputted into an equation determined using training data in a support vector machine. If the value determined by the equation is positive, it is indicative of non-small cell lung cancer, and if the value is negative, it indicates reactive airway disease.
  • results are further evaluated by analyzing the positive and negative scores.
  • the determination of whether the patient has non-small cell lung cancer, reactive airway disease, or an absence of disease depends on which condition is found in two of the three scores. For example, if the first and third tests are positive, then the patient may be diagnosed as having non-small cell lung cancer. If the first and second tests are negative, then the patient may be diagnosed as not having non-small cell lung cancer or reactive airway disease.
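  • The two-of-three rule can be sketched as below. The text spells out only the NSCLC case (first and third tests positive) and the no-disease case (first and second tests negative); the reactive airway disease case (second test positive, third test negative) and the "indeterminate" result for the remaining combinations are assumptions made here for completeness. The function name is hypothetical.

```python
def combine_three_tests(nsclc_value, rad_value, nsclc_vs_rad_value):
    """Combine three SVM decision values into one call.

    nsclc_value:        positive = NSCLC, negative = absence of NSCLC
    rad_value:          positive = reactive airway disease (RAD), negative = absence
    nsclc_vs_rad_value: positive = NSCLC, negative = RAD
    """
    if nsclc_value > 0 and nsclc_vs_rad_value > 0:
        return "NSCLC"
    if rad_value > 0 and nsclc_vs_rad_value < 0:
        return "RAD"  # assumption: mirrors the NSCLC rule
    if nsclc_value < 0 and rad_value < 0:
        return "no disease"
    return "indeterminate"  # assumption: no condition is found in two scores


# Hypothetical decision values for four patients.
call_1 = combine_three_tests(1.2, -0.4, 0.8)
call_2 = combine_three_tests(-0.3, 0.9, -1.1)
call_3 = combine_three_tests(-0.7, -0.2, 0.5)
call_4 = combine_three_tests(0.6, -0.2, -0.5)
```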
  • Ensemble classifiers that incorporate both an AdaBoost classifier and a Support Vector Machine (SVM) classifier are a useful way to combine a set of classifiers in order to enhance classification performance.
  • the desired outcome is a classification of cancer or normal.
  • Both the AdaBoost method and SVMs are statistically appropriate methods for classifying subjects into one of these two diagnosis groups.
  • the performance summary statistics of the ensemble classifier indicate that it is a good candidate for classifying samples as cancer or normal.
  • the preferred classifier relies on 15 different biomarkers. While the 15 biomarkers are useful for the current data, this study is to determine whether the model may be overfitting and whether a less complex model may be more appropriate for the general population. Preferably, the samples used to build the models for this invention should be representative of the intended use population in order to obtain the optimal results.
  • the biomarkers included the following: ApoA1, ApoA2, ApoB, ApoC2, ApoE, D-dimer, Factor-VII, Factor-VIII, Factor-X, Protein-C, TPA, MMP-1, MMP-3, MMP-7, MMP-8, MMP-9, MMP-12, MMP-13, Adiponectin, BDNF, BLC, CTACK, Eotaxin-2, Eotaxin-3, Granzyme-B, IFN-g, IL-1a, IL-1b, IL-1ra, IL-2, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-13, IL-15, IL-16, IL-17A, IL-17F, IL-20, IL-21, IL-22, IL-23p19, IL-27, IL-2ra, IL-3, IL-31, LIF, MIF, MPO, OPG, PIGF, SCF, s
  • Two different strategies were used to split the data: by sample or by entire subject.
  • In the by-sample approach, each sample was allowed to be in either dataset (837 samples from 480 subjects in the training data, and 797 samples from 470 subjects in the test data).
  • In the by-subject approach, all three samples for each subject were forced into either the training data or the test data (819 samples from 274 subjects in the training data, and 815 samples from 271 subjects in the test data). In both methods, however, all three samples were used individually (i.e., there was no averaging of the triplicate samples in order to obtain one overall vector of biomarker values per subject).
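  • The difference between the two splitting strategies can be made concrete with a short sketch. The function names, the subject identifiers and the equal-halves split are assumptions for illustration; the split sizes reported above were not exactly equal.

```python
import random


def split_by_sample(samples, rng):
    """Assign each sample independently; one subject may land in both sets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def split_by_subject(samples, rng):
    """Assign whole subjects; all replicates of a subject stay together."""
    subjects = sorted({subject for subject, _ in samples})
    rng.shuffle(subjects)
    train_ids = set(subjects[: len(subjects) // 2])
    train = [s for s in samples if s[0] in train_ids]
    test = [s for s in samples if s[0] not in train_ids]
    return train, test


# Hypothetical data: six subjects, three replicate samples each.
samples = [(f"S{i}", replicate) for i in range(1, 7) for replicate in range(3)]

sample_train, sample_test = split_by_sample(samples, random.Random(0))
subject_train, subject_test = split_by_subject(samples, random.Random(0))
```

  • Splitting by subject avoids the situation in which replicates of the same subject appear in both the training and test data, which can make test performance look better than it would on new subjects.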
  • CTACK, MCSF, FGF.basic, Survivin, Eotaxin, sICAM,
  • MCSF, Eotaxin.3, CTACK, FGF-basic, Survivin, IL23-p19, IL12-p40
  • a logistic regression model was fit to the training data to assign weights to each of the above biomarkers and demographic effects. Two models were developed: one based on the "by sample” method of splitting the data and one based on the "by subject” method of splitting the data (as discussed above).
  • Tables 22 and 23 display the results of the test data where subjects may have had representative samples in both training and test data. There were a total of 837 samples from 480 subjects in the training data, and 797 samples from 470 subjects in the test data.
  • Data were split into training and test data by sample. A subject may have had a sample appear in the training data and another sample appear in the test data.
  • the SE and 95% CI were computed using a normal approximation to the binomial distribution.
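  • The normal approximation to the binomial gives SE = sqrt(p(1 - p)/n) and a 95% interval of p ± 1.96·SE. A minimal sketch follows; the function name and the counts in the example are hypothetical.

```python
import math


def proportion_se_ci(successes, n, z=1.96):
    """Proportion, standard error and confidence interval (normal approximation)."""
    p = successes / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p, se, (p - z * se, p + z * se)


# Hypothetical example: 90 of 100 diseased samples correctly predicted.
p, se, (ci_low, ci_high) = proportion_se_ci(90, 100)
```

  • The approximation becomes unreliable when n is small or p is close to 0 or 1, in which case the interval can extend outside the range 0 to 1.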
  • Tables 24 and 25 display the results of the test data where all samples from a given subject were placed into the training data or test data. There were a total of 819 samples from 274 subjects in the training data, and 815 samples from 271 subjects in the test data.
  • Table 24 - Summary Statistics by Sample
  • the top twelve biomarkers found by the logistic selection methods and the lasso were also used to build the classification model (CTACK, MCSF, FGF-basic, Survivin, Eotaxin, sICAM, IP-10, Eotaxin-3, sE-selectin, sVCAM-1, IL12-p40, IL23-p19).
  • The effects of gender, age, and smoking status were also included in the model.
  • a logistic regression model was fit to the training data to assign weights to each of the above biomarkers and demographic effects. The following results are based on the method of splitting the data "by subject” (819 samples from 274 subjects in the training data, and 815 samples from 271 subjects in the test
  • Table 29 Performance of the methods in the saturated model in the test set (1000 simulations; 104 biomarkers + 3 demographic features).
  • Table 30 Performance of the methods in the reduced model in the test set (1000 simulations; 15 selected biomarkers + 3 demographic features).
  • the study used a simulation dividing the data 1000 times into training and testing sets with half of the subjects randomly assigned to the training set and the remainder assigned to the testing set.
  • Each method was then used to develop a model to discriminate between NSCLC and normal groups on the training set. The performance of each model was assessed using the test set.
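  • The simulation design (repeated random half-splits, with a model fit on the training half and scored on the test half) can be sketched as follows. The single-marker threshold rule is a deliberately simple stand-in for the methods compared in the study, and the data, function names and seed are hypothetical. Skipping a split whose training half lacks one of the classes is an assumption.

```python
import random
from statistics import mean


def fit_threshold(train):
    """Stand-in model: midpoint between the two class means of one marker."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    return (mean(positives) + mean(negatives)) / 2.0


def accuracy(threshold, test):
    """Fraction of test observations classified correctly."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in test)


def repeated_holdout(data, n_splits, seed=0):
    """Fit on a random half and score on the other half, n_splits times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        if len({y for _, y in train}) < 2:
            continue  # assumption: skip splits with a single class in training
        scores.append(accuracy(fit_threshold(train), test))
    return scores


# Hypothetical, perfectly separable data: (marker value, label), 1 = NSCLC.
data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
scores = repeated_holdout(data, n_splits=1000)
```

  • Summarizing `scores` by its mean and spread gives the kind of per-method performance distribution that the tables for the saturated and reduced models report.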
  • equations, formulas and relations contained in this disclosure are illustrative and representative and are not meant to be limiting. Alternate equations may be used to represent the same phenomena described by any given equation disclosed herein.
  • the equations disclosed herein may be modified by adding error-correction terms, higher-order terms, or otherwise accounting for inaccuracies, using different names for constants or variables, or using different expressions. Other modifications, substitutions, replacements, or alterations of the equations may be performed.

Abstract

The invention relates to biomarkers and combinations of biomarkers useful in diagnosing lung diseases, such as non-small cell lung cancer or reactive airway disease. Measurements of these biomarkers are input into a classification system, such as a support vector machine or AdaBoost, to help determine the likelihood that an individual has a lung disease. The invention also relates to kits comprising agents for detecting the biomarkers and combinations of biomarkers, as well as systems that facilitate the diagnosis of lung diseases.
PCT/US2014/063594 2013-10-31 2014-10-31 Methods of identification and diagnosis of lung diseases using classification systems and kits thereof WO2015066564A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361898306P 2013-10-31 2013-10-31
US61/898,306 2013-10-31

Publications (1)

Publication Number Publication Date
WO2015066564A1 (fr) 2015-05-07

Family

ID=53005237

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/063594 WO2015066564A1 (fr) 2013-10-31 2014-10-31 Methods of identification and diagnosis of lung diseases using classification systems and kits thereof

Country Status (1)

Country Link
WO (1) WO2015066564A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080025591A1 (en) * 2006-07-27 2008-01-31 International Business Machines Corporation Method and system for robust classification strategy for cancer detection from mass spectrometry data
US20100036782A1 (en) * 2006-09-22 2010-02-11 Koninklijke Philips Electronics N. V. Methods for feature selection using classifier ensemble based genetic algorithms
US20120101002A1 (en) * 2008-09-09 2012-04-26 Somalogic, Inc. Lung Cancer Biomarkers and Uses Thereof
WO2012149550A1 (fr) * 2011-04-29 2012-11-01 Cancer Prevention And Cure, Ltd. Methods of identification and diagnosis of lung diseases using classification systems and kits thereof

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11474104B2 (en) 2009-03-12 2022-10-18 Cancer Prevention And Cure, Ltd. Methods of identification, assessment, prevention and therapy of lung diseases and kits thereof including gender-based disease identification, assessment, prevention and therapy
EP3825693A1 (fr) * 2011-04-29 2021-05-26 Cancer Prevention And Cure, Ltd. Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
CN109036571A (zh) * 2014-12-08 2018-12-18 20/20基因系统股份有限公司 Methods and machine learning systems for predicting the likelihood or risk of having cancer
CN109036571B (zh) * 2014-12-08 2024-03-05 20/20基因系统股份有限公司 Methods and machine learning systems for predicting the likelihood or risk of having cancer
CN109564782B (zh) * 2016-08-08 2024-03-08 皇家飞利浦有限公司 Electronic clinical decision support device based on hospital demographics
CN109564782A (zh) * 2016-08-08 2019-04-02 皇家飞利浦有限公司 Electronic clinical decision support device based on hospital demographics
WO2018129414A1 (fr) * 2017-01-08 2018-07-12 The Henry M. Jackson Foundation For The Advancement Of Military Medicine, Inc. Systems and methods for using supervised learning to predict subject-specific pneumonia outcomes
CN106815643A (zh) * 2017-01-18 2017-06-09 中北大学 Infrared spectroscopy model transfer method based on random forest transfer learning
CN106815643B (zh) * 2017-01-18 2019-04-02 中北大学 Infrared spectroscopy model transfer method based on random forest transfer learning
US20190221316A1 (en) * 2017-04-04 2019-07-18 Lung Cancer Proteomics, Llc Plasma based protein profiling for early stage lung cancer prognosis
JP7250693B2 (ja) 2017-04-04 2023-04-03 ラング キャンサー プロテオミクス, エルエルシー Plasma-based protein profiling for early-stage lung cancer diagnosis
US11769596B2 (en) * 2017-04-04 2023-09-26 Lung Cancer Proteomics Llc Plasma based protein profiling for early stage lung cancer diagnosis
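CN110709936A (zh) * 2017-04-04 2020-01-17 肺癌蛋白质组学有限责任公司 Plasma-based protein profiling for early-stage lung cancer prognosis
EP3607089A4 (fr) * 2017-04-04 2020-12-30 Lung Cancer Proteomics, LLC Plasma-based protein profiling for early-stage lung cancer prognosis
JP2020515993A (ja) * 2017-04-04 2020-05-28 ラング キャンサー プロテオミクス, エルエルシー Plasma-based protein profiling for early-stage lung cancer diagnosis
CN109243606A (zh) * 2017-11-16 2019-01-18 福建金络康健康科技有限公司 Analysis system based on measurement data of bioelectric signals at human meridian acupoints
CN109409434A (zh) * 2018-02-05 2019-03-01 福州大学 Method for extracting liver disease data classification rules based on random forest
CN108763872B (zh) * 2018-04-25 2019-12-06 华中科技大学 Method for analyzing and predicting the effect of cancer mutations on LIR motif function
CN108763872A (zh) * 2018-04-25 2018-11-06 华中科技大学 Method for analyzing and predicting the effect of cancer mutations on LIR motif function
CN112970067A (zh) * 2018-06-30 2021-06-15 20/20基因系统股份有限公司 Cancer classifier models, machine learning systems, and methods of use
CN109726826B (zh) * 2018-12-19 2021-08-13 东软集团股份有限公司 Random forest training method, apparatus, storage medium, and electronic device
CN109726826A (zh) * 2018-12-19 2019-05-07 东软集团股份有限公司 Random forest training method, apparatus, storage medium, and electronic device
CN109829471B (zh) * 2018-12-19 2021-10-15 东软集团股份有限公司 Random forest training method, apparatus, storage medium, and electronic device
CN109829471A (zh) * 2018-12-19 2019-05-31 东软集团股份有限公司 Random forest training method, apparatus, storage medium, and electronic device
CN110706804A (zh) * 2019-08-23 2020-01-17 刘雷 Method for applying a hybrid expert system to lung adenocarcinoma classification
CN110706804B (zh) * 2019-08-23 2024-02-02 刘雷 Method for applying a hybrid expert system to lung adenocarcinoma classification
CN111326260A (zh) * 2020-01-09 2020-06-23 上海中科新生命生物科技有限公司 Medical analysis method, apparatus, device, and storage medium
CN111351942B (zh) * 2020-02-25 2024-03-26 北京尚医康华健康管理有限公司 Lung cancer tumor marker screening system and lung cancer risk analysis system
CN111351942A (zh) * 2020-02-25 2020-06-30 北京尚医康华健康管理有限公司 Lung cancer tumor marker screening system and lung cancer risk analysis system
CN111275707B (zh) * 2020-03-13 2023-08-25 北京深睿博联科技有限责任公司 Pneumonia lesion segmentation method and apparatus
CN111275707A (zh) * 2020-03-13 2020-06-12 北京深睿博联科技有限责任公司 Pneumonia lesion segmentation method and apparatus
CN112185571B (zh) * 2020-09-17 2024-01-16 吾征智能技术(北京)有限公司 Auxiliary disease diagnosis system, device, and storage medium based on sour taste in the mouth
CN112185571A (zh) * 2020-09-17 2021-01-05 吾征智能技术(北京)有限公司 Auxiliary disease diagnosis system, device, and storage medium based on sour taste in the mouth
CN113040711B (zh) * 2021-03-03 2023-08-18 吾征智能技术(北京)有限公司 Stroke onset risk prediction system, device, and storage medium
CN113040711A (zh) * 2021-03-03 2021-06-29 吾征智能技术(北京)有限公司 Stroke onset risk prediction system, device, and storage medium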
US11810644B2 (en) 2022-03-08 2023-11-07 Avalo, Inc. System and method for genomic association
WO2023172633A1 (fr) * 2022-03-08 2023-09-14 Avalo, Inc. System and method for genomic association
WO2023230617A3 (fr) * 2022-05-27 2024-01-25 Nonagen Bioscience Corporation Bladder cancer biomarkers and methods of use
CN115994629A (zh) * 2023-03-23 2023-04-21 南京信息工程大学 Air humidity prediction method and system based on GN-RBF
CN117689219A (zh) * 2024-02-04 2024-03-12 江西科技学院 Machine-learning-based sports equipment safety assessment system
CN117828488A (zh) * 2024-03-05 2024-04-05 华北电力大学 Solar irradiance prediction method based on random forest and robust regression
CN117828488B (zh) * 2024-03-05 2024-05-28 华北电力大学 Solar irradiance prediction method based on random forest and robust regression

Similar Documents

Publication Publication Date Title
AU2017245307B2 (en) Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
WO2015066564A1 (fr) Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
US11769596B2 (en) Plasma based protein profiling for early stage lung cancer diagnosis
JP7431760B2 (ja) Cancer classifier models, machine learning systems, and methods of use
Casserly et al. Multimarker panels in sepsis
US20230263477A1 (en) Universal pan cancer classifier models, machine learning systems and methods of use
US20230176058A1 (en) Markers for the early detection of colon cell proliferative disorders
WO2009015398A1 (fr) Methods for managing inflammatory diseases
US20230223145A1 (en) Methods and software systems to optimize and personalize the frequency of cancer screening blood tests

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14857548; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14857548; Country of ref document: EP; Kind code of ref document: A1)