WO2023230321A1 - Machine learning systems and methods for gene set enrichment analysis and scoring - Google Patents

Info

Publication number
WO2023230321A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
cancer
gene sets
treatment
cells
Prior art date
Application number
PCT/US2023/023681
Other languages
French (fr)
Inventor
Jon EARLS
Jeff HIKEN
Ian SCHILLEBEECKX
Original Assignee
Cofactor Genomics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cofactor Genomics, Inc.
Publication of WO2023230321A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Definitions

  • Cancer is a complex group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Millions of new cases of cancer occur globally each year. Understanding the immune and tumor profile may help with diagnosis and treatment.
  • the present disclosure discloses a method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition comprising obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
  • the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
  • the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
  • the plurality of gene sets comprises 1, 2, 3, 4, 5, or 6 gene sets listed in Table 1.
  • the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
  • the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
  • the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
  • the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
  • the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
  • the method disclosed herein further comprises obtaining the biological sample of said subject.
  • the biological sample is a solid tumor or liquid biopsy.
  • the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
  • the biological sample comprises cancer tissue.
  • the cancer tissue comprises tumor-infiltrating immune cells.
  • the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
  • the method disclosed herein further comprises processing said biological sample to prevent or inhibit tissue degradation.
  • the biological sample is processed into a formalin-fixed paraffin-embedded sample.
  • the method disclosed herein further comprises extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
  • the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
  • the disease or condition is cancer.
  • the cancer is a solid cancer or a hematopoietic cancer.
  • the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
  • the method disclosed herein further comprises selecting said subject for prediction of said treatment outcome based on said status.
  • the treatment outcome corresponds to one or more cancer treatments.
  • the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
  • the method disclosed herein further comprises selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
  • the subject is treated based at least on said determination indicative of said treatment outcome.
  • the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
  • a computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition comprising a processor and non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: (i) obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; (ii) conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; (iii) process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and (iv) generate a determination indicative of the treatment outcome based on the output.
  • the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
  • the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
  • the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
  • the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
  • the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
  • the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
  • the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
  • the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
  • the processor is configured to obtain the gene expression data for the biological sample of said subject from a database.
  • the biological sample is a solid tumor or liquid biopsy.
  • the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
  • the biological sample comprises cancer tissue.
  • the cancer tissue comprises tumor-infiltrating immune cells.
  • the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
  • the biological sample is processed to prevent or inhibit tissue degradation.
  • the biological sample is processed into a formalin-fixed paraffin-embedded sample.
  • the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data.
  • the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
  • the disease or condition is cancer.
  • the cancer is a solid cancer or a hematopoietic cancer.
  • the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
  • the subject is selected for prediction of said treatment outcome based on said status.
  • the treatment outcome corresponds to one or more cancer treatments.
  • the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
  • the subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
  • the subject is treated based at least on said determination indicative of said treatment outcome.
  • the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
  • a method for generating a trained machine learning model configured to generate a prediction of treatment outcome comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
  • the plurality of biological samples is obtained from said subjects prior to receiving said treatment and said subjects
  • the method further comprises configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
  • FIG. 1 shows a receiver operating characteristic (ROC) curve of false positive rate (FPR) vs. true positive rate (TPR) of a machine learning model trained on a gene set enrichment analysis (GSEA) training set for clinical outcome according to one or more embodiments herein;
  • FIG. 2 shows a graph of training samples across out of bag (OOB) samplings of a GSEA training set for clinical outcome according to one or more embodiments herein;
  • FIG. 3 shows a graph of a percentage of patients that had a response to treatment (disease control rate, DCR) per score division (quartile) of a GSEA training set for clinical outcome according to one or more embodiments herein;
  • FIG. 4 shows a non-limiting example of a computing device; in this case, a device with one or more processors, memory, storage, and a network interface; and
  • FIG. 5 shows a non-limiting example of a workflow for processing a biological sample and using gene set enrichment analysis and machine learning model to predict a response to therapy or a treatment outcome.
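The per-quartile response rates described for FIG. 3 can be sketched as follows. This is a minimal illustration that assumes each patient has a model score and a binary response flag; the function name and the example values are hypothetical and are not taken from the disclosure.

```python
def response_rate_by_quartile(scores, responded):
    """Percentage of responders within each score quartile (lowest scores first).

    scores:    one model score per patient
    responded: 1 if the patient responded to treatment, else 0
    Assumes at least four patients so that every quartile is non-empty.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]
        rates.append(100.0 * sum(responded[i] for i in members) / len(members))
    return rates


# Hypothetical scores and outcomes for eight patients.
example_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
example_responded = [0, 0, 0, 1, 1, 0, 1, 1]
quartile_rates = response_rate_by_quartile(example_scores, example_responded)
```

A response rate that rises from the lowest to the highest quartile is the pattern one would expect if the score carries information about treatment outcome.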
  • FIG. 6 shows ROC curves of false positive rate (1 - specificity) vs. true positive rate (sensitivity) for two models: a single-sample GSEA (ssGSEA) biomarker model and a PD-L1 biomarker model.
  • the ssGSEA model shows better performance on a Head and Neck Squamous Cell Carcinoma (HNSCC) dataset than does the clinically used PD-L1 biomarker model. The mean OOB prediction score of each sample used to train the model is used to build a single ROC curve with a single AUC value.
  • FIG. 1 and FIG. 6 use slightly different forms of the GSEA biomarker model based on the same dataset.
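The out-of-bag (OOB) scoring mentioned for FIG. 2 and FIG. 6 can be illustrated with a small bagging sketch. The disclosure does not state which base learner was used; the nearest-centroid scorer below is an assumption chosen only for brevity, and all names and data are hypothetical.

```python
import random


def _centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def _distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def oob_mean_scores(features, labels, n_models=200, seed=0):
    """Mean out-of-bag score per sample across bootstrap-trained models.

    Each model is fit on a bootstrap sample; a sample is scored only by
    models whose bootstrap did not contain it. Returns None for a sample
    that was never out of bag.
    """
    rng = random.Random(seed)
    n = len(features)
    totals, counts = [0.0] * n, [0] * n
    for _ in range(n_models):
        bag = [rng.randrange(n) for _ in range(n)]
        pos = [features[i] for i in bag if labels[i] == 1]
        neg = [features[i] for i in bag if labels[i] == 0]
        if not pos or not neg:
            continue  # this bootstrap lacks one class; skip it
        c_pos, c_neg = _centroid(pos), _centroid(neg)
        for i in set(range(n)) - set(bag):
            d_pos = _distance(features[i], c_pos)
            d_neg = _distance(features[i], c_neg)
            total = d_pos + d_neg
            # Score approaches 1 when the sample sits near the responder centroid.
            totals[i] += 0.5 if total == 0 else d_neg / total
            counts[i] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Hypothetical enrichment scores for two gene sets; label 1 = responder.
train_x = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.3], [0.2, 0.6], [0.1, 0.7], [0.3, 0.5]]
train_y = [1, 1, 1, 0, 0, 0]
oob_scores = oob_mean_scores(train_x, train_y)
```

Because every sample is scored only by models that never saw it, the per-sample means can be used to draw a single ROC curve without a separate hold-out set.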
  • Machine learning models can be trained and used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.
  • biomarker genes may be quantified directly and combined with immune cell information to make up a feature set for statistical analysis.
  • the instant disclosure includes the discovery that a computationally simpler and more coherent approach using gene set enrichment analysis can provide accurate predictions without relying on such algorithms for quantifying immune cells within a sample.
  • gene set enrichment scores may be directly used as features in a machine learning model to predict treatment outcome (e.g., response to immunotherapy) without going through an unnecessary intermediate step of deconvolving gene expression data to quantify immune cells and then using the quantified numbers as input features.
  • the systems and methods disclosed herein can provide highly accurate evaluations or determinations indicative of an outcome.
  • performance metrics include accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and receiver operating characteristic/area under the receiver operating characteristic curve (ROC/AUROC). Any combination of these metrics may be determined for a machine learning model or classifier by testing it against a set of independent samples.
  • True positive (TP) is a positive test result that detects the condition when the condition is present (e.g., positive response to cancer treatment).
  • True negative (TN) is a negative test result that does not detect the condition when the condition is absent.
  • False positive (FP) is a test result that detects the condition when the condition is absent.
  • False negative (FN) is a test result that does not detect the condition when the condition is present.
  • the performance metrics of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value can then be defined according to the following formulas:
  • the AUROC, which varies between 0 and 1, can be determined by constructing the ROC curve, which entails plotting the true positive rate (TPR) against the false positive rate (FPR).
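The standard definitions of these metrics in terms of TP, TN, FP, and FN can be sketched as follows. The function names and example counts are hypothetical; the AUROC is computed here with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics as fractions in [0, 1].

    Assumes every denominator is non-zero (both classes present and both
    kinds of prediction made).
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate (TPR)
        "specificity": tn / (tn + fp),  # equals 1 - false positive rate (FPR)
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


def auroc(scores, labels):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = confusion_metrics(tp=40, tn=30, fp=10, fn=20)
example_auc = auroc([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 1, 0, 0])
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.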
  • a sample may be evaluated according to the systems and methods disclosed herein to generate an evaluation or determination, such as a prediction of treatment outcome, that provides a minimum threshold of performance.
  • the analytical algorithm or module e.g., comprising a machine learning model
  • the analytical algorithm or module has an accuracy of at least about 50%, 60%, 70%, 80%, 90%, or 95%.
  • the analytical algorithm or module has a specificity of at least about 50%, 60%, 70%, 80%, 90%, or 95%.
  • the analytical algorithm or module has a sensitivity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a PPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an NPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an AUROC of at least about 0.6, 0.7, 0.8, 0.85, 0.9, or 0.95 or higher.
  • the methods disclosed herein comprise processing a biological sample to obtain gene expression data and performing gene set enrichment analysis on the gene expression data to generate an evaluation or prediction of outcome.
  • a biological sample is processed to extract RNA 501.
  • the biological sample may be a formalin-fixed paraffin-embedded (FFPE) sample.
  • the extracted RNA is used to generate an mRNA-Seq library 502.
  • Various suitable methods may be used for library generation, including commercial kits such as, for example, the QuantSeq 3’ mRNA-Seq library prep kit.
  • Next Generation Sequencing is then performed on the library 503.
  • Various suitable platforms may be used for the sequencing, for example, the NextSeq platform by Illumina.
  • Gene Set Enrichment Analysis is performed on the gene expression data generated from the sequencing 504.
  • Various gene sets may be used, including independently curated gene sets as well as gene sets from public databases such as, for example, MSigDB.
  • the gene sets can be derived from various collections such as hallmark gene sets, positional gene sets, curated gene sets, chemical and genetic perturbations, canonical pathways, regulatory target gene sets, microRNA targets, transcription factor targets, computational gene sets, cancer gene neighborhoods, cancer modules, ontology gene sets, Gene Ontology derived gene sets, oncogenic signature gene sets, immunologic signature gene sets, cell type signature gene sets, or any combination thereof.
  • Subsets of the canonical pathways gene sets include gene sets derived from BioCarta pathway database, KEGG pathway database, PID pathway database, Reactome pathway database, and WikiPathways pathway database.
  • the ssGSEA can be used to generate an output corresponding to the gene sets that have been evaluated using the gene expression data.
  • the output can be a metric or a score, for example, an enrichment score for each gene set.
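How a single-sample enrichment score can be computed is sketched below. This is a simplified ssGSEA-style calculation, not the implementation used in the disclosure: genes are ordered by expression within one sample, and the score sums the difference between a rank-weighted running fraction of gene-set members and the running fraction of non-members. The weighting exponent `alpha=0.25` and all gene names are assumptions made for illustration.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one gene set.

    expression: dict mapping gene name -> expression value for ONE sample
    gene_set:   collection of gene names belonging to the set
    """
    # Highest-expressed gene first; it receives rank n, the lowest receives rank 1.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    in_set = [gene in gene_set for gene in ordered]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must overlap the sample without covering it entirely")

    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_hits)

    score = p_hit = p_miss = 0.0
    for weight, hit in zip(weights, in_set):
        if hit:
            p_hit += weight / total_weight   # weighted running fraction of set genes
        else:
            p_miss += miss_step              # running fraction of non-set genes
        score += p_hit - p_miss              # accumulate the difference at each rank
    return score


# Hypothetical expression values for five placeholder genes in one sample.
sample = {"GENE_A": 10.0, "GENE_B": 8.0, "GENE_C": 5.0, "GENE_D": 2.0, "GENE_E": 1.0}
score_high = ssgsea_score(sample, {"GENE_A", "GENE_B"})  # set genes are highly expressed
score_low = ssgsea_score(sample, {"GENE_D", "GENE_E"})   # set genes are lowly expressed
```

A gene set whose members sit near the top of the sample's expression ranking yields a positive score, and one whose members sit near the bottom yields a negative score.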
  • the machine learning model analyzes the enrichment scores corresponding to the gene sets to predict a response to therapy 505.
  • Non-limiting examples of gene sets suitable for use according to the systems and methods disclosed herein are provided in Table 2.
  • each enrichment score for a gene set forms a feature that makes up part of the input to the trained machine learning model.
  • the response to therapy can be any suitable metric, indicator, or classification.
  • a regression model may output a number between 0 and 1 indicative of responsiveness to therapy.
  • a classifier may generate a classification between two or more categories such as, for example, response to treatment, partial response to treatment, no response to treatment, survival, etc.
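A minimal sketch of a model that turns per-gene-set enrichment scores into a number between 0 and 1 is shown below. Logistic regression trained by gradient descent is used purely as a stand-in; the disclosure does not tie the method to this model, and the feature values, labels, and function names are hypothetical.

```python
import math


def predict_proba(weights, bias, x):
    """Logistic output in (0, 1) for one feature vector of enrichment scores."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Each row holds hypothetical enrichment scores for two gene sets; label 1 = response.
X = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.3], [0.2, 0.6], [0.1, 0.7], [0.3, 0.5]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)

likely_responder = predict_proba(w, b, [0.85, 0.15])
likely_non_responder = predict_proba(w, b, [0.15, 0.65])
predicted_class = "response" if likely_responder >= 0.5 else "no response"
```

Thresholding the continuous output, as in the last line, is one simple way to move from a regression-style score to a categorical classification.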
  • Various suitable machine learning models can be used.
  • Support vector machine (SVM) is suitable for both regression and classification analysis and can provide a high level of accuracy without requiring significant computing power.
  • PCA principal component analysis
  • the machine learning model is configured to process at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 gene set metrics (e.g., enrichment scores). In some cases, the machine learning model is configured to process no more than 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, or 500 gene sets. In some cases, each gene set independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
  • each gene set independently comprises no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
  • gene set enrichment analysis may be performed on 10 different gene sets, one of which has 10 genes and one of which has 200 genes.
  • the systems and methods disclosed herein utilize 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 gene sets, each of which independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes and/or no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes.
  • one or more of the gene sets used as features in the predictive model do not utilize the full list of genes within a known gene set. For example, fewer than 100% of the genes in a given gene set may be used for calculating an output or metric for that gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set.
  • gene set 1 (“HALLMARK EPITHELIAL MESENCHYMAL TRANSITION”) from Table 2 includes 200 genes associated with epithelial to mesenchymal transition.
  • the gene set enrichment analysis performed according to the systems and methods disclosed herein may utilize 50% of the genes in this gene set with respect to epithelial to mesenchymal transition in combination with certain independently determined percentages of other gene sets in Table 1.
  • any combination of genes within each gene set may be used for gene set enrichment analysis to generate a corresponding output metric such as an enrichment score. Then the output for a plurality of gene sets can be used as input features provided to a machine learning algorithm or model to generate a composite score indicative of a prediction such as an outcome or treatment outcome.
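Using only a fraction of a gene set's members can be sketched as follows. The disclosure does not say how the retained genes are chosen; seeded random sampling is an assumption made here for illustration, and the gene names are placeholders.

```python
import random


def subsample_gene_set(genes, fraction, seed=0):
    """Return a reproducible subset holding roughly `fraction` of the genes."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(genes)  # fixed order so the same seed gives the same subset
    k = max(1, round(fraction * len(ordered)))
    return set(random.Random(seed).sample(ordered, k))


# A placeholder 200-gene set reduced to 50% of its members.
full_set = {f"GENE_{i}" for i in range(200)}
half_set = subsample_gene_set(full_set, 0.5)
```

The reduced set can then be scored in place of the full set, with a different fraction chosen independently for each gene set.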
  • the identities of the genes making up each gene set listed in Table 2 can be found on the publicly accessible database MSigDB and are also listed in Table 3, which shows the gene member identification used by MSigDB alongside the corresponding NCBI Gene ID and Gene Symbol.
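MSigDB distributes gene sets in the tab-delimited GMT format, in which each line holds a set name, a description field, and then the member genes. A minimal parser is sketched below; the set names and genes in the example text are placeholders, not entries from Table 2 or Table 3.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into a dict of set name -> set of gene identifiers."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Placeholder GMT content with two gene sets.
example_gmt = (
    "EXAMPLE_SET_A\thttps://example.org/a\tGENE_1\tGENE_2\tGENE_3\n"
    "EXAMPLE_SET_B\tna\tGENE_2\tGENE_4\n"
)
gene_sets = parse_gmt(example_gmt)
```

The resulting dictionary gives one membership set per gene set, which is the form needed to compute one enrichment score per set.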
  • the output can make up the features that are processed using an algorithm such as a trained model generated using machine learning to generate an evaluation such as a predicted treatment outcome.
  • a predicted treatment outcome from a sample of a subject.
  • the subject has or is suspected of having a disease or disorder.
  • the disease or disorder can be a cancer.
  • the predicted treatment outcome is for an immunotherapy targeting a cancer.
  • the methods disclosed herein comprise obtaining a sample from a subject.
  • the sample is any fluid or other material derived from the body of a normal or disease subject including, but not limited to, blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, milk, amniotic fluid, bile, ascites fluid, organ or tissue extract, and culture fluid in which any cells or tissue preparation from a subject has been incubated.
  • the sample is obtained from skin, blood, brain, bladder, bone, bone marrow, breast, colon, stomach, esophagus, ovary, uterus, gallbladder, fallopian tube, testicle, kidney, liver, pancreas, adrenal gland, cervix, endometrium, head or neck, lung, prostate, thymus, thyroid, lymph node, or urinary bladder.
  • the sample is a cancer sample or biopsy.
  • the cancer sample is typically a solid tumor sample or a liquid tumor sample.
  • the cancer sample can be obtained from excised tissue.
  • the sample is fresh, frozen, or fixed.
  • a fixed sample is paraffin-embedded or fixed with formalin, formaldehyde, or glutaraldehyde.
  • the sample is formalin-fixed paraffin-embedded.
  • the sample is stored after it has been collected, but before additional steps are to be performed. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. In some instances, the sample is stored in a solution comprising glycerol, glycol, dimethyl sulfoxide, growth media, nutrient broth or any combination thereof. The sample may be stored for any suitable period of time. In some instances, the sample is stored for any period of time and remains suitable for downstream applications.
  • the sample is stored for any period of time before nucleic acid (e.g., ribonucleic acid (RNA) or deoxyribonucleic acid (DNA)) extraction.
  • the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months.
  • the sample is stored for at least 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, 11 years, 12 years, or more than 12 years.
  • Methods and systems as described herein comprise generating an immune-oncology profile from a sample of a subject, wherein the sample comprises a nucleic acid molecule.
  • the nucleic acid molecule is RNA, DNA, fragments, or combinations thereof.
  • the sample is processed further before analysis.
  • the sample is processed to extract the nucleic acid molecule from the sample.
  • no extraction or processing procedures are performed on the sample.
  • the nucleic acid is extracted using any technique that does not interfere with subsequent analysis. Extraction techniques include, for example, alcohol precipitation using ethanol, methanol or isopropyl alcohol. In some instances, extraction techniques use phenol, chloroform, or any combination thereof.
  • extraction techniques use a column or resin based nucleic acid purification scheme such as those commonly sold commercially.
  • the nucleic acid molecule is purified.
  • the nucleic acid molecule is further processed.
  • RNA is further reverse transcribed to cDNA.
  • processing of the nucleic acid comprises amplification.
  • the nucleic acid is stored in water, Tris buffer, or Tris-EDTA buffer before subsequent analysis.
  • the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C.
  • a nucleic acid molecule obtained from a sample may be characterized by factors such as integrity of the nucleic acid molecule or size of the nucleic acid molecule. In some instances, the nucleic acid molecule is DNA.
  • the nucleic acid molecule is RNA.
  • the RNA or DNA comprises a specific integrity.
  • the RNA integrity number (RIN) of the RNA is no more than about 2.
  • the RNA molecules in a sample have a RIN of about 2 to about 10.
  • the RNA molecules in a sample have a RIN of at least about 2.
  • the RNA molecules in a sample have a RIN of at most about 10.
  • the RNA molecules in a sample have a RIN of about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 7, about 2 to about 8, about 2 to about 9, about 2 to about 10, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 7, about 3 to about 8, about 3 to about 9, about 3 to about 10, about 4 to about 5, about 4 to about 6, about 4 to about 7, about 4 to about 8, about 4 to about 9, about 4 to about 10, about 5 to about 6, about 5 to about 7, about 5 to about 8, about 5 to about 9, about 5 to about 10, about 6 to about 7, about 6 to about 8, about 6 to about 9, about 6 to about 10, about 7 to about 8, about 7 to about 9, about 7 to about 10, about 8 to about 9, about 8 to about 10, or about 9 to about 10.
  • the RNA molecule in a sample may be characterized by size. In some instances, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%, or more of the RNA molecules in a sample are at least 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, or more than 400 nucleotides in size. In some instances, the RNA molecules in the sample are at least 200 nucleotides in size. In some instances, the RNA molecules of at least 200 nucleotides in size comprise a percentage of the sample (DV200).
  • the percentage is at least or about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95%.
  • the RNA molecules in a sample have a DV200 value of about 10% to about 90%. In some instances, the RNA molecules in a sample have a DV200 value of at least about 10%. In some instances, the RNA molecules in a sample have a DV200 value of at most about 90%.
  • the RNA molecules in a sample have a DV200 value of about 10% to about 20%, about 10% to about 30%, about 10% to about 40%, about 10% to about 50%, about 10% to about 60%, about 10% to about 70%, about 10% to about 80%, about 10% to about 90%, about 20% to about 30%, about 20% to about 40%, about 20% to about 50%, about 20% to about 60%, about 20% to about 70%, about 20% to about 80%, about 20% to about 90%, about 30% to about 40%, about 30% to about 50%, about 30% to about 60%, about 30% to about 70%, about 30% to about 80%, about 30% to about 90%, about 40% to about 50%, about 40% to about 60%, about 40% to about 70%, about 40% to about 80%, about 40% to about 90%, about 50% to about 60%, about 50% to about 70%, about 50% to about 80%, about 50% to about 90%, about 60% to about 70%, about 60% to about 80%, about 60% to about 90%, about 70% to about 80%, about 70% to about 90%, or about 80% to about 90%.
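The DV200 value described above can be illustrated with a short calculation. The following is a minimal sketch, assuming a list of fragment lengths is already available; the function names `dv200` and `within_range` and the example lengths are hypothetical and not part of the disclosure. In practice, instruments typically derive DV200 from an electropherogram trace rather than from a count of individual fragments, so this count-based version is a simplification.

```python
def dv200(fragment_lengths, threshold=200):
    """Return the percentage of fragments that are at least `threshold` nt long."""
    if not fragment_lengths:
        raise ValueError("no fragment lengths supplied")
    long_enough = sum(1 for n in fragment_lengths if n >= threshold)
    return 100.0 * long_enough / len(fragment_lengths)


def within_range(value, low, high):
    """Check whether a quality metric falls inside an inclusive range."""
    return low <= value <= high


# Hypothetical fragment lengths (in nucleotides) for one sample.
lengths = [80, 150, 210, 320, 450, 95, 600, 205, 190, 1000]
score = dv200(lengths)
print(score)                                 # percentage of fragments >= 200 nt
print(within_range(score, 10.0, 90.0))       # the 10% to 90% range named above
```

A sample could be accepted or flagged by comparing the returned value against whichever DV200 range is chosen for the workflow.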
  • the nucleic acid molecule is prepared for sequencing.
  • a sequencing library is prepared. Numerous library generation methods have been described.
  • methods for library generation comprise addition of a sequencing adapter. Sequencing adapters may be added to the nucleic acid molecule by ligation.
  • library generation comprises an end-repair reaction.
  • library generation for sequencing comprises an enrichment step. For example, coding regions of the mRNA are enriched. In some instances, the enrichment step is for a subset of genes. In some instances, the enrichment step comprises using a bait set.
  • the bait set may be used to enrich for genes used for specific downstream applications.
  • a bait set generally refers to a set of baits targeted toward a selected set of genomic regions of interest. For example, a bait set may be selected for genomic regions relating to at least one of immune modulatory molecule expression, cell type and ratio, or mutational burden. In some instances, one bait set is used for determining immune modulatory molecule expression, a second bait set is used for determining cell type and ratio, and a third bait set is used for determining mutational burden.
  • a bait set comprises at least one unique molecular identifier (UMI).
  • UMI refers to nucleic acid having a sequence which can be used to identify and/or distinguish one or more first molecules to which the UMI is conjugated from one or more second molecules.
  • the UMI is conjugated to one or more target molecules of interest or amplification products thereof.
  • UMIs may be single or double stranded.
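One common use of UMIs is to collapse amplification duplicates so that each original molecule is counted once. The sketch below assumes reads have already been assigned to a gene and paired with a UMI sequence; the function name `count_unique_molecules`, the gene labels, and the UMI strings are illustrative assumptions, and real pipelines usually also tolerate sequencing errors within the UMI.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per gene from (gene, umi) pairs.

    Reads that share both gene and UMI are treated as copies of the
    same original molecule and are counted once.
    """
    umis_by_gene = defaultdict(set)
    for gene, umi in reads:
        umis_by_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_by_gene.items()}


# Hypothetical reads: two copies of one CD8A molecule, one other CD8A
# molecule, and a single FOXP3 molecule.
reads = [
    ("CD8A", "ACGT"),
    ("CD8A", "ACGT"),
    ("CD8A", "TTAG"),
    ("FOXP3", "GGCA"),
]
counts = count_unique_molecules(reads)
print(counts)
```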
  • the systems and methods disclosed herein provide for the sequencing of a number of genes.
  • the number of genes is at least about 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, or more than 10000 genes.
  • the number of genes to be sequenced is in a range of about 500 to about 1000 genes.
  • the number of genes to be sequenced is at least about 200.
  • the number of genes to be sequenced is at most about 10,000.
  • the number of genes to be sequenced is in a range of about 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 6,000, 200 to 8,000, 200 to 10,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 6,000, 500 to 8,000, 500 to 10,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 6,000, 1,000 to 8,000, 1,000 to 10,000, 2,000 to 4,000, 2,000 to 6,000, 2,000 to 8,000, 2,000 to 10,000, 4,000 to 6,000, 4,000 to 8,000, 4,000 to 10,000, 6,000 to 8,000, 6,000 to 10,000, or 8,000 to 10,000.
  • Sequencing may be performed with any appropriate sequencing technology.
  • sequencing methods include, but are not limited to single molecule real-time sequencing, Polony sequencing, sequencing by ligation, reversible terminator sequencing, proton detection sequencing, ion semiconductor sequencing, nanopore sequencing, electronic sequencing, pyrosequencing, Maxam-Gilbert sequencing, chain termination (e.g., Sanger) sequencing, +S sequencing, or sequencing by synthesis.
  • Sequencing methods may include, but are not limited to, one or more of: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, and primer walking. Sequencing may generate sequencing reads (“reads”), which may be processed (e.g., by alignment) to yield longer sequences, such as consensus sequences.
  • An average read length from sequencing may vary.
  • the average read length is at least about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, or more than 80000 base pairs.
  • the average read length is in a range of about 100 to 80,000.
  • the average read length is at least about 100.
  • the average read length is at most about 80,000.
  • the average read length is in a range of about 100 to 200, 100 to 300, 100 to 500, 100 to 1,000, 100 to 2,000, 100 to 4,000, 100 to 8,000, 100 to 10,000, 100 to 20,000, 100 to 40,000, 100 to 80,000, 200 to 300, 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 8,000, 200 to 10,000, 200 to 20,000, 200 to 40,000, 200 to 80,000, 300 to 500, 300 to 1,000, 300 to 2,000, 300 to 4,000, 300 to 8,000, 300 to 10,000, 300 to 20,000, 300 to 40,000, 300 to 80,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 8,000, 500 to 10,000, 500 to 20,000, 500 to 40,000, 500 to 80,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 8,000, 1,000 to 10,000, 1,000 to 20,000, 1,000 to 40,000, 1,000 to 80,000, 2,000 to 4,000, 2,000 to 8,000, 2,000 to 10,000, 2,000 to 20,000, 40,000, 1,000 to 80,000, 2,000
  • a number of nucleotides that are sequenced is at least or about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 2000, 2500, 3000, or more than 3000 nucleotides. In some instances, the number of nucleotides that are sequenced is about 5 to about 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is at least 5 nucleotides. In some instances, the number of nucleotides that are sequenced is at most 3,000 nucleotides.
  • the number of nucleotides that are sequenced are 5 to 50, 5 to 100, 5 to 200, 5 to 400, 5 to 600, 5 to 800, 5 to 1,000, 5 to 1,500, 5 to 2,000, 5 to 2,500, 5 to 3,000, 50 to 100, 50 to 200, 50 to 400, 50 to 600, 50 to 800, 50 to 1,000, 50 to 1,500, 50 to 2,000, 50 to 2,500, 50 to 3,000, 100 to 200, 100 to 400, 100 to 600, 100 to 800, 100 to 1,000, 100 to 1,500, 100 to 2,000, 100 to 2,500, 100 to 3,000, 200 to 400, 200 to 600, 200 to 800, 200 to 1,000, 200 to 1,500, 200 to 2,000, 200 to 2,500, 200 to 3,000, 400 to 600, 400 to 800, 400 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 400 to 1,500,
  • Sequencing methods may include a barcoding or “tagging” step.
  • barcoding (or “tagging”) can allow for generation of a population of samples of nucleic acids, wherein each nucleic acid can be identified from which sample the nucleic acid originated.
  • the barcode comprises oligonucleotides that are ligated to the nucleic acids.
  • the barcode is ligated using an enzyme, including but not limited to, E. coli ligase, T4 ligase, mammalian ligases (e.g., DNA ligase I, DNA ligase II, DNA ligase III, DNA ligase IV), thermostable ligases, and fast ligases.
  • Barcoding or tagging may occur using various types of barcodes or tags.
  • barcodes or tags include, but are not limited to, a radioactive barcode or tag, a fluorescent barcode or tag, an enzyme, a chemiluminescent barcode or tag, and a colorimetric barcode or tag.
  • the barcode or tag is a fluorescent barcode or tag.
  • the fluorescent barcode or tag comprises a fluorophore.
  • the fluorophore is an aromatic or heteroaromatic compound.
  • the fluorophore is a pyrene, anthracene, naphthalene, acridine, stilbene, benzoxazole, indole, benzindole, oxazole, thiazole, benzothiazole, cyanine, carbocyanine, salicylate, anthranilate, xanthene dye, or coumarin.
  • xanthene dyes include, e.g., fluorescein and rhodamine dyes.
  • Fluorescein and rhodamine dyes include, but are not limited to, 6-carboxyfluorescein (FAM), 2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein (JOE), tetrachlorofluorescein (TET), 6-carboxyrhodamine (R6G), N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA), and 6-carboxy-X-rhodamine (ROX).
  • the fluorescent barcode or tag also includes the naphthylamine dyes that have an amino group in the alpha or beta position.
  • naphthylamino compounds include 1-dimethylaminonaphthyl-5-sulfonate, 1-anilino-8-naphthalene sulfonate, 2-p-toluidinyl-6-naphthalene sulfonate, and 5-(2'-aminoethyl)aminonaphthalene-1-sulfonic acid (EDANS).
  • Examples of coumarins include, e.g., 3-phenyl-7-isocyanatocoumarin; acridines, such as 9-isothiocyanatoacridine and acridine orange; N-(p-(2-benzoxazolyl)phenyl) maleimide; cyanines, such as, e.g., indodicarbocyanine 3 (Cy3), indodicarbocyanine 5 (Cy5), indodicarbocyanine 5.5 (Cy5.5), 3-(-carboxy-pentyl)-3'-ethyl-5,5'-dimethyloxacarbocyanine (CyA); 1H,5H,11H,15H-Xantheno[2,3,4-ij:5,6,7-i'j']diquinolizin-18-ium, 9-[2 (or 4)-[[[6-[2,5-dioxo-1-pyrroli
  • barcode lengths include barcode sequences comprising, without limitation, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more bases in length.
  • barcode lengths include barcode sequences comprising, without limitation, from 1-5, 1-10, 5-20, or 1-25 bases in length. Barcode systems may be in base 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or a similar coding scheme.
  • a number of barcodes is at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000, 25000, 30000, 40000, 50000, 100000, 500000, 1000000, or more than 1000000 barcodes. In some instances, a number of barcodes is in a range of 1-1000000 barcodes.
  • the number of barcodes is in a range of about 1-10, 1-50, 1-100, 1-500, 1-1,000, 1-5,000, 1-10,000, 1-50,000, 1-100,000, 1-500,000, 1-1,000,000, 10-50, 10-100, 10-500, 10-1,000, 10-5,000, 10-10,000, 10-50,000, 10-100,000, 10-500,000, 10-1,000,000, 50-100, 50-500, 50-1,000, 50-5,000, 50-10,000, 50-50,000, 50-100,000, 50-500,000, 50-1,000,000, 100-500, 100-1,000, 100-5,000, 100-10,000, 100-50,000, 100-100,000, 100-500,000, 100-1,000,000, 500-1,000, 500-5,000, 500-10,000, 500-50,000, 500-100,000, 500-500,000, 500-1,000,000, 1,000-5,000, 1,000-10,000, 1,000-50,000, 1,000-100,000, 1,000-500,000, 1,000-1,000,000, 5,000-10,000, 5,000-50,000, 5,000-100,000, 5,000-500,000, 5,000-1,000,000, 10,000-50,000, 10,000-100,000, 10,000-500,000, 10,000-1,000,000, 50,000-100,000, 50,000-500,000, 50,000-1,000,000, 100,000-500,000, 100,000-1,000,000, or 500,000-1,000,000 barcodes.
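The role of a sequence barcode in tracing each nucleic acid back to its sample of origin can be sketched as a simple demultiplexing step. The example below assumes the barcode is an oligonucleotide at the start of each read and that matching is exact; the function name `demultiplex`, the barcode sequences, and the sample labels are hypothetical. It also shows why barcode length bounds the number of distinguishable barcodes for a four-letter nucleotide alphabet.

```python
def demultiplex(reads, barcode_to_sample, barcode_length):
    """Assign reads to samples by their leading barcode (exact match).

    Returns (assigned, unassigned): `assigned` maps each sample to the
    read sequences with the barcode trimmed off; `unassigned` holds reads
    whose leading bases match no known barcode.
    """
    assigned = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)
        else:
            assigned[sample].append(insert)
    return assigned, unassigned


# Hypothetical 4-base barcodes and reads.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGGTTT", "TGCAAACCC", "NNNNACGT"]
assigned, unassigned = demultiplex(reads, barcodes, barcode_length=4)
print(assigned)
print(unassigned)

# Upper bound on distinct barcodes of a given length over A, C, G, T.
max_barcodes = 4 ** 4
print(max_barcodes)
```

Exact matching keeps the sketch short; a production workflow would more likely allow a small number of mismatches and choose barcodes that remain distinguishable under that tolerance.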
  • GSEA Gene Set Enrichment Analysis
  • a predefined set of genes may be evaluated to produce an output or metric such as, for example, a score corresponding to the difference between two or more categories or biological states.
  • Multiple sets of genes can be evaluated to generate multiple such outputs or metrics.
  • These outputs or metrics may comprise the features of a model such as a trained machine learning model configured to generate predictions with respect to the categories or biological state.
  • the model may be a regression that generates an output along a continuum (e.g., any value between 0 and 1) or a classifier which generates a classification for a data set.
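The flow from gene sets to model features to a prediction can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring method of the disclosure: each gene set score is taken as the mean difference between a sample's expression and a reference profile, and the model is a logistic function with hand-picked weights. The names `gene_set_scores`, `predict_probability`, and `classify`, the gene identifiers, the weights, and the state labels are all hypothetical, and the score shown is not the running-sum enrichment statistic used in classical GSEA.

```python
import math


def gene_set_scores(sample, reference, gene_sets):
    """Score each gene set as the mean (sample - reference) expression difference."""
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in sample and g in reference]
        if not present:
            scores[name] = 0.0  # no measured genes in this set
            continue
        total = sum(sample[g] - reference[g] for g in present)
        scores[name] = total / len(present)
    return scores


def predict_probability(scores, weights, bias=0.0):
    """Regression-style output between 0 and 1 from gene set scores."""
    z = bias + sum(weights[name] * scores[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn the continuous output into one of two hypothetical state labels."""
    return "state_A" if probability >= threshold else "state_B"


# Hypothetical log-scale expression values and gene sets.
sample = {"G1": 5.0, "G2": 7.0, "G3": 2.0, "G4": 1.0}
reference = {"G1": 3.0, "G2": 4.0, "G3": 2.0, "G4": 3.0}
gene_sets = {"set_up": ["G1", "G2"], "set_down": ["G3", "G4"]}
weights = {"set_up": 1.0, "set_down": -1.0}  # assumed, not trained

scores = gene_set_scores(sample, reference, gene_sets)
probability = predict_probability(scores, weights)
print(scores)
print(round(probability, 3), classify(probability))
```

In a trained system the weights would be learned from labeled samples; the same feature dictionary could feed either the continuous output or the thresholded classification described above.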
  • the sample often comprises a heterogeneous composition of different cell types and/or subtypes.
  • the sample is a tumor sample.
  • the cell types and/or subtypes that make up the sample include one or more of cancer cells, non-cancer cells, and/or immune cells.
  • non-immune cells examples include salivary gland cells, mammary gland cells, lacrimal gland cells, ceruminous gland cells, eccrine sweat gland cells, apocrine sweat gland cells, sebaceous gland cells, Bowman's gland cells, Brunner's gland cells, prostate gland cells, seminal vesicle cells, bulbourethral gland cells, keratinizing epithelial cells, hair shaft cells, epithelial cells, exocrine secretory epithelial cells, uterus endometrium cells, isolated goblet cells of respiratory and digestive tracts, stomach lining mucous cells, hormone secreting cells, pituitary cells, gut and respiratory tract cells, thyroid gland cells, adrenal gland cells, chromaffin cells, Leydig cells, theca interna cells, macula densa cells of kidney, peripolar cells of kidney, mesangial cells of kidney, hepatocytes, white fat cells, brown fat cells, liver lipocytes, kidney cells, kidney glomerulus parietal
  • lymphoid cells include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
  • the cells are stromal cells, for example, mesenchymal stem cells, adipocytes, preadipocytes, stromal cells, fibroblasts, pericytes, endothelial cells, microvascular endothelial cells, lymphatic endothelial cells, smooth muscle cells, chondrocytes, osteoblasts, skeletal muscle cells, myocytes.
  • stem cells include, but are not limited to, hematopoietic stem cells, common lymphoid progenitor cells, common myeloid progenitor cells, granulocyte-macrophage progenitor cells, megakaryocyte-erythroid progenitor cells, multipotent progenitor cells, megakaryocytes, erythrocytes, and platelets.
  • myeloid cells include, but are not limited to, monocytes, macrophages, macrophages M1, macrophages M2, dendritic cells, conventional dendritic cells, plasmacytoid dendritic cells, immature dendritic cells, neutrophils, eosinophils, mast cells, and basophils.
  • the sequencing data comprises genes that are differentially expressed by various immune cell types.
  • immune cells to be detected by methods described herein include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
  • the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.
  • the term “about” in reference to a percentage refers to an amount that is greater or less than the stated percentage by 10%, 5%, or 1%, including increments therein.
  • each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
  • RNA refers to a molecule comprising at least one ribonucleotide residue.
  • RNA may include transcripts.
  • By “ribonucleotide” is meant a nucleotide with a hydroxyl group at the 2’ position of a beta-D-ribo-furanose moiety.
  • RNA includes, but is not limited to, mRNA, ribosomal RNA, tRNA, non-protein-coding RNA (npcRNA), non-messenger RNA, functional RNA (fRNA), long non-coding RNA (lncRNA), pre-mRNAs, and primary miRNAs (pri-miRNAs).
  • RNA includes, for example, double-stranded (ds) RNAs; single-stranded RNAs; and isolated RNAs such as partially purified RNA, essentially pure RNA, synthetic RNA, recombinant RNA, as well as altered RNA that differ from naturally-occurring RNA by the addition, deletion, substitution and/or alteration of one or more nucleotides.
  • alterations can include addition of non-nucleotide material, such as to the end(s) of the siRNA or internally, for example at one or more nucleotides of the RNA.
  • Nucleotides in the RNA molecules described herein can also comprise non-standard nucleotides, such as non-naturally occurring nucleotides or chemically synthesized nucleotides or deoxynucleotides. These altered RNAs can be referred to as analogs or analogs of naturally-occurring RNA.
  • Unless specifically stated or obvious from context, as used herein, the term “about” in reference to a number or range of numbers is understood to mean the stated number and numbers +/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.
  • sample generally refers to a biological sample of a subject.
  • the biological sample may be a tissue or fluid of the subject, such as blood (e.g., whole blood), plasma, serum, urine, saliva, mucosal excretions, sputum, stool and tears.
  • the biological sample may be derived from a tissue or fluid of the subject.
  • the biological sample may be a tumor sample or heterogeneous tissue sample.
  • the biological sample may have or be suspected of having disease tissue.
  • the tissue may be processed to obtain the biological sample.
  • the biological sample may be a cellular sample.
  • the biological sample may be a cell-free (or cell free) sample, such as cell-free DNA or RNA.
  • the biological sample may comprise cancer cells, non-cancer cells, immune cells, non-immune cells, or any combination thereof.
  • the biological sample may be a tissue sample.
  • the biological sample may be a liquid sample.
  • the liquid sample can be a cancer or non-cancer sample.
  • Non-limiting examples of liquid biological samples include synovial fluid, whole blood, blood plasma, lymph, bone marrow, cerebrospinal fluid, serum, seminal fluid, urine, and amniotic fluid.
  • variant generally refers to a genetic variant, such as an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual.
  • Single nucleotide polymorphisms are a form of polymorphisms.
  • one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences.
  • Copy number variants (CNVs), transversions and other rearrangements are also forms of genetic variation.
  • a genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
  • the term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets.
  • the subject can be a healthy individual, an individual that has or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy.
  • the subject can be a patient.
  • the subject may have or be suspected of having a disease.
  • Referring to FIG. 4, a block diagram is shown depicting an exemplary machine that includes a computer system 400 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure.
  • the components in FIG. 4 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
  • Computer system 400 may include one or more processors 401, a memory 403, and a storage 408 that communicate with each other, and with other components, via a bus 440.
  • the bus 440 may also link a display 432, one or more input devices 433 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 434, one or more storage devices 435, and various tangible storage media 436. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 440.
  • the various tangible storage media 436 can interface with the bus 440 via storage medium interface 426.
  • Computer system 400 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
  • Computer system 400 includes one or more processor(s) 401 (e.g., central processing units (CPUs) or general purpose graphics processing units (GPGPUs)) that carry out functions.
  • processor(s) 401 optionally contains a cache memory unit 402 for temporary local storage of instructions, data, or computer addresses.
  • Processor(s) 401 are configured to assist in execution of computer readable instructions.
  • Computer system 400 may provide functionality for the components depicted in FIG. 4 as a result of the processor(s) 401 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 403, storage 408, storage devices 435, and/or storage medium 436.
  • the computer-readable media may store software that implements particular embodiments, and processor(s) 401 may execute the software.
  • Memory 403 may read the software from one or more other computer-readable media (such as mass storage device(s) 435, 436) or from one or more other sources through a suitable interface, such as network interface 420.
  • the software may cause processor(s) 401 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 403 and modifying the data structures as directed by the software.
  • the memory 403 may include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 404) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random access memory (FRAM), phasechange random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 405), and any combinations thereof.
  • ROM 405 may act to communicate data and instructions unidirectionally to processor(s) 401, and RAM 404 may act to communicate data and instructions bidirectionally with processor(s) 401.
  • ROM 405 and RAM 404 may include any suitable tangible computer-readable media described below.
  • a basic input/output system 406 (BIOS) including basic routines that help to transfer information between elements within computer system 400, such as during start-up, may be stored in the memory 403.
  • Fixed storage 408 is connected bidirectionally to processor(s) 401, optionally through storage control unit 407.
  • Fixed storage 408 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein.
  • Storage 408 may be used to store operating system 409, executable(s) 410, data 411, applications 412 (application programs), and the like.
  • Storage 408 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above.
  • Information in storage 408 may, in appropriate cases, be incorporated as virtual memory in memory 403.
  • storage device(s) 435 may be removably interfaced with computer system 400 (e.g., via an external port connector (not shown)) via a storage device interface 425.
  • storage device(s) 435 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 400.
  • software may reside, completely or partially, within a machine-readable medium on storage device(s) 435.
  • software may reside, completely or partially, within processor(s) 401.
  • Bus 440 connects a wide variety of subsystems.
  • reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate.
  • Bus 440 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures.
  • such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, a HyperTransport (HTX) bus, a serial advanced technology attachment (SATA) bus, and any combinations thereof.
  • Computer system 400 may also include an input device 433.
  • a user of computer system 400 may enter commands and/or other information into computer system 400 via input device(s) 433.
  • Examples of an input device(s) 433 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touchpad, a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof.
  • the input device is a Kinect, Leap Motion, or the like.
  • Input device(s) 433 may be interfaced to bus 440 via any of a variety of input interfaces 423 (e.g., input interface 423) including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
  • when computer system 400 is connected to network 430, computer system 400 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 430. Communications to and from computer system 400 may be sent through network interface 420.
  • network interface 420 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 430, and computer system 400 may store the incoming communications in memory 403 for processing.
  • Computer system 400 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 403, and these may be communicated to network 430 from network interface 420.
  • Processor(s) 401 may access these communication packets stored in memory 403 for processing.
  • Examples of the network interface 420 include, but are not limited to, a network interface card, a modem, and any combination thereof.
  • Examples of a network 430 or network segment 430 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof.
  • a network, such as network 430 may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
  • Information and data can be displayed through a display 432.
  • Examples of a display 432 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light-emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof.
  • the display 432 can interface to the processor(s) 401, memory 403, and fixed storage 408, as well as other devices, such as input device(s) 433, via the bus 440.
  • the display 432 is linked to the bus 440 via a video interface 422, and transport of data between the display 432 and the bus 440 can be controlled via the graphics control 421.
  • the display is a video projector.
  • the display is a head-mounted display (HMD) such as a VR headset.
  • suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like.
  • the display is a combination of devices such as those disclosed herein.
  • computer system 400 may include one or more other peripheral output devices 434 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof.
  • peripheral output devices may be connected to the bus 440 via an output interface 424.
  • Examples of an output interface 424 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
  • computer system 400 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein.
  • Reference to software in this disclosure may encompass logic, and reference to logic may encompass software.
  • reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate.
  • the present disclosure encompasses any suitable combination of hardware, software, or both.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may be integral to the processor.
  • the processor and the storage medium may reside in an ASIC.
  • the ASIC may reside in a user terminal.
  • the processor and the storage medium may reside as discrete components in a user terminal.
  • suitable computing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles.
  • Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of skill in the art.
  • the computing device includes an operating system configured to perform executable instructions.
  • the operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
  • server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
  • suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
  • the operating system is provided by cloud computing.
  • suitable mobile smartphone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
  • suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®.
  • video game console operating systems include, by way of non-limiting examples, Sony® PS3®, Sony® PS4®, Microsoft® Xbox 360®, Microsoft Xbox One, Nintendo® Wii®, Nintendo® Wii U®, and Ouya®.
  • Non-transitory computer readable storage medium
  • the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device.
  • a computer readable storage medium is a tangible component of a computing device.
  • a computer readable storage medium is optionally removable from a computing device.
  • a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like.
  • the program and instructions are permanently, substantially permanently, semipermanently, or non-transitorily encoded on the media.
  • the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same.
  • a computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task.
  • Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
  • the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
  • a computer program comprises one sequence of instructions.
  • a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
  • the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same.
  • software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art.
  • the software modules disclosed herein are implemented in a multitude of ways.
  • a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof.
  • a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof.
  • the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application.
  • software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
  • Databases
  • the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same.
  • suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase.
  • a database is internet-based.
  • a database is web-based.
  • a database is cloud computing-based.
  • a database is a distributed database.
  • a database is based on one or more local computer storage devices.
  • machine learning algorithms are utilized to generate a trained model or classifier configured to process input data comprising a plurality of features and generate an output indicative of a predicted outcome or classification.
  • the plurality of features may include scores based on gene sets, for example, GSEA gene set enrichment scores, although metrics calculated based on gene sets are also contemplated.
  • the machine learning algorithms herein employ one or more forms of labels including but not limited to human annotated labels and semi-supervised labels.
  • the labels can be indicative of treatment outcomes for cancer patients.
  • the labels may be indicative of response to immunotherapies.
  • Examples of labels include complete response, partial response, stable disease, and progressive disease as measures of efficacy of a therapeutic intervention for a disease such as cancer.
  • the machine learning algorithm utilizes regression modeling, wherein relationships between predictor variables and dependent variables are determined and weighted.
  • the predicted outcome (e.g., responsiveness to an immunotherapy) is a dependent variable and is derived from a plurality of biological features such as GSEA enrichment scores.
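  • As a minimal sketch of the regression modeling described above, the following fits a logistic regression by batch gradient descent, learning one weight per enrichment score. The function names, learning rate, epoch count, and toy data shape are illustrative assumptions and are not taken from the disclosure.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Predicted probability of response for one sample's feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],          # 1 = responder, 0 = non-responder (assumed encoding)
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log-loss; returns (weights, intercept)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi   # derivative of log-loss w.r.t. the logit
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b
```

  • Each learned weight indicates how strongly, and in which direction, the corresponding predictor variable contributes to the predicted outcome.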
  • Examples of machine learning algorithms can include a support vector machine (SVM), a naive Bayes classification, a random forest, a neural network, deep learning, principal component analysis (PCA), or other supervised learning algorithm or unsupervised learning algorithm for classification and regression.
  • the machine learning algorithms can be trained using one or more training datasets.
  • a machine learning algorithm uses a supervised learning approach. In supervised learning, the algorithm generates a function from labeled training data. Each training example is a pair consisting of an input object and a desired output value. In some embodiments, an optimal scenario allows for the algorithm to correctly determine the class labels for unseen instances. In some embodiments, a supervised learning algorithm requires the user to determine one or more control parameters.
  • supervised learning allows for a model or classifier to be generated or trained with training data in which the expected output is known in advance such as when the ground truth location for a communication is known.
  • a machine learning algorithm uses an unsupervised learning approach.
  • In unsupervised learning, the algorithm generates a function to describe hidden structures from unlabeled data (e.g., a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm.
  • Approaches to unsupervised learning include: clustering, anomaly detection, and neural networks.
  • a machine learning algorithm uses a semi-supervised learning approach.
  • Semi-supervised learning combines both labeled and unlabeled data to generate an appropriate function or classifier.
  • Semi-supervised learning is usually used in data augmentation.
  • a machine learning algorithm uses a reinforcement learning approach. In reinforcement learning, the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
  • a machine learning algorithm learns in batches based on the training dataset and other inputs for that batch. In other embodiments, the machine learning algorithm performs on-line learning where the weights and error calculations are constantly updated.
  • a machine learning algorithm uses a transduction approach. Transduction is similar to supervised learning but does not explicitly construct a function. Instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
  • In some embodiments, a machine learning algorithm uses a “learning to learn” approach. In learning to learn, the algorithm learns its own inductive bias based on previous experience.
  • In some embodiments, a machine learning algorithm is applied to new or updated emergency data to be re-trained to generate a new prediction model. In some embodiments, a machine learning algorithm or model is re-trained periodically. In some embodiments, a machine learning algorithm or model is re-trained non-periodically.
  • a machine learning algorithm or model is re-trained at least once a day, a week, a month, or a year or more. In some embodiments, a machine learning algorithm or model is re-trained at least once every 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 days or more.
  • a machine learning algorithm is provided with unlabeled or unclassified data for unsupervised learning, which leaves the algorithm to identify hidden structure amongst the cases (e.g., clustering).
  • unsupervised learning is used to identify the representations that are most useful for classifying raw data (e.g., identifying features that help separate subjects into separate cohorts that may be analyzed using different models and/or evaluated with different thresholds or rules).
  • unsupervised learning is capable of identifying hidden patterns such as relationships between certain features from the data in the knowledge base that would not be readily apparent to a human.
  • one or more sets of training data are generated and provided to a computer-implemented system comprising one or more algorithms for making predictions.
  • an algorithm utilizes a predictive model such as a neural network, a decision tree, a support vector machine, or other applicable model.
  • an algorithm is able to form a classifier for generating a classification or prediction according to relevant features.
  • the features selected for classification can be classified using a variety of viable methods.
  • the trained algorithm comprises a machine learning algorithm.
  • the machine learning algorithm is selected from at least one of supervised, semi-supervised, and unsupervised learning algorithms, such as, for example, a support vector machine (SVM), a Naive Bayes classification, a random forest, an artificial neural network, a decision tree, K-means, learning vector quantization (LVQ), a regression algorithm (e.g., linear, logistic, multivariate), association rule learning, deep learning, dimensionality reduction, and ensemble selection algorithms.
  • the machine learning algorithm is a support vector machine (SVM), a Naive Bayes classification, a random forest, or an artificial neural network.
  • Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
  • Tumor samples were obtained from subjects having HNSCC (Adkins), bladder cancer, and melanoma. RNA extraction was performed on the tumor samples, and the extracted RNA was used for subsequent library generation using the Lexogen QuantSeq 3’ mRNA-Seq Library Prep Kit FWD for Illumina. The mRNA library was subjected to next generation sequencing using the Illumina NextSeq sequencing platform to generate gene expression data.
  • Single-sample gene set enrichment analysis (ssGSEA) was conducted according to gene sets derived from MSigDB, including KEGG and BioCarta. The 24 gene sets listed in Table 2 were subjected to GSEA to determine scores for each of the gene sets.
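  • To illustrate how a per-sample enrichment score can be computed for one gene set, the following is a simplified sketch loosely modeled on the rank-weighted running-sum idea behind ssGSEA. It is not the exact published algorithm or any package's API: the function name, the `alpha` rank exponent, and the normalization by the number of genes are assumptions made for illustration.

```python
from typing import Dict, Iterable


def enrichment_score(
    expression: Dict[str, float],
    gene_set: Iterable[str],
    alpha: float = 0.25,       # assumed exponent applied to ranks
) -> float:
    """Simplified single-sample enrichment score for one gene set.

    Genes are ranked by expression within the sample. Walking down the
    ranking, the score accumulates the gap between the rank-weighted
    cumulative fraction of gene-set members seen so far and the cumulative
    fraction of non-members seen so far.
    """
    members = set(gene_set) & set(expression)
    if not members or len(members) == len(expression):
        raise ValueError("gene set must overlap the sample without covering every gene")

    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    # Highest-expressed gene receives the largest rank weight.
    weights = {gene: (n - i) ** alpha for i, gene in enumerate(ordered)}
    hit_total = sum(weights[g] for g in members)
    miss_step = 1.0 / (n - len(members))

    hit_cdf = miss_cdf = total = 0.0
    for gene in ordered:
        if gene in members:
            hit_cdf += weights[gene] / hit_total
        else:
            miss_cdf += miss_step
        total += hit_cdf - miss_cdf
    return total / n   # assumed normalization so scores are comparable across samples
```

  • Under this sketch, a gene set whose members sit near the top of the sample's expression ranking receives a positive score, and one whose members sit near the bottom receives a negative score. Repeating the calculation for each gene set yields one feature vector per sample.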
  • the ssGSEA analysis produced a set of 24 enrichment scores for the 24 corresponding gene sets for the HNSCC tumor samples. These 24 enrichment scores of the tumor samples were used to train a machine learning model using linear principal component analysis (PCA) and support vector machine (SVM) methods in order to predict objective response and survival.
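  • The combination of linear PCA and an SVM can be sketched in a few lines of dependency-free Python. This toy version is an assumption-laden illustration, not the disclosed model: it keeps only the first principal component (found by power iteration), trains a one-dimensional soft-margin linear SVM by subgradient descent on the hinge loss, and uses hypothetical hyperparameters. A practical implementation would typically retain several components and use an established library.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def center(X: Matrix) -> Tuple[List[List[float]], List[float]]:
    """Subtract the per-feature mean; returns (centered rows, means)."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means


def first_component(Xc: Matrix, iters: int = 100) -> List[float]:
    """Leading principal direction via power iteration on X^T X.

    Limitation: fails if the start vector is orthogonal to that direction.
    """
    d = len(Xc[0])
    v = [1.0 / d ** 0.5] * d
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in Xc]             # X v
        w = [sum(p * row[j] for p, row in zip(proj, Xc)) for j in range(d)]   # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def train_linear_svm(
    z: Sequence[float], y: Sequence[int],
    lam: float = 0.01, lr: float = 0.1, epochs: int = 100,
) -> Tuple[float, float]:
    """1-D linear SVM; labels must be +1 or -1. Returns (weight, bias)."""
    w = b = 0.0
    for _ in range(epochs):
        for zi, yi in zip(z, y):
            if yi * (w * zi + b) < 1:          # inside the margin: hinge-loss subgradient step
                w += lr * (yi * zi - lam * w)
                b += lr * yi
            else:                              # outside the margin: only regularization shrinkage
                w -= lr * lam * w
    return w, b


def predict(x: Sequence[float], means: Sequence[float], v: Sequence[float],
            w: float, b: float) -> int:
    """Project a new sample onto the component and classify it as +1 or -1."""
    z = sum((xi - m) * vi for xi, m, vi in zip(x, means, v))
    return 1 if w * z + b >= 0 else -1
```

  • In use, each row of the input matrix would hold the enrichment scores for one tumor sample, with +1 and -1 standing in for the two outcome classes.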
  • the trained model (the “ssGSEA biomarker model”) was then evaluated for ability to predict treatment outcome. As shown in FIG. 1, the model was evaluated using an Out Of Bag Receiver Operating Characteristic (OOB ROC) analysis, which is a way to estimate model performance on untrained datasets.
  • AUC Area Under the Curve
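  • An out-of-bag ROC evaluation can be sketched as follows: a model is repeatedly fit on bootstrap resamples of the training set, each sample is scored only by models that did not see it, and the averaged out-of-bag scores are summarized by the area under the ROC curve. The nearest-centroid base learner, the number of rounds, and all function names here are assumptions chosen to keep the example short; they do not describe the evaluation actually performed.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Sequence

Row = Sequence[float]
Scorer = Callable[[Row], float]


def fit_centroid(X: Sequence[Row], y: Sequence[int]) -> Scorer:
    """Toy base learner: higher score means closer to the responder (label 1) centroid."""
    def centroid(rows: List[Row]) -> List[float]:
        return [mean(col) for col in zip(*rows)]

    def dist(a: Row, b: Row) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    c1 = centroid([x for x, t in zip(X, y) if t == 1])
    c0 = centroid([x for x, t in zip(X, y) if t == 0])
    return lambda x: dist(x, c0) - dist(x, c1)


def oob_scores(X: Sequence[Row], y: Sequence[int],
               fit: Callable[[Sequence[Row], Sequence[int]], Scorer] = fit_centroid,
               rounds: int = 200, seed: int = 0) -> List[Optional[float]]:
    """Mean out-of-bag score per sample (None if a sample was never out of bag)."""
    rng = random.Random(seed)
    n = len(X)
    collected: List[List[float]] = [[] for _ in range(n)]
    for _ in range(rounds):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        bag_labels = [y[i] for i in bag]
        if len(set(bag_labels)) < 2:
            continue                                  # skip bags containing a single class
        model = fit([X[i] for i in bag], bag_labels)
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:                       # score only samples the model never saw
                collected[i].append(model(X[i]))
    return [mean(s) if s else None for s in collected]


def roc_auc(scores: Sequence[Optional[float]], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pairs = [(s, t) for s, t in zip(scores, labels) if s is not None]
    pos = [s for s, t in pairs if t == 1]
    neg = [s for s, t in pairs if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

  • Because every score comes from models that never trained on the scored sample, the resulting AUC approximates performance on unseen data without requiring a separate hold-out set.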
  • FIG. 2 is a plot showing the mean scores of individual samples in the training set (on average across OOB samplings). These data show a 96% negative predictive value (NPV) and 93% sensitivity (SN).
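  • For reference, negative predictive value and sensitivity follow directly from confusion-matrix counts: NPV = TN / (TN + FN) and SN = TP / (TP + FN). The helper below is a small illustrative sketch with a hypothetical name; it assumes at least one predicted negative and at least one actual positive, since otherwise the ratios are undefined.

```python
from typing import Sequence, Tuple


def npv_and_sensitivity(
    predicted: Sequence[bool],   # True = model predicts response
    actual: Sequence[bool],      # True = patient actually responded
) -> Tuple[float, float]:
    """Return (negative predictive value, sensitivity)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    npv = tn / (tn + fn)           # of predicted non-responders, fraction truly non-responding
    sensitivity = tp / (tp + fn)   # of true responders, fraction the model identified
    return npv, sensitivity
```

  • A high NPV means that a prediction of non-response is rarely wrong, which matters when the prediction may be used to withhold a treatment.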
  • the ssGSEA biomarker model applied to the training set has the performance shown in Table 4.
  • the disease control rate (DCR) is the percentage of patients who had a treatment response (e.g., patients who achieved complete response, partial response, or stable disease to treatment) and is similar to “likelihood of response”.
  • the output scores were grouped into four quartiles Q1, Q2, Q3, and Q4, with Q1 having the lowest 25% of scores and Q4 having the highest 25% of scores. The lower the score, the lower the anticipated benefit of the drug, as evidenced by the correlation between quartile and DCR.
  • the Q1 and Q2 divisions show a low DCR (less than 10%), whereas Q3 and Q4 have a high DCR (greater than about 40%).
  • the expected DCR in response to immune-oncology (I/O) treatment for HNSCC patients is about 30%. Therefore, if a patient’s sample has a high score and the score falls into Q3 or Q4, physicians may recommend I/O treatment, as HNSCC patients in these categories have a DCR in response to I/O of greater than about 40%.
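  • The quartile analysis above can be sketched as follows. The function name is hypothetical, and the binning is rank-based (equal-sized groups after sorting by score), which is an assumption; the disclosure does not state how quartile boundaries were computed.

```python
from typing import Dict, Sequence


def dcr_by_quartile(scores: Sequence[float], controlled: Sequence[bool]) -> Dict[str, float]:
    """Disease control rate within each score quartile (Q1 = lowest scores).

    `controlled[i]` is True when patient i achieved complete response,
    partial response, or stable disease. Requires at least four samples.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    result: Dict[str, float] = {}
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]   # equal-sized rank-based bins
        result[f"Q{q + 1}"] = sum(1 for i in members if controlled[i]) / len(members)
    return result
```

  • Comparing each quartile's DCR with the overall expected DCR for the patient population indicates which score ranges are enriched for patients likely to benefit.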
  • The HNSCC ssGSEA biomarker model achieved superior results compared to a clinically used PD-L1 biomarker model (FIG. 6).
  • the present disclosure provides a method according to the following embodiments:
  • Embodiment 1 A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features, thereby generating an output; and generating a determination indicative of a treatment outcome based on the output.
  • Embodiment 2 The method of embodiment 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
  • Embodiment 3 The method of embodiment 2, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, interferon gamma, antigen presentation, T-cell exhaustion, or any combination thereof.
  • Embodiment 4 The method of embodiment 2, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
  • Embodiment 5 The method of embodiment 2, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets from a molecular signature database (MSigDB).
  • Embodiment 6 The method of embodiment 5, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
  • Embodiment 7. The method of embodiment 1, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
  • Embodiment 8 The method of embodiment 1, further comprising obtaining the biological sample of said subject.
  • Embodiment 9 The method of embodiment 8, wherein said biological sample is a solid tumor or liquid biopsy.
  • Embodiment 10 The method of embodiment 8, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
  • Embodiment 11 The method of embodiment 8, wherein said biological sample comprises cancer tissue.
  • Embodiment 12 The method of embodiment 11, wherein said cancer tissue comprises tumor-infiltrating immune cells.
  • Embodiment 13 The method of embodiment 11, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
  • Embodiment 14 The method of embodiment 1, further comprising processing said biological sample to prevent or inhibit tissue degradation.
  • Embodiment 15 The method of embodiment 14, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
  • Embodiment 16 The method of embodiment 1, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
  • Embodiment 17 The method of embodiment 16, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
  • Embodiment 18 The method of embodiment 1, wherein said disease or condition is cancer.
  • Embodiment 19 The method of embodiment 18, wherein said cancer is a solid cancer or a hematopoietic cancer.
  • Embodiment 20 The method of embodiment 18, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
  • Embodiment 21 The method of embodiment 20, further comprising selecting said subject for prediction of said treatment outcome based on said status.
  • Embodiment 22 The method of embodiment 21, wherein said treatment outcome corresponds to one or more cancer treatments.
  • Embodiment 23 The method of embodiment 22, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • Embodiment 24 The method of embodiment 22, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
  • Embodiment 25 The method of embodiment 24, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
  • Embodiment 26 A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects have a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
  • Embodiment 27 The method of embodiment 26, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
  • Embodiment 28 The method of embodiment 26, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.

Abstract

Disclosed herein are platforms, systems, methods, and media for analyzing gene expression data to predict treatment outcome for cancer patients. Machine learning models can be used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.

Description

MACHINE LEARNING SYSTEMS AND METHODS FOR GENE SET ENRICHMENT ANALYSIS AND SCORING
CROSS REFERENCE
[0001] This application claims the benefit of U.S. Provisional App. No. 63/346,718, filed on May 27, 2022, which is incorporated by reference in its entirety herein.
BACKGROUND
[0002] Cancer is a complex group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Millions of new cases of cancer occur globally each year. Understanding the immune and tumor profile may help with diagnosis and treatment.
SUMMARY
[0003] Disclosed herein, in some embodiments, are systems and methods for analyzing complex data signals using artificial intelligence or machine learning algorithms to determine output pertaining to the state or status of one or more parameters.
[0004] In one aspect, the present disclosure discloses a method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition comprising obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
[0005] In some embodiments, the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway. In some embodiments, the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof. In some embodiments, the plurality of gene sets comprises 1, 2, 3, 4, 5, or 6 gene sets listed in Table 1. In some embodiments, the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
[0006] In some embodiments, the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
[0007] In some embodiments, the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database. In some embodiments, the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0008] In some embodiments, the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0009] In some embodiments, the method disclosed herein further comprises obtaining the biological sample of said subject. In some embodiments, the biological sample is a solid tumor or liquid biopsy. In some embodiments, the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample. In some embodiments, the biological sample comprises cancer tissue. In some embodiments, the cancer tissue comprises tumor-infiltrating immune cells. In some embodiments, the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0010] In some embodiments, the method disclosed herein further comprises processing said biological sample to prevent or inhibit tissue degradation. In some embodiments, the biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0011] In some embodiments, the method disclosed herein further comprises extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data. In some embodiments, the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0012] In some embodiments, the disease or condition is cancer. In some embodiments, the cancer is a solid cancer or a hematopoietic cancer. In some embodiments, the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0013] In some embodiments, the method disclosed herein further comprises selecting said subject for prediction of said treatment outcome based on said status. In some embodiments, the treatment outcome corresponds to one or more cancer treatments. In some embodiments, the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
[0014] In some embodiments, the method disclosed herein further comprises selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
[0015] In some embodiments, the subject is treated based at least on said determination indicative of said treatment outcome.
[0016] In some embodiments, the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
[0017] Also provided herein is a computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising a processor and a non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: (i) obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; (ii) conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; (iii) process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and (iv) generate a determination indicative of the treatment outcome based on the output.
[0018] In some embodiments, the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway. In some embodiments, the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
[0019] In some embodiments, the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1. In some embodiments, the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2. In some embodiments, the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
[0020] In some embodiments, the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database. In some embodiments, the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0021] In some embodiments, the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0022] In some embodiments, the processor is configured to obtain the gene expression data for the biological sample of said subject from a database. In some embodiments, the biological sample is a solid tumor or liquid biopsy. In some embodiments, the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample. In some embodiments, the biological sample comprises cancer tissue. In some embodiments, the cancer tissue comprises tumor-infiltrating immune cells. In some embodiments, the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0023] In some embodiments, the biological sample is processed to prevent or inhibit tissue degradation. In some embodiments, the biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0024] In some embodiments, the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data. In some embodiments, the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0025] In some embodiments, the disease or condition is cancer. In some embodiments, the cancer is a solid cancer or a hematopoietic cancer. In some embodiments, the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0026] In some embodiments, the subject is selected for prediction of said treatment outcome based on said status. In some embodiments, the treatment outcome corresponds to one or more cancer treatments. In some embodiments, the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
[0027] In some embodiments, the subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
[0028] In some embodiments, the subject is treated based at least on said determination indicative of said treatment outcome.
[0029] In some embodiments, the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
[0030] Also provided herein is a method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
[0031] In some embodiments, the plurality of biological samples is obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
[0032] In some embodiments, the method further comprises configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
[0033] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0035] The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:
[0036] FIG. 1 shows a receiver operating characteristic (ROC) curve of false positive rate (FPR) vs. true positive rate (TPR) of a machine learning model trained on a gene set enrichment analysis (GSEA) training set for clinical outcome according to one or more embodiments herein. Bootstrapped datasets, which contain a subset of the whole HNSCC dataset while leaving out a subset of the dataset for testing, were iteratively generated. On each iteration, a model was generated with the bootstrapped dataset, and an AUC value was calculated from the held out test set. Shown is the mean of all the AUCs calculated. Shading represents the confidence intervals.
[0037] FIG. 2 shows a graph of training samples across out of bag (OOB) samplings of a GSEA training set for clinical outcome according to one or more embodiments herein;
[0038] FIG. 3 shows a graph of a percentage of patients that had a response to treatment (disease control rate, DCR) per score division (quartile) of a GSEA training set for clinical outcome according to one or more embodiments herein;
[0039] FIG. 4 shows a non-limiting example of a computing device; in this case, a device with one or more processors, memory, storage, and a network interface; and
[0040] FIG. 5 shows a non-limiting example of a workflow for processing a biological sample and using gene set enrichment analysis and machine learning model to predict a response to therapy or a treatment outcome.
[0041] FIG. 6 shows ROC curves of false positive rate (1 - specificity) vs. true positive rate (sensitivity) for two models: a single-sample GSEA (ssGSEA) biomarker model and a PD-L1 biomarker model. The ssGSEA model shows better performance on a Head and Neck Squamous Cell Carcinoma (HNSCC) dataset than does the clinically used PD-L1 biomarker model. For each model, the mean OOB prediction scores of the samples used to train the model are used to build a single ROC curve with a single AUC value. FIG. 1 and FIG. 6 use slightly different forms of the GSEA biomarker model based on the same dataset.
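The resampling evaluation described for FIG. 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the figures: sampling with replacement, the percentile confidence interval, the toy centroid scoring model, the function names, and the data are all hypothetical.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_model(X, y):
    """Hypothetical stand-in model: project onto (mean of positives - mean of negatives)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    w = [mean(p[d] for p in pos) - mean(n[d] for n in neg) for d in range(len(X[0]))]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def bootstrap_auc(X, y, fit, n_iter=200, seed=0):
    """Mean held-out AUC and a 95% percentile interval over bootstrap iterations."""
    rng = random.Random(seed)
    n = len(X)
    aucs = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrapped training indices
        held_out = sorted(set(range(n)) - set(idx))   # samples left out of this draw
        train_y = [y[i] for i in idx]
        if len(set(train_y)) < 2 or not held_out:
            continue  # cannot train or test on this draw
        model = fit([X[i] for i in idx], train_y)
        value = auc([model(X[i]) for i in held_out], [y[i] for i in held_out])
        if value is not None:
            aucs.append(value)
    aucs.sort()
    lower = aucs[int(0.025 * (len(aucs) - 1))]
    upper = aucs[int(0.975 * (len(aucs) - 1))]
    return mean(aucs), (lower, upper)

# Hypothetical enrichment-score features (two gene sets) and outcomes (1 = responder).
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25],
     [0.9, 0.8], [0.8, 0.95], [0.85, 0.7], [0.7, 0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
mean_auc, interval = bootstrap_auc(X, y, fit_centroid_model)
```

The samples left out of each draw play the role of the held out test set; the same left-out samples are what an out of bag (OOB) estimate, as mentioned for FIG. 2 and FIG. 6, would be computed from.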
DETAILED DESCRIPTION
[0042] Disclosed herein are platforms, systems, methods, and media for analyzing gene expression data to predict treatment outcome for cancer patients. Machine learning models can be trained and used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.
[0043] While some approaches seek to identify the fraction or amount of immune cell types that have infiltrated a tumor sample and leverage this information to make predictions, such approaches tend to rely on deconvolution algorithms. The expression level of one or more biomarker genes may be quantified directly and combined with immune cell information to make up a feature set for statistical analysis. By contrast, the instant disclosure includes the discovery that a computationally simpler and more coherent approach using gene set enrichment analysis can provide accurate predictions without relying on such algorithms for quantifying immune cells within a sample. Thus, gene set enrichment scores may be directly used as features in a machine learning model to predict treatment outcome (e.g., response to immunotherapy) without going through an unnecessary intermediate step of deconvolving gene expression data to quantify immune cells and then using the quantified numbers as input features. Instead of using features corresponding to individual genetic biomarkers, this approach of using gene sets has demonstrated surprisingly accurate performance across multiple cancer types such as HNSCC, in which it has achieved superior results compared to the clinically used biomarker programmed death-ligand 1 (PD-L1, also known as CD274) (FIG. 6). PD-L1 inhibits the adaptive immune response and is often expressed at high levels in cancer cells, and therefore was proposed as a potential target for cancer immunotherapy in the clinic.
[0044] The systems and methods disclosed herein can provide highly accurate evaluations or determinations indicative of an outcome. Examples of performance metrics include accuracy, specificity, sensitivity, positive predictive value, negative predictive value, and the receiver operating characteristic (ROC) curve together with the area under it (AUROC). Any combination of these metrics may be determined for a machine learning model or classifier by testing it against a set of independent samples. True positive (TP) is a positive test result that detects the condition when the condition is present (e.g., positive response to cancer treatment). True negative (TN) is a negative test result that does not detect the condition when the condition is absent. False positive (FP) is a positive test result that detects the condition when the condition is absent. False negative (FN) is a negative test result that does not detect the condition when the condition is present. The performance metrics of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value can then be defined according to the following formulas:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Specificity (“true negative rate”) = TN / (TN + FP)
Sensitivity (“true positive rate”) = TP / (TP + FN)
Positive predictive value (PPV or “precision”) = TP / (TP + FP)
Negative predictive value (NPV) = TN / (TN + FN)
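The formulas above can be computed directly from the four counts. The following is a minimal sketch; the function name and the example counts are hypothetical, and it assumes every denominator is non-zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Performance metrics from true/false positive and negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "specificity": tn / (tn + fp),   # true negative rate
        "sensitivity": tp / (tp + fn),   # true positive rate
        "ppv": tp / (tp + fp),           # precision
        "npv": tn / (tn + fn),
    }

# Hypothetical counts for 100 independent test samples.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these illustrative counts, accuracy is 70/100, specificity 30/40, sensitivity 40/60, PPV 40/50, and NPV 30/50.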
[0045] The AUROC can be determined by creating the ROC curve, which entails plotting the true positive rate (TPR, equal to sensitivity) against the false positive rate (FPR, equal to 1 - specificity) across classification thresholds; the area under this curve varies between 0 and 1. A sample may be evaluated according to the systems and methods disclosed herein to generate an evaluation or determination, such as a prediction of treatment outcome, that provides a minimum threshold of performance. In some cases, the analytical algorithm or module (e.g., comprising a machine learning model) has an accuracy of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a specificity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a sensitivity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a PPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an NPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an AUROC of at least about 0.6, 0.7, 0.8, 0.85, 0.9, or 0.95.
[0046] In some embodiments, the methods disclosed herein comprise processing a biological sample to obtain gene expression data and performing gene set enrichment analysis on the gene expression data to generate an evaluation or prediction of outcome.
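The ROC construction and AUROC described in paragraph [0045] can be illustrated with a short sketch. The function names, scores, and labels below are hypothetical; the area is computed with the trapezoidal rule, which is one common choice rather than a method stated in this disclosure.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical model scores and true outcomes (1 = responder, 0 = non-responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auroc(roc_points(scores, labels))  # 8 of 9 positive/negative pairs are ordered correctly
```

In this toy example eight of the nine responder/non-responder pairs are ranked correctly, so the area is 8/9.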
[0047] An illustrative and non-limiting embodiment of a workflow process is depicted in FIG. 5. In a first step a biological sample is processed to extract RNA 501. The biological sample may be a formalin-fixed paraffin-embedded (FFPE) sample. Next, the extracted RNA is used to generate an mRNA-Seq library 502. Various suitable methods may be used for library generation including commercial kits such as, for example, the QuantSeq 3’ mRNA-Seq library prep kit. Next Generation Sequencing is then performed on the library 503. Various suitable platforms may be used for the sequencing, for example, the NextSeq platform by Illumina. Next, single-sample Gene Set Enrichment Analysis (ssGSEA) is performed on the gene expression data generated from the sequencing 504. Various gene sets may be used including independently curated gene sets as well as from public databases such as, for example, gene sets obtained from MSigDB. The gene sets can be derived from various collections such as hallmark gene sets, positional gene sets, curated gene sets, chemical and genetic perturbations, canonical pathways, regulatory target, microRNA targets, transcription factor targets, computational gene sets, cancer gene neighborhoods, cancer modules, ontology gene sets, Gene Ontology derived gene sets, oncogenic signature gene sets, immunologic signature gene sets, cell type signature gene sets, or any combination thereof. Subsets of the canonical pathways gene sets include gene sets derived from BioCarta pathway database, KEGG pathway database, PID pathway database, Reactome pathway database, and WikiPathways pathway database. The ssGSEA can be used to generate an output corresponding to the gene sets that have been evaluated using the gene expression data. The output can be a metric or a score, for example, an enrichment score for each gene set. Accordingly, the machine learning model analyzes the enrichment scores corresponding to the gene sets to predict a response to therapy 505. 
Non-limiting examples of gene sets suitable for use according to the systems and methods disclosed herein are provided in Table 2.
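The scoring step 504 can be sketched in simplified form. The code below is an assumption-laden illustration loosely modeled on the rank-based running-sum idea behind ssGSEA; it is not the implementation used in this disclosure or in any particular software package. The weighting exponent, the use of ranks as weights, and all gene and gene set names are hypothetical placeholders.

```python
def enrichment_score(expression, gene_set, alpha=0.25):
    """Simplified ssGSEA-style score for one sample.

    expression: dict mapping gene name -> expression value.
    gene_set:   set of gene names.
    alpha:      rank-weighting exponent (0.25 is an assumption of this sketch).
    """
    ranked = sorted(expression, key=expression.get, reverse=True)  # high to low
    n = len(ranked)
    in_set = [gene in gene_set for gene in ranked]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must overlap the sample without covering it entirely")
    # The most highly expressed gene gets rank n, the least expressed gets rank 1.
    weights = [(n - i) ** alpha for i in range(n)]
    hit_total = sum(w for w, hit in zip(weights, in_set) if hit)
    miss_step = 1.0 / (n - n_hits)
    hit_cdf = miss_cdf = score = 0.0
    for w, hit in zip(weights, in_set):
        if hit:
            hit_cdf += w / hit_total
        else:
            miss_cdf += miss_step
        score += hit_cdf - miss_cdf  # accumulate the gap between the two running sums
    return score

# Hypothetical sample and gene sets (placeholder names, not real genes or MSigDB sets).
sample = {"GENE_A": 9.1, "GENE_B": 8.4, "GENE_C": 5.0,
          "GENE_D": 4.2, "GENE_E": 1.3, "GENE_F": 0.7}
gene_sets = {"SET_HIGH": {"GENE_A", "GENE_B"}, "SET_LOW": {"GENE_E", "GENE_F"}}

# One enrichment score per gene set becomes one input feature for the model.
features = {name: enrichment_score(sample, genes) for name, genes in gene_sets.items()}
```

A gene set whose members sit near the top of the expression ranking receives a positive score, and one whose members sit near the bottom receives a negative score.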
[0048] In this example, each enrichment score for a gene set forms a feature that makes up part of the input to the trained machine learning model. The response to therapy can be any suitable metric, indicator, or classification. For example, a regression model may output a number between 0 and 1 indicative of responsiveness to therapy. Alternatively, a classifier may generate a classification between two or more categories such as, for example, response to treatment, partial response to treatment, no response to treatment, survival, etc. Various suitable machine learning models can be used. Support vector machine (SVM) is suitable for both regression and classification analysis and can provide a high level of accuracy without requiring significant computing power. Alternatively, or in combination, principal component analysis (PCA) can be used to analyze the ssGSEA output. In some cases, the machine learning model is configured to process at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 gene set metrics (e.g., enrichment scores). In some cases, the machine learning model is configured to process no more than 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, or 500 gene sets. In some cases, each gene set independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes. In some cases, each gene set independently comprises no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes. As an illustrative example, gene set enrichment analysis may be performed on 10 different gene sets, one of which has 10 genes and one of which has 200 genes. In some cases, the systems and methods disclosed herein utilize 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 gene sets, each of which independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes and/or no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes.
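To make the use of enrichment scores as model features concrete, the following sketch trains a small linear SVM on per-sample enrichment scores. It is illustrative only: the batch subgradient training loop, the hyperparameters, the function names, and the toy data are assumptions, and a production system would more likely rely on an established library implementation.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM: minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent.

    X: list of feature vectors (one enrichment score per gene set).
    y: labels in {+1, -1} (here +1 = responder, -1 = non-responder).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this sample
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify one sample from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical enrichment scores for two gene sets across six training samples.
X_train = [[2.0, 1.5], [1.5, 2.0], [2.5, 2.2],
           [-2.0, -1.5], [-1.5, -2.0], [-2.2, -2.5]]
y_train = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X_train, y_train)
predicted = [predict(w, b, x) for x in X_train]
```

Each row of `X_train` corresponds to one sample and each column to one gene set, so adding gene sets widens the feature vector without changing the training code.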
[0049] In some embodiments, one or more of the gene sets used as features in the predictive model (e.g., machine learning model) do not utilize the full list of genes within a known gene set. For example, fewer than 100% of the genes in a given gene set may be used for calculating an output or metric for that gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set. For instance, gene set 1 (“HALLMARK EPITHELIAL MESENCHYMAL TRANSITION”) from Table 2 includes 200 genes associated with epithelial to mesenchymal transition. As an illustrative example, the gene set enrichment analysis performed according to the systems and methods disclosed herein may utilize 50% of the genes in this gene set with respect to epithelial to mesenchymal transition in combination with certain independently determined percentages of other gene sets in Table 1.
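Table 1
[Table 1 appears as an image in the original publication (imgf000013_0001).]
Table 2
[Table 2 appears as images in the original publication (imgf000013_0002, imgf000014_0001).]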
[0050] As discussed above, any combination of genes within each gene set may be used for gene set enrichment analysis to generate a corresponding output metric such as an enrichment score. Then the output for a plurality of gene sets can be used as input features provided to a machine learning algorithm or model to generate a composite score indicative of a prediction such as an outcome or treatment outcome. The identities of the genes making up each gene set listed in Table 2 can be found on the publicly accessible database MSigDB and are also listed in Table 3, which shows the gene member identification used by MSigDB alongside the corresponding NCBI Gene ID and Gene Symbol.
Table 3
[Table 3 appears as a series of images in the original publication (imgf000014_0002 through imgf000044_0001).]
[0051] The output can make up the features that are processed using an algorithm such as a trained model generated using machine learning to generate an evaluation such as a predicted treatment outcome.
[0052] Provided herein are systems and methods for generating a predicted treatment outcome from a sample of a subject. In some instances, the subject has or is suspected of having a disease or disorder. The disease or disorder can be a cancer. In some instances, the predicted treatment outcome is for an immunotherapy targeting a cancer.
[0053] In some instances, the methods disclosed herein comprise obtaining a sample from a subject. In some instances, the sample is any fluid or other material derived from the body of a normal or diseased subject including, but not limited to, blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, milk, amniotic fluid, bile, ascites fluid, organ or tissue extract, and culture fluid in which any cells or tissue preparation from a subject has been incubated. In some instances, the sample is obtained from skin, blood, brain, bladder, bone, bone marrow, breast, colon, stomach, esophagus, ovary, uterus, gallbladder, fallopian tube, testicle, kidney, liver, pancreas, adrenal gland, cervix, endometrium, head or neck, lung, prostate, thymus, thyroid, lymph node, or urinary bladder. In some instances, the sample is a cancer sample or biopsy. The cancer sample is typically a solid tumor sample or a liquid tumor sample. For example, the cancer sample can be obtained from excised tissue. In some instances, the sample is fresh, frozen, or fixed. In some instances, a fixed sample is paraffin-embedded or fixed with formalin, formaldehyde, or glutaraldehyde. In some instances, the sample is formalin-fixed paraffin-embedded.
[0054] In some instances, the sample is stored after it has been collected, but before additional steps are to be performed. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. In some instances, the sample is stored in a solution comprising glycerol, glycol, dimethyl sulfoxide, growth media, nutrient broth or any combination thereof. The sample may be stored for any suitable period of time. In some instances, the sample is stored for any period of time and remains suitable for downstream applications. For example, the sample is stored for any period of time before nucleic acid (e.g., ribonucleic acid (RNA) or deoxyribonucleic acid (DNA)) extraction. In some instances, the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months. In some instances, the sample is stored for at least 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, 11 years, 12 years, or more than 12 years.
[0055] Methods and systems as described herein comprise generating an immune-oncology profile from a sample of a subject, wherein the sample comprises a nucleic acid molecule. In some instances, the nucleic acid molecule is RNA, DNA, fragments, or combinations thereof. In some instances, after a sample is obtained, the sample is processed further before analysis. In some instances, the sample is processed to extract the nucleic acid molecule from the sample. In some instances, no extraction or processing procedures are performed on the sample. In some instances, the nucleic acid is extracted using any technique that does not interfere with subsequent analysis. Extraction techniques include, for example, alcohol precipitation using ethanol, methanol or isopropyl alcohol. In some instances, extraction techniques use phenol, chloroform, or any combination thereof. In some instances, extraction techniques use a column or resin based nucleic acid purification scheme such as those commonly sold commercially. In some instances, following extractions, the nucleic acid molecule is purified. In some instances, the nucleic acid molecule is further processed. For example, following extraction and purification, RNA is further reverse transcribed to cDNA. In some instances, processing of the nucleic acid comprises amplification. Following extraction or processing, in some instances, the nucleic acid is stored in water, Tris buffer, or Tris-EDTA buffer before subsequent analysis. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. 
In some instances, the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months.
[0056] A nucleic acid molecule obtained from a sample may be characterized by factors such as integrity of the nucleic acid molecule or size of the nucleic acid molecule. In some instances, the nucleic acid molecule is DNA. In some instances, the nucleic acid molecule is RNA. In some instances, the RNA or DNA comprises a specific integrity. For example, the RNA integrity number (RIN) of the RNA is no more than about 2. In some instances, the RNA molecules in a sample have a RIN of about 2 to about 10. In some instances, the RNA molecules in a sample have a RIN of at least about 2. In some instances, the RNA molecules in a sample have a RIN of at most about 10. In some instances, the RNA molecules in a sample have a RIN of about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 7, about 2 to about 8, about 2 to about 9, about 2 to about 10, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 7, about 3 to about 8, about 3 to about 9, about 3 to about 10, about 4 to about 5, about 4 to about 6, about 4 to about 7, about 4 to about 8, about 4 to about 9, about 4 to about 10, about 5 to about 6, about 5 to about 7, about 5 to about 8, about 5 to about 9, about 5 to about 10, about 6 to about 7, about 6 to about 8, about 6 to about 9, about 6 to about 10, about 7 to about 8, about 7 to about 9, about 7 to about 10, about 8 to about 9, about 8 to about 10, or about 9 to about 10. The RNA molecule in a sample may be characterized by size.
In some instances, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%, or more of the RNA molecules in a sample are at least 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, or more than 400 nucleotides in size. In some instances, the RNA molecules in the sample are at least 200 nucleotides in size. In some instances, the RNA molecules of at least 200 nucleotides in size comprise a percentage of the sample (DV200). For example, the percentage is at least or about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95%. In some instances, the RNA molecules in a sample have a DV200 value of about 10% to about 90%. In some instances, the RNA molecules in a sample have a DV200 value of at least about 10%. In some instances, the RNA molecules in a sample have a DV200 value of at most about 90%. In some instances, the RNA molecules in a sample have a DV200 value of about 10% to about 20%, about 10% to about 30%, about 10% to about 40%, about 10% to about 50%, about 10% to about 60%, about 10% to about 70%, about 10% to about 80%, about 10% to about 90%, about 20% to about 30%, about 20% to about 40%, about 20% to about 50%, about 20% to about 60%, about 20% to about 70%, about 20% to about 80%, about 20% to about 90%, about 30% to about 40%, about 30% to about 50%, about 30% to about 60%, about 30% to about 70%, about 30% to about 80%, about 30% to about 90%, about 40% to about 50%, about 40% to about 60%, about 40% to about 70%, about 40% to about 80%, about 40% to about 90%, about 50% to about 60%, about 50% to about 70%, about 50% to about 80%, about 50% to about 90%, about 60% to about 70%, about 60% to about 80%, about 60% to about 90%, about 70% to about 80%, about 70% to about 90%, or about 80% to about 90%.
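The DV200 definition above (the percentage of RNA molecules that are at least 200 nucleotides in size) can be illustrated with a minimal sketch. It assumes a plain list of fragment lengths is available; instruments typically derive DV200 from an electropherogram trace instead, so this count-based version is only an approximation. The function names and the default thresholds in `passes_qc` are hypothetical, chosen from the lower bounds mentioned in the text (RIN of at least about 2, DV200 of at least about 10%).

```python
def dv200(fragment_lengths, threshold=200):
    """Percentage of fragments that are at least `threshold` nucleotides long."""
    if not fragment_lengths:
        raise ValueError("fragment_lengths must not be empty")
    long_count = sum(1 for length in fragment_lengths if length >= threshold)
    return 100.0 * long_count / len(fragment_lengths)


def passes_qc(rin, dv200_percent, min_rin=2.0, min_dv200=10.0):
    """Illustrative quality gate; the cut-offs are assumptions, not requirements."""
    return rin >= min_rin and dv200_percent >= min_dv200


if __name__ == "__main__":
    lengths = [100, 250, 300, 150]
    value = dv200(lengths)
    print(f"DV200 = {value:.1f}%  QC pass: {passes_qc(7.0, value)}")
```

A production workflow would more likely take the RIN and DV200 values reported by the fragment analyzer rather than recompute them.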
[0057] In some instances, after the samples have been obtained and nucleic acid molecule isolated, the nucleic acid molecule is prepared for sequencing. In some instances, a sequencing library is prepared. Numerous library generation methods have been described. In some instances, methods for library generation comprise addition of a sequencing adapter. Sequencing adapters may be added to the nucleic acid molecule by ligation. In some instances, library generation comprises an end-repair reaction.
[0058] Sometimes, library generation for sequencing comprises an enrichment step. For example, coding regions of the mRNA are enriched. In some instances, the enrichment step is for a subset of genes. In some instances, the enrichment step comprises using a bait set. The bait set may be used to enrich for genes used for specific downstream applications. A bait set generally refers to a set of baits targeted toward a selected set of genomic regions of interest. For example, a bait set may be selected for genomic regions relating to at least one of immune modulatory molecule expression, cell type and ratio, or mutational burden. In some instances, one bait set is used for determining immune modulatory molecule expression, a second bait set is used for determining cell type and ratio, and a third bait set is used for determining mutational burden. In some instances, the same bait set is used for determining immune modulatory molecule expression, cell type and ratio, mutational burden, or combinations thereof. In some instances, a bait set comprises at least one unique molecular identifier (UMI). The term “unique molecular identifier (UMI)” or “UMI” as used herein refers to nucleic acid having a sequence which can be used to identify and/or distinguish one or more first molecules to which the UMI is conjugated from one or more second molecules. In some instances, the UMI is conjugated to one or more target molecules of interest or amplification products thereof. UMIs may be single or double stranded.
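One common use of UMIs is to distinguish reads that derive from different original molecules from reads that are amplification copies of the same molecule. The sketch below is a simplified illustration and is not taken from the disclosure: it assumes each read is represented as a hypothetical `(umi, start_position, sequence)` tuple and treats reads sharing both UMI and start position as copies of one molecule. It does not correct sequencing errors in the UMI itself.

```python
from collections import defaultdict


def collapse_by_umi(reads):
    """Group reads by (UMI, start position) and return the copy count per molecule.

    `reads` is an iterable of (umi, start_position, sequence) tuples.
    """
    groups = defaultdict(list)
    for umi, start, sequence in reads:
        groups[(umi, start)].append(sequence)
    return {key: len(seqs) for key, seqs in groups.items()}


if __name__ == "__main__":
    example_reads = [
        ("AAT", 100, "ACGTACGT"),
        ("AAT", 100, "ACGTACGT"),  # amplification duplicate of the read above
        ("CCG", 100, "ACGTACGT"),  # same position, different original molecule
    ]
    molecules = collapse_by_umi(example_reads)
    print(f"{len(molecules)} unique molecules from {len(example_reads)} reads")
```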
[0059] The systems and methods disclosed herein provide for the sequencing of a number of genes. In some instances, the number of genes is at least about 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, or more than 10000 genes. In some instances, the number of genes to be sequenced is in a range of about 500 to about 1000 genes. In some instances, the number of genes to be sequenced is at least about 200. In some instances, the number of genes to be sequenced is at most about 10,000. In some instances, the number of genes to be sequenced is in a range of about 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 6,000, 200 to 8,000, 200 to 10,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 6,000, 500 to 8,000, 500 to 10,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 6,000, 1,000 to 8,000, 1,000 to 10,000, 2,000 to 4,000, 2,000 to 6,000, 2,000 to 8,000, 2,000 to 10,000, 4,000 to 6,000, 4,000 to 8,000, 4,000 to 10,000, 6,000 to 8,000, 6,000 to 10,000, or 8,000 to 10,000.
[0060] Sequencing may be performed with any appropriate sequencing technology. Examples of sequencing methods include, but are not limited to single molecule real-time sequencing, Polony sequencing, sequencing by ligation, reversible terminator sequencing, proton detection sequencing, ion semiconductor sequencing, nanopore sequencing, electronic sequencing, pyrosequencing, Maxam-Gilbert sequencing, chain termination (e.g., Sanger) sequencing, +S sequencing, or sequencing by synthesis.
[0061] Sequencing methods may include, but are not limited to, one or more of: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, and primer walking. Sequencing may generate sequencing reads (“reads”), which may be processed (e.g., by alignment) to yield longer sequences, such as consensus sequences. Such sequences may be compared to references (e.g., a reference genome or control) to identify variants, for example.
[0062] An average read length from sequencing may vary. In some instances, the average read length is at least about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, or more than 80000 base pairs. In some instances, the average read length is in a range of about 100 to 80,000. In some instances, the average read length is at least about 100. In some instances, the average read length is at most about 80,000.
In some instances, the average read length is in a range of about 100 to 200, 100 to 300, 100 to 500, 100 to 1,000, 100 to 2,000, 100 to 4,000, 100 to 8,000, 100 to 10,000, 100 to 20,000, 100 to 40,000, 100 to 80,000, 200 to 300, 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 8,000, 200 to 10,000, 200 to 20,000, 200 to 40,000, 200 to 80,000, 300 to 500, 300 to 1,000, 300 to 2,000, 300 to 4,000, 300 to 8,000, 300 to 10,000, 300 to 20,000, 300 to 40,000, 300 to 80,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 8,000, 500 to 10,000, 500 to 20,000, 500 to 40,000, 500 to 80,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 8,000, 1,000 to 10,000, 1,000 to 20,000, 1,000 to 40,000, 1,000 to 80,000, 2,000 to 4,000, 2,000 to 8,000, 2,000 to 10,000, 2,000 to 20,000, 2,000 to 40,000, 2,000 to 80,000, 4,000 to 8,000, 4,000 to 10,000, 4,000 to 20,000, 4,000 to 40,000, 4,000 to 80,000, 8,000 to 10,000, 8,000 to 20,000, 8,000 to 40,000, 8,000 to 80,000, 10,000 to 20,000, 10,000 to 40,000, 10,000 to 80,000, 20,000 to 40,000, 20,000 to 80,000, or 40,000 to 80,000.
[0063] In some instances, the number of nucleotides that are sequenced is at least or about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 2000, 2500, 3000, or more than 3000 nucleotides. In some instances, the number of nucleotides that are sequenced is about 5 to about 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is at least 5 nucleotides. In some instances, the number of nucleotides that are sequenced is at most 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is 5 to 50, 5 to 100, 5 to 200, 5 to 400, 5 to 600, 5 to 800, 5 to 1,000, 5 to 1,500, 5 to 2,000, 5 to 2,500, 5 to 3,000, 50 to 100, 50 to 200, 50 to 400, 50 to 600, 50 to 800, 50 to 1,000, 50 to 1,500, 50 to 2,000, 50 to 2,500, 50 to 3,000, 100 to 200, 100 to 400, 100 to 600, 100 to 800, 100 to 1,000, 100 to 1,500, 100 to 2,000, 100 to 2,500, 100 to 3,000, 200 to 400, 200 to 600, 200 to 800, 200 to 1,000, 200 to 1,500, 200 to 2,000, 200 to 2,500, 200 to 3,000, 400 to 600, 400 to 800, 400 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 600 to 1,500, 600 to 2,000, 600 to 2,500, 600 to 3,000, 800 to 1,000, 800 to 1,500, 800 to 2,000, 800 to 2,500, 800 to 3,000, 1,000 to 1,500, 1,000 to 2,000, 1,000 to 2,500, 1,000 to 3,000, 1,500 to 2,000, 1,500 to 2,500, 1,500 to 3,000, 2,000 to 2,500, 2,000 to 3,000, or 2,500 to 3,000 nucleotides.
[0064] Sequencing methods may include a barcoding or “tagging” step. In some instances, barcoding (or “tagging”) can allow for generation of a population of samples of nucleic acids, wherein the sample from which each nucleic acid originated can be identified. In some instances, the barcode comprises oligonucleotides that are ligated to the nucleic acids. In some instances, the barcode is ligated using an enzyme, including but not limited to, E. coli ligase, T4 ligase, mammalian ligases (e.g., DNA ligase I, DNA ligase II, DNA ligase III, DNA ligase IV), thermostable ligases, and fast ligases.
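Identifying the sample of origin from a barcode can be sketched as a simple demultiplexing step. The example below is illustrative only and rests on several assumptions that the disclosure does not specify: the barcode is an oligonucleotide occupying the first `barcode_len` bases of each read, barcodes are matched exactly with no error tolerance, and the names `demultiplex` and `barcode_to_sample` are hypothetical.

```python
def demultiplex(reads, barcode_to_sample, barcode_len):
    """Assign reads to samples by exact match on a leading barcode.

    Returns (reads_by_sample, unassigned), where reads_by_sample maps each
    sample name to its reads with the barcode trimmed off.
    """
    reads_by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)  # unknown or error-containing barcode
        else:
            reads_by_sample[sample].append(insert)
    return reads_by_sample, unassigned


if __name__ == "__main__":
    barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
    pooled = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTTTTTTT", "NNNNCCCCCC"]
    assigned, leftover = demultiplex(pooled, barcodes, barcode_len=4)
    print(assigned)
    print(leftover)
```

Real demultiplexers usually tolerate one or more mismatches in the barcode; exact matching is used here to keep the sketch short.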
[0065] Barcoding or tagging may occur using various types of barcodes or tags. Examples of barcodes or tags include, but are not limited to, a radioactive barcode or tag, a fluorescent barcode or tag, an enzyme, a chemiluminescent barcode or tag, and a colorimetric barcode or tag. In some instances, the barcode or tag is a fluorescent barcode or tag. In some instances, the fluorescent barcode or tag comprises a fluorophore. In some instances, the fluorophore is an aromatic or heteroaromatic compound. In some instances, the fluorophore is a pyrene, anthracene, naphthalene, acridine, stilbene, benzoxazole, indole, benzindole, oxazole, thiazole, benzothiazole, cyanine, carbocyanine, salicylate, anthranilate, xanthene dye, or coumarin.
Examples of xanthene dyes include, e.g., fluorescein and rhodamine dyes. Fluorescein and rhodamine dyes include, but are not limited to, 6-carboxyfluorescein (FAM), 2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein (JOE), tetrachlorofluorescein (TET), 6-carboxyrhodamine (R6G), N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA), and 6-carboxy-X-rhodamine (ROX). In some instances, the fluorescent barcode or tag also includes the naphthylamine dyes that have an amino group in the alpha or beta position. For example, naphthylamino compounds include 1-dimethylaminonaphthyl-5-sulfonate, 1-anilino-8-naphthalene sulfonate and 2-p-toluidinyl-6-naphthalene sulfonate, 5-(2'-aminoethyl)aminonaphthalene-1-sulfonic acid (EDANS). Examples of coumarins include, e.g., 3-phenyl-7-isocyanatocoumarin; acridines, such as 9-isothiocyanatoacridine and acridine orange; N-(p-(2-benzoxazolyl)phenyl) maleimide; cyanines, such as, e.g., indodicarbocyanine 3 (Cy3), indodicarbocyanine 5 (Cy5), indodicarbocyanine 5.5 (Cy5.5), 3-(-carboxy-pentyl)-3'-ethyl-5,5'-dimethyloxacarbocyanine (CyA); 1H,5H,11H,15H-Xantheno[2,3,4-ij:5,6,7-i'j']diquinolizin-18-ium, 9-[2 (or 4)-[[[6-[2,5-dioxo-1-pyrrolidinyl)oxy]-6-oxohexyl]amino]sulfonyl]-4 (or 2)-sulfophenyl]-2,3,6,7,12,13,16,17-octahydro-inner salt (TR or Texas Red); or BODIPY™ dyes.
[0066] In some instances, a different barcode or tag is supplied to a sample comprising nucleic acids. Examples of barcode lengths include barcode sequences comprising, without limitation, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more bases in length. Examples of barcode lengths include barcode sequences comprising, without limitation, from 1-5, 1-10, 5-20, or 1-25 bases in length. Barcode systems may be in base 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or a similar coding scheme. In some instances, a number of barcodes is at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000, 25000, 30000, 40000, 50000, 100000, 500000, 1000000, or more than 1000000 barcodes. In some instances, a number of barcodes is in a range of 1-1000000 barcodes. In some instances, the number of barcodes is in a range of about 1-10, 1-50, 1-100, 1-500, 1-1000, 1-5000, 1-10000, 1-50000, 1-100000, 1-500000, 1-1000000, 10-50, 10-100, 10-500, 10-1000, 10-5000, 10-10000, 10-50000, 10-100000, 10-500000, 10-1000000, 50-100, 50-500, 50-1000, 50-5000, 50-10000, 50-50000, 50-100000, 50-500000, 50-1000000, 100-500, 100-1000, 100-5000, 100-10000, 100-50000, 100-100000, 100-500000, 100-1000000, 500-1000, 500-5000, 500-10000, 500-50000, 500-100000, 500-500000, 500-1000000, 1000-5000, 1000-10000, 1000-50000, 1000-100000, 1000-500000, 1000-1000000, 5000-10000, 5000-50000, 5000-100000, 5000-500000, 5000-1000000, 10000-50000, 10000-100000, 10000-500000, 10000-1000000, 50000-100000, 50000-500000, 50000-1000000, 100000-500000, 100000-1000000, or 500000-1000000 barcodes.
[0067] Following sequencing of a sample, sequencing data as described herein can be used for performing gene set enrichment analysis. Gene Set Enrichment Analysis (GSEA) is a computational method that seeks to determine whether a predefined set of genes (typically grouped together according to some biological feature such as molecular pathway or function) demonstrates statistically significant differences between two or more categories or biological states (e.g., treatment outcome status). A predefined set of genes may be evaluated to produce an output or metric such as, for example, a score corresponding to the difference between two or more categories or biological states. Multiple sets of genes can be evaluated to generate multiple such outputs or metrics. These outputs or metrics may comprise the features of a model such as a trained machine learning model configured to generate predictions with respect to the categories or biological state. The model may be a regression that generates an output along a continuum (e.g., any value between 0 and 1) or a classifier which generates a classification for a data set.
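As an illustration of how a per-gene-set score might be produced and collected into model features, the sketch below implements the classical unweighted running-sum enrichment statistic (a Kolmogorov-Smirnov-like score). This is an assumption for illustration: the paragraph above does not state which scoring statistic is used, and the function names are hypothetical. The input `ranked_genes` is assumed to be a list of unique gene identifiers ordered from most to least associated with the category of interest.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    Walks down the ranked list, stepping up at each gene in the set and down
    otherwise, and returns the maximum deviation from zero (signed).
    """
    members = set(gene_set) & set(ranked_genes)
    n_total = len(ranked_genes)
    n_hit = len(members)
    if n_hit == 0 or n_hit == n_total:
        raise ValueError("gene set must overlap the ranked list without covering it")
    hit_step = 1.0 / n_hit
    miss_step = 1.0 / (n_total - n_hit)
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_features(ranked_genes, gene_sets):
    """Score several named gene sets; the resulting dict can serve as a feature vector."""
    return {name: enrichment_score(ranked_genes, genes) for name, genes in gene_sets.items()}


if __name__ == "__main__":
    ranking = ["A", "B", "C", "D", "E", "F"]
    sets = {"top_set": ["A", "B"], "bottom_set": ["E", "F"]}
    print(gene_set_features(ranking, sets))
```

Scores near +1 or -1 indicate that the set's genes cluster at the top or bottom of the ranking. In practice, statistical significance is usually assessed by permutation, and the scores from many gene sets would be passed to a trained regression or classification model as described above.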
[0068] Provided herein are systems and methods for processing a biological sample obtained from a subject. The sample often comprises a heterogeneous composition of different cell types and/or subtypes. Sometimes, the sample is a tumor sample. The cell types and/or subtypes that make up the sample include one or more of cancer cells, non-cancer cells, and/or immune cells. Examples of non-immune cells include salivary gland cells, mammary gland cells, lacrimal gland cells, ceruminous gland cells, eccrine sweat gland cells, apocrine sweat gland cells, sebaceous gland cells, Bowman's gland cells, Brunner's gland cells, prostate gland cells, seminal vesicle cells, bulbourethral gland cells, keratinizing epithelial cells, hair shaft cells, epithelial cells, exocrine secretory epithelial cells, uterus endometrium cells, isolated goblet cells of respiratory and digestive tracts, stomach lining mucous cells, hormone secreting cells, pituitary cells, gut and respiratory tract cells, thyroid gland cells, adrenal gland cells, chromaffin cells, Leydig cells, theca interna cells, macula densa cells of kidney, peripolar cells of kidney, mesangial cells of kidney, hepatocytes, white fat cells, brown fat cells, liver lipocytes, kidney cells, kidney glomerulus parietal cells, kidney glomerulus podocytes, kidney proximal tubule brush border cells, loop of Henle thin segment cells, kidney distal tubule cells, endothelial fenestrated cells, vascular endothelial continuous cells, synovial cells, serosal cells, squamous cells, columnar cells of endolymphatic sac with microvilli, columnar cells of endolymphatic sac without microvilli, vestibular membrane cells, stria vascularis basal cells, stria vascularis marginal cells, choroid plexus cells, respiratory tract ciliated cells, oviduct ciliated cells, uterine endometrial ciliated cells, rete testis ciliated cells, ductulus efferens ciliated cells, ciliated ependymal cells of central nervous system, organ of Corti interdental epithelial
cells, loose connective tissue fibroblasts, corneal fibroblasts, tendon fibroblasts, bone marrow reticular tissue fibroblasts, other nonepithelial fibroblasts, pericytes, skeletal muscle cells, red skeletal muscle cells, white skeletal muscle cells, intermediate skeletal muscle cells, nuclear bag cells of muscle spindle, nuclear chain cells of muscle spindle, satellite cells, cardiac muscle cells, ordinary cardiac muscle cells, nodal cardiac muscle cells, Purkinje fiber cells, smooth muscle cells, myoepithelial cells of iris, myoepithelial cells of exocrine glands, erythrocytes, megakaryocytes, monocytes, epidermal Langerhans cells, osteoclasts, sensory neurons, olfactory receptor neurons, pain-sensitive primary sensory neurons, photoreceptor cells of retina in eye, photoreceptor rod cells, proprioceptive primary sensory neurons (various types), touch-sensitive primary sensory neurons, taste bud cells, autonomic neuron cells, Schwann cells, satellite cells, glial cells, astrocytes, oligodendrocytes, melanocytes, germ cells, nurse cells, interstitial cells, and pancreatic duct cells. Various cell types may be evaluated for the sample using methods as described herein including, but not limited to, lymphoid cells, stromal cells, stem cells, and myeloid cells. Examples of lymphoid cells include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
In some instances, the cells are stromal cells, for example, mesenchymal stem cells, adipocytes, preadipocytes, stromal cells, fibroblasts, pericytes, endothelial cells, microvascular endothelial cells, lymphatic endothelial cells, smooth muscle cells, chondrocytes, osteoblasts, skeletal muscle cells, and myocytes. Examples of stem cells include, but are not limited to, hematopoietic stem cells, common lymphoid progenitor cells, common myeloid progenitor cells, granulocyte-macrophage progenitor cells, megakaryocyte-erythroid progenitor cells, multipotent progenitor cells, megakaryocytes, erythrocytes, and platelets. Examples of myeloid cells include, but are not limited to, monocytes, macrophages, macrophages M1, macrophages M2, dendritic cells, conventional dendritic cells, plasmacytoid dendritic cells, immature dendritic cells, neutrophils, eosinophils, mast cells, and basophils. Other cell types may be evaluated using methods as described herein, for example, epithelial cells, sebocytes, keratinocytes, mesangial cells, hepatocytes, melanocytes, keratocytes, astrocytes, and neurons.
[0069] In some instances, the sequencing data comprises genes that are differentially expressed by various immune cell types. Examples of immune cells to be detected by methods described herein include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
Terms and Definitions
[0070] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
[0071] As used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
[0072] As used herein, the term “about” in some cases refers to an amount that is approximately the stated amount.
[0073] As used herein, the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.
[0074] As used herein, the term “about” in reference to a percentage refers to an amount that is greater or less than the stated percentage by 10%, 5%, or 1%, including increments therein.
[0075] As used herein, the phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
[0076] The present disclosure employs, unless otherwise indicated, conventional molecular biology techniques, which are within the skill of the art. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art.
[0077] Throughout this disclosure, various embodiments are presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of any embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range to the tenth of the unit of the lower limit unless the context clearly dictates otherwise. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual values within that range, for example, 1.1, 2, 2.3, 5, and 5.9. This applies regardless of the breadth of the range. The upper and lower limits of these intervening ranges may independently be included in the smaller ranges, and are also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure, unless the context clearly dictates otherwise.
[0078] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of any embodiment. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0079] The term “ribonucleic acid” or “RNA,” as used herein refers to a molecule comprising at least one ribonucleotide residue. RNA may include transcripts. By “ribonucleotide” is meant a nucleotide with a hydroxyl group at the 2’ position of a beta-D-ribofuranose moiety. The term RNA includes, but is not limited to, mRNA, ribosomal RNA, tRNA, non-protein-coding RNA (npcRNA), non-messenger RNA, functional RNA (fRNA), long non-coding RNA (lncRNA), pre-mRNAs, and primary miRNAs (pri-miRNAs). The term RNA includes, for example, double-stranded (ds) RNAs; single-stranded RNAs; and isolated RNAs such as partially purified RNA, essentially pure RNA, synthetic RNA, recombinant RNA, as well as altered RNA that differ from naturally-occurring RNA by the addition, deletion, substitution and/or alteration of one or more nucleotides. Such alterations can include addition of non-nucleotide material, such as to the end(s) of the siRNA or internally, for example at one or more nucleotides of the RNA. Nucleotides in the RNA molecules described herein can also comprise non-standard nucleotides, such as non-naturally occurring nucleotides or chemically synthesized nucleotides or deoxynucleotides. These altered RNAs can be referred to as analogs or analogs of naturally-occurring RNA.
[0080] Unless specifically stated or obvious from context, as used herein, the term “about” in reference to a number or range of numbers is understood to mean the stated number and numbers +/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.
[0081] The term “sample,” as used herein, generally refers to a biological sample of a subject. The biological sample may be a tissue or fluid of the subject, such as blood (e.g., whole blood), plasma, serum, urine, saliva, mucosal excretions, sputum, stool and tears. The biological sample may be derived from a tissue or fluid of the subject. The biological sample may be a tumor sample or heterogeneous tissue sample. The biological sample may have or be suspected of having disease tissue. The tissue may be processed to obtain the biological sample. The biological sample may be a cellular sample. The biological sample may be a cell-free (or cell free) sample, such as cell-free DNA or RNA. The biological sample may comprise cancer cells, non-cancer cells, immune cells, non-immune cells, or any combination thereof. The biological sample may be a tissue sample. The biological sample may be a liquid sample. The liquid sample can be a cancer or non-cancer sample. Non-limiting examples of liquid biological samples include synovial fluid, whole blood, blood plasma, lymph, bone marrow, cerebrospinal fluid, serum, seminal fluid, urine, and amniotic fluid.
[0082] The term “variant,” as used herein, generally refers to a genetic variant, such as an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual. Single nucleotide polymorphisms (SNPs) are a form of polymorphisms. In some examples, one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences. Copy number variants (CNVs), transversions and other rearrangements are also forms of genetic variation. A genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
[0083] The term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets. The subject can be a healthy individual, an individual that has or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. The subject can be a patient. The subject may have or be suspected of having a disease.
Computing system
[0084] Referring to FIG. 4, a block diagram is shown depicting an exemplary machine that includes a computer system 400 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure. The components in FIG. 4 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
[0085] Computer system 400 may include one or more processors 401, a memory 403, and a storage 408 that communicate with each other, and with other components, via a bus 440. The bus 440 may also link a display 432, one or more input devices 433 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 434, one or more storage devices 435, and various tangible storage media 436. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 440. For instance, the various tangible storage media 436 can interface with the bus 440 via storage medium interface 426. Computer system 400 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
[0086] Computer system 400 includes one or more processor(s) 401 (e.g., central processing units (CPUs) or general purpose graphics processing units (GPGPUs)) that carry out functions. Processor(s) 401 optionally contains a cache memory unit 402 for temporary local storage of instructions, data, or computer addresses. Processor(s) 401 are configured to assist in execution of computer readable instructions. Computer system 400 may provide functionality for the components depicted in FIG. 4 as a result of the processor(s) 401 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 403, storage 408, storage devices 435, and/or storage medium 436. The computer-readable media may store software that implements particular embodiments, and processor(s) 401 may execute the software. Memory 403 may read the software from one or more other computer-readable media (such as mass storage device(s) 435, 436) or from one or more other sources through a suitable interface, such as network interface 420. The software may cause processor(s) 401 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 403 and modifying the data structures as directed by the software.
[0087] The memory 403 may include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 404) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random access memory (FRAM), phase-change random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 405), and any combinations thereof. ROM 405 may act to communicate data and instructions unidirectionally to processor(s) 401, and RAM 404 may act to communicate data and instructions bidirectionally with processor(s) 401. ROM 405 and RAM 404 may include any suitable tangible computer-readable media described below. In one example, a basic input/output system 406 (BIOS), including basic routines that help to transfer information between elements within computer system 400, such as during start-up, may be stored in the memory 403.
[0088] Fixed storage 408 is connected bidirectionally to processor(s) 401, optionally through storage control unit 407. Fixed storage 408 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein. Storage 408 may be used to store operating system 409, executable(s) 410, data 411, applications 412 (application programs), and the like. Storage 408 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above. Information in storage 408 may, in appropriate cases, be incorporated as virtual memory in memory 403.
[0089] In one example, storage device(s) 435 may be removably interfaced with computer system 400 (e.g., via an external port connector (not shown)) via a storage device interface 425. Particularly, storage device(s) 435 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 400. In one example, software may reside, completely or partially, within a machine-readable medium on storage device(s) 435. In another example, software may reside, completely or partially, within processor(s) 401.
[0090] Bus 440 connects a wide variety of subsystems. Herein, reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate. Bus 440 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures. As an example, and not by way of limitation, such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, HyperTransport (HTX) bus, serial advanced technology attachment (SATA) bus, and any combinations thereof.
[0091] Computer system 400 may also include an input device 433. In one example, a user of computer system 400 may enter commands and/or other information into computer system 400 via input device(s) 433. Examples of an input device(s) 433 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof. In some embodiments, the input device is a Kinect, Leap Motion, or the like. Input device(s) 433 may be interfaced to bus 440 via any of a variety of input interfaces 423 including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
[0092] In particular embodiments, when computer system 400 is connected to network 430, computer system 400 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 430. Communications to and from computer system 400 may be sent through network interface 420. For example, network interface 420 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 430, and computer system 400 may store the incoming communications in memory 403 for processing. Computer system 400 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 403 to be communicated to network 430 from network interface 420. Processor(s) 401 may access these communication packets stored in memory 403 for processing.
[0093] Examples of the network interface 420 include, but are not limited to, a network interface card, a modem, and any combination thereof. Examples of a network 430 or network segment 430 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof. A network, such as network 430, may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
[0094] Information and data can be displayed through a display 432. Examples of a display 432 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light-emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof. The display 432 can interface to the processor(s) 401, memory 403, and fixed storage 408, as well as other devices, such as input device(s) 433, via the bus 440. The display 432 is linked to the bus 440 via a video interface 422, and transport of data between the display 432 and the bus 440 can be controlled via the graphics control 421. In some embodiments, the display is a video projector. In some embodiments, the display is a head-mounted display (HMD) such as a VR headset. In further embodiments, suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like. In still further embodiments, the display is a combination of devices such as those disclosed herein.
[0095] In addition to a display 432, computer system 400 may include one or more other peripheral output devices 434 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof. Such peripheral output devices may be connected to the bus 440 via an output interface 424. Examples of an output interface 424 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
[0096] In addition, or as an alternative, computer system 400 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein. Reference to software in this disclosure may encompass logic, and reference to logic may encompass software. Moreover, reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware, software, or both.
[0097] Those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality.
[0098] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
[0099] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by one or more processor(s), or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
[0100] In accordance with the description herein, suitable computing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles. Those of skill in the art will also recognize that select televisions, video players, and digital music players with optional computer network connectivity are suitable for use in the system described herein. Suitable tablet computers, in various embodiments, include those with booklet, slate, and convertible configurations, known to those of skill in the art.
[0101] In some embodiments, the computing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smartphone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®. Those of skill in the art will also recognize that suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®. Those of skill in the art will also recognize that suitable video game console operating systems include, by way of non-limiting examples, Sony® PS3®, Sony® PS4®, Microsoft® Xbox 360®, Microsoft Xbox One, Nintendo® Wii®, Nintendo® Wii U®, and Ouya®.
Non-transitory computer readable storage medium
[0102] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device. In further embodiments, a computer readable storage medium is a tangible component of a computing device. In still further embodiments, a computer readable storage medium is optionally removable from a computing device. In some embodiments, a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semipermanently, or non-transitorily encoded on the media.
Computer program
[0103] In some embodiments, the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same. A computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
[0104] The functionality of the computer readable instructions may be combined or distributed as desired in various environments. In some embodiments, a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
Software Modules
[0105] In some embodiments, the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same. In view of the disclosure provided herein, software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art. The software modules disclosed herein are implemented in a multitude of ways. In various embodiments, a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof. In various embodiments, the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application. In some embodiments, software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
Databases
[0106] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for storage and retrieval of information. In various embodiments, suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase. In some embodiments, a database is internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing-based. In a particular embodiment, a database is a distributed database. In other embodiments, a database is based on one or more local computer storage devices.
Machine Learning
[0107] In some embodiments, machine learning algorithms are utilized to generate a trained model or classifier configured to process input data comprising a plurality of features and generate an output indicative of a predicted outcome or classification. The plurality of features may include scores based on gene sets, for example, GSEA gene set enrichment scores, although other metrics calculated based on gene sets are also contemplated.
[0108] In some embodiments, the machine learning algorithms herein employ one or more forms of labels including but not limited to human annotated labels and semi-supervised labels. The labels can be indicative of treatment outcomes for cancer patients. In particular, the labels may be indicative of response to immunotherapies. Examples of labels include complete response, partial response, stable disease, and progressive disease as measures of efficacy of a therapeutic intervention for a disease such as cancer.
[0109] In some embodiments, the machine learning algorithm utilizes regression modeling, wherein relationships between predictor variables and dependent variables are determined and weighted. In one embodiment, for example, the predicted outcome (e.g., responsiveness to an immunotherapy) is a dependent variable and is derived from a plurality of biological features such as GSEA enrichment scores.
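By way of non-limiting illustration, the weighting of predictor variables described above can be sketched with a small logistic regression fitted by gradient descent. The feature values, labels, learning rate, and function names below are hypothetical and chosen only for illustration; they are not taken from the disclosure, and this sketch is not the model used in the Examples.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    n = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Predicted probability of the positive class for one sample.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical enrichment scores for two gene sets per sample (made-up values).
X = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.3], [0.2, 0.7], [0.1, 0.9], [0.3, 0.8]]
y = [1, 1, 1, 0, 0, 0]  # 1 = responder, 0 = non-responder (illustrative labels)

w, b = fit_logistic(X, y)
print("weights:", w, "bias:", b)
print("P(response):", predict_proba(w, b, [0.85, 0.15]))
```

In this sketch each fitted weight indicates how strongly, and in which direction, the corresponding gene set score contributes to the predicted outcome.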
[0110] Examples of machine learning algorithms can include a support vector machine (SVM), a naive Bayes classification, a random forest, a neural network, deep learning, principal component analysis (PCA), or other supervised learning algorithm or unsupervised learning algorithm for classification and regression. The machine learning algorithms can be trained using one or more training datasets.
[0111] In some embodiments, a machine learning algorithm uses a supervised learning approach. In supervised learning, the algorithm generates a function from labeled training data. Each training example is a pair consisting of an input object and a desired output value. In some embodiments, an optimal scenario allows for the algorithm to correctly determine the class labels for unseen instances. In some embodiments, a supervised learning algorithm requires the user to determine one or more control parameters. These parameters are optionally adjusted by optimizing performance on a subset, called a validation set, of the training set. After parameter adjustment and learning, the performance of the resulting function is optionally measured on a test set that is separate from the training set. Regression methods are commonly used in supervised learning. Accordingly, supervised learning allows for a model or classifier to be generated or trained with training data in which the expected output is known in advance such as when the ground truth location for a communication is known.
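As a non-limiting sketch of the validation and test procedure just described, the following example tunes one control parameter (the neighbor count k of a simple nearest-neighbor classifier) on a held-out validation portion and then measures performance once on a separate test portion. The classifier, the synthetic one-dimensional data, the split sizes, and the candidate values of k are all assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    # Fraction of samples in `data` that are classified correctly.
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)
# Synthetic 1-D scores: class 1 centred at 1.0, class 0 centred at -1.0.
data = [(rng.gauss(1.0, 1.0), 1) for _ in range(150)]
data += [(rng.gauss(-1.0, 1.0), 0) for _ in range(150)]
rng.shuffle(data)

# Hold out a validation portion and a test portion.
train, validation, test = data[:180], data[180:240], data[240:]

# Choose the control parameter k using the validation portion only.
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Report performance once, on the held-out test portion.
test_accuracy = accuracy(train, test, best_k)
print("selected k:", best_k, "test accuracy:", test_accuracy)
```

Keeping the test portion out of the parameter selection step is what allows the final figure to serve as an estimate of performance on unseen instances.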
[0112] In some embodiments, a machine learning algorithm uses an unsupervised learning approach. In unsupervised learning, the algorithm generates a function to describe hidden structures from unlabeled data (e.g., a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm. Approaches to unsupervised learning include: clustering, anomaly detection, and neural networks.
[0113] In some embodiments, a machine learning algorithm uses a semi-supervised learning approach. Semi-supervised learning combines both labeled and unlabeled data to generate an appropriate function or classifier. Semi-supervised learning is usually used in data augmentation.
[0114] In some embodiments, a machine learning algorithm uses a reinforcement learning approach. In reinforcement learning, the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
[0115] In some embodiments, a machine learning algorithm learns in batches based on the training dataset and other inputs for that batch. In other embodiments, the machine learning algorithm performs on-line learning where the weights and error calculations are constantly updated.
[0116] In some embodiments, a machine learning algorithm uses a transduction approach. Transduction is similar to supervised learning but does not explicitly construct a function. Instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
[0117] In some embodiments, a machine learning algorithm uses a “learning to learn” approach. In learning to learn, the algorithm learns its own inductive bias based on previous experience.
[0118] In some embodiments, a machine learning algorithm is applied to new or updated emergency data to be re-trained to generate a new prediction model. In some embodiments, a machine learning algorithm or model is re-trained periodically. In some embodiments, a machine learning algorithm or model is re-trained non-periodically. In some embodiments, a machine learning algorithm or model is re-trained at least once a day, a week, a month, or a year or more. In some embodiments, a machine learning algorithm or model is re-trained at least once every 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 days or more.
[0119] In some embodiments, a machine learning algorithm is provided with unlabeled or unclassified data for unsupervised learning, which leaves the algorithm to identify hidden structure amongst the cases (e.g., clustering). In some embodiments, unsupervised learning is used to identify the representations that are most useful for classifying raw data (e.g., identifying features that help separate subjects into separate cohorts that may be analyzed using different models and/or evaluated with different thresholds or rules). For example, unsupervised learning is capable of identifying hidden patterns such as relationships between certain features from the data in the knowledge base that would not be readily apparent to a human.
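As one non-limiting illustration of clustering unlabeled data into cohorts, the sketch below applies a basic k-means procedure to a single hypothetical score per subject. The values, the choice of two clusters, and the deterministic initialization are assumptions for illustration only.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Basic k-means on 1-D values; returns (centres, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the centres over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its members (keep it if empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
    return centres, labels

# Hypothetical per-subject scores (made-up values, no outcome labels used).
scores = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90]
centres, labels = kmeans_1d(scores, k=2)
print("centres:", centres)
print("cohort labels:", labels)
```

The resulting cohort labels could then be used, for example, to route subjects to different downstream models or thresholds.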
[0120] In some embodiments, one or more sets of training data are generated and provided to a computer-implemented system comprising one or more algorithms for making predictions. In some embodiments, an algorithm utilizes a predictive model such as a neural network, a decision tree, a support vector machine, or other applicable model. Using the training data, an algorithm is able to form a classifier for generating a classification or prediction according to relevant features. The features selected for classification can be classified using a variety of viable methods. In some embodiments, the trained algorithm comprises a machine learning algorithm. In some embodiments, the machine learning algorithm is selected from at least one of a supervised, semi-supervised and unsupervised learning, such as, for example, a support vector machine (SVM), a Naive Bayes classification, a random forest, an artificial neural network, a decision tree, a K-means, learning vector quantization (LVQ), regression algorithm (e.g., linear, logistic, multivariate), association rule learning, deep learning, dimensionality reduction and ensemble selection algorithms. In some embodiments, the machine learning algorithm is a support vector machine (SVM), a Naive Bayes classification, a random forest, or an artificial neural network. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
EXAMPLES
[0121] The following illustrative examples are representative of embodiments of the software applications, systems, and methods described herein and are not meant to be limiting in any way.
Example 1
[0122] Tumor samples were obtained from subjects having HNSCC (Adkins), bladder cancer, and melanoma. RNA extraction was performed on the tumor samples and used for subsequent library generation using the Lexogen QuantSeq 3’ mRNA-Seq Library Prep Kit FWD for Illumina. The mRNA library was subjected to next generation sequencing using the Illumina NextSeq sequencing platform to generate gene expression data. Single-sample gene set enrichment analysis (ssGSEA) was conducted according to gene sets derived from MSigDB, including KEGG and BioCarta. The 24 gene sets listed in Table 2 were subjected to GSEA to determine scores for each of the gene sets.
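For orientation only, a simplified rank-based enrichment score in the spirit of ssGSEA can be sketched as follows. This is not the implementation used in this Example: the rank weighting exponent of 0.25, the absence of any normalization step, the function name, and the gene names and expression values are all assumptions made for illustration.

```python
def enrichment_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one gene set.

    expression: dict mapping gene name -> expression value for one sample.
    gene_set:   set of gene names.
    Returns the sum, over the ranked gene list, of the difference between
    the rank-weighted cumulative fraction of in-set genes and the
    cumulative fraction of out-of-set genes.
    """
    # Order genes from highest to lowest expression.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    in_set = [g in gene_set for g in ordered]
    n_in = sum(in_set)
    if n_in == 0 or n_in == n:
        raise ValueError("gene set must contain some, but not all, genes")

    # The top gene has rank n; in-set genes are weighted by rank ** alpha.
    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_in)

    score = 0.0
    hit_cdf = 0.0
    miss_cdf = 0.0
    for hit, w in zip(in_set, weights):
        if hit:
            hit_cdf += w / total_weight
        else:
            miss_cdf += miss_step
        score += hit_cdf - miss_cdf
    return score

# Hypothetical expression values for one sample (made-up genes and numbers).
sample = {"GENE_A": 60.0, "GENE_B": 50.0, "GENE_C": 40.0,
          "GENE_D": 30.0, "GENE_E": 20.0, "GENE_F": 10.0}
top_score = enrichment_score(sample, {"GENE_A", "GENE_B"})
bottom_score = enrichment_score(sample, {"GENE_E", "GENE_F"})
print(top_score, bottom_score)
```

A gene set whose members are among the most highly expressed genes in the sample receives a positive score, and a gene set whose members are among the least expressed receives a negative score. Computing one such score per gene set yields one feature per gene set for each sample.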
[0123] The ssGSEA analysis produced a set of 24 enrichment scores for the 24 corresponding gene sets for the HNSCC tumor samples. These 24 enrichment scores of the tumor samples were used to train a machine learning model using linear principal component analysis (PCA) and support vector machine (SVM) methods in order to predict objective response and survival.
[0124] The trained model (the “ssGSEA biomarker model”) was then evaluated for ability to predict treatment outcome. As shown in FIG. 1, the model was evaluated using an Out Of Bag Receiver Operating Characteristic (OOB ROC) analysis, which is a way to estimate model performance on untrained datasets. The Area Under the Curve (AUC) of the ROC curve for the model was 0.85, indicating that the model performs well (high true positive rate and low false positive rate) at predicting treatment outcome.
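As a non-limiting illustration of how an AUC value can be computed from model scores and known outcomes, the sketch below uses the pairwise interpretation of the AUC (the fraction of positive/negative pairs in which the positive sample receives the higher score). It does not reproduce the out-of-bag resampling described above, and the scores and labels are made-up values.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative; tied scores count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and outcomes (1 = responder, 0 = non-responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)  # 14 of the 16 positive/negative pairs are ordered correctly: 0.875
```

An AUC of 0.5 corresponds to chance-level ranking and an AUC of 1.0 to perfect separation of the two outcome groups.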
[0125] FIG. 2 is a plot showing the mean scores of individual samples in the training set (on average across OOB samplings). These data show a 96% negative predictive value (NPV) and 93% sensitivity (SN).
[0126] If the treat or no-treat decision was based on the median score being used as the demarcation (e.g., if a patient sample’s score is below the median score, the patient will not receive a treatment, and if a patient sample’s score is above the median score, the patient will receive a treatment), the ssGSEA biomarker model applied to the training set has the performance shown in Table 4.
Table 4
[Table 4: image imgf000067_0001 in the original publication]
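The median-based decision rule and the NPV and sensitivity metrics referred to above can be illustrated with the following non-limiting sketch. The scores and outcomes are made-up values, the function names are hypothetical, and treating a score exactly equal to the median as "no treatment" is an assumption, since the text specifies only scores below and above the median.

```python
from statistics import median

def treat_decisions(scores):
    """Return (decisions, cutoff); a decision is True when score > median."""
    cutoff = median(scores)
    return [s > cutoff for s in scores], cutoff

def npv_and_sensitivity(decisions, responded):
    """NPV = TN / (TN + FN); sensitivity = TP / (TP + FN)."""
    tp = sum(1 for d, r in zip(decisions, responded) if d and r)
    fn = sum(1 for d, r in zip(decisions, responded) if not d and r)
    tn = sum(1 for d, r in zip(decisions, responded) if not d and not r)
    return tn / (tn + fn), tp / (tp + fn)

# Hypothetical model scores and observed responses (made-up values).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
responded = [False, False, False, True, True, False, True, True]

decisions, cutoff = treat_decisions(scores)
npv, sensitivity = npv_and_sensitivity(decisions, responded)
print("cutoff:", cutoff, "NPV:", npv, "sensitivity:", sensitivity)
```

Here NPV is the fraction of "no treatment" calls that were true non-responders, and sensitivity is the fraction of responders that received a "treat" call.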
[0127] Physicians can use the ssGSEA biomarker model in future clinical decision making by considering the disease control rate (DCR). The DCR is the percentage of patients who had a treatment response (e.g., patients who achieved complete response, partial response, or stable disease to treatment) and is similar to “likelihood of response”. Here, the DCR of HNSCC patients in response to immuno-oncology (I/O) treatment is considered. As shown in FIG. 3, the output scores were grouped into four quartiles Q1, Q2, Q3, and Q4, with Q1 having the lowest 25% of scores and Q4 having the highest 25% of scores. The lower the score, the lower the anticipated benefit of the drug, as evidenced by the correlation between quartile and DCR. In FIG. 3, the Q1 and Q2 divisions show a low DCR (less than 10%), whereas Q3 and Q4 have a high DCR (greater than about 40%).
[0128] The expected DCR in response to I/O treatment for HNSCC patients is about 30%. Therefore, if a patient’s sample has a high score and the score falls into Q3 or Q4, physicians may recommend I/O treatment, as HNSCC patients in these categories have a DCR in response to I/O of greater than about 40%.
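The quartile grouping and per-quartile DCR described above can be sketched as follows. The scores and outcomes are made-up values that do not reproduce FIG. 3, and splitting the ranked samples into four equal-sized groups is one assumed way of forming quartiles.

```python
def dcr_by_quartile(scores, disease_control):
    """Rank samples by score, split them into four equal-sized groups, and
    return the fraction with disease control in each group.

    disease_control[i] is True when sample i achieved complete response,
    partial response, or stable disease. Requires at least four samples.
    """
    ranked = sorted(zip(scores, disease_control), key=lambda p: p[0])
    n = len(ranked)
    result = {}
    for q in range(4):
        group = ranked[q * n // 4:(q + 1) * n // 4]
        controlled = sum(1 for _, dc in group if dc)
        result[f"Q{q + 1}"] = controlled / len(group)
    return result

# Hypothetical model scores and disease control outcomes (made-up values).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
disease_control = [False, False, False, False, True, False, True, True]

dcr = dcr_by_quartile(scores, disease_control)
print(dcr)  # {'Q1': 0.0, 'Q2': 0.0, 'Q3': 0.5, 'Q4': 1.0}
```

A new sample's score can then be placed in one of the four ranges, and the DCR observed for that quartile can be compared against the expected DCR for the patient population.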
[0129] As compared to models that use features corresponding to individual genetic biomarkers, this approach of using gene sets has demonstrated surprisingly accurate performance across multiple cancer types such as HNSCC. An HNSCC ssGSEA biomarker model achieved superior results compared to a clinically used biomarker PD-L1 model (FIG. 6).
[0130] When compared to other literature methods, the instant methods perform as well or better, as shown in Table 1. Moreover, this technique is effective across multiple cancer types. The results are shown in Table 5.
Table 5
[Table 5: image imgf000067_0002 in the original publication]
[0131] While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure.
EMBODIMENTS
[0132] In some cases, the present disclosure provides a method according to the following embodiments:
[0133] Embodiment 1. A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features, thereby generating an output; and generating a determination indicative of a treatment outcome based on the output.
[0134] Embodiment 2. The method of embodiment 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
[0135] Embodiment 3. The method of embodiment 2, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, interferon gamma, antigen presentation, T-cell exhaustion, or any combination thereof.
[0136] Embodiment 4. The method of embodiment 2, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
[0137] Embodiment 5. The method of embodiment 2, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets from a molecular signature database (MSigDB).
[0138] Embodiment 6. The method of embodiment 5, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0139] Embodiment 7. The method of embodiment 1, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0140] Embodiment 8. The method of embodiment 1, further comprising obtaining the biological sample of said subject.
[0141] Embodiment 9. The method of embodiment 8, wherein said biological sample is a solid tumor or liquid biopsy.
[0142] Embodiment 10. The method of embodiment 8, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
[0143] Embodiment 11. The method of embodiment 8, wherein said biological sample comprises cancer tissue.
[0144] Embodiment 12. The method of embodiment 11, wherein said cancer tissue comprises tumor-infiltrating immune cells.
[0145] Embodiment 13. The method of embodiment 11, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0146] Embodiment 14. The method of embodiment 1, further comprising processing said biological sample to prevent or inhibit tissue degradation.
[0147] Embodiment 15. The method of embodiment 14, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0148] Embodiment 16. The method of embodiment 1, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
[0149] Embodiment 17. The method of embodiment 16, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0150] Embodiment 18. The method of embodiment 1, wherein said disease or condition is cancer.
[0151] Embodiment 19. The method of embodiment 18, wherein said cancer is a solid cancer or a hematopoietic cancer.
[0152] Embodiment 20. The method of embodiment 18, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0153] Embodiment 21. The method of embodiment 20, further comprising selecting said subject for prediction of said treatment outcome based on said status.
[0154] Embodiment 22. The method of embodiment 21, wherein said treatment outcome corresponds to one or more cancer treatments.
[0155] Embodiment 23. The method of embodiment 22, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
[0156] Embodiment 24. The method of embodiment 22, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
[0157] Embodiment 25. The method of embodiment 24, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
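The flow recited in Embodiment 1 (gene expression data, then per-gene-set enrichment metrics, then a model output, then a determination) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the gene identifiers, the two gene sets, the model weights, and the 0.5 threshold are hypothetical and are not taken from Table 1 or from any model described in the specification. The enrichment metric shown is an unweighted running-sum score of the kind used in classic gene set enrichment analysis; the specification may contemplate other metrics.

```python
import math


def enrichment_score(expression, gene_set):
    """Unweighted running-sum enrichment score for one gene set in one sample.

    Genes are ranked by expression (highest first). The score is the signed
    maximum deviation of the running sum from zero and lies between -1 and 1.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # no meaningful score without both hits and misses
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up at genes in the set, step down at genes outside it.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def score_gene_sets(expression, gene_sets):
    """Return one enrichment metric per named gene set."""
    return {name: enrichment_score(expression, genes)
            for name, genes in gene_sets.items()}


def determine_outcome(metrics, weights, bias=0.0, threshold=0.5):
    """Apply a (hypothetical) logistic model to the metrics and label the result."""
    z = bias + sum(weights[name] * value for name, value in metrics.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    label = "likely responder" if probability >= threshold else "likely non-responder"
    return probability, label


# Hypothetical expression values for a single sample.
expression = {"GENE_A": 10.0, "GENE_B": 9.0, "GENE_C": 1.0, "GENE_D": 0.5}

# Hypothetical gene sets standing in for gene sets linked to biological features.
gene_sets = {
    "signature_1": {"GENE_A", "GENE_B"},
    "signature_2": {"GENE_C", "GENE_D"},
}

# Hypothetical weights standing in for a previously trained model.
weights = {"signature_1": 2.0, "signature_2": -1.0}

metrics = score_gene_sets(expression, gene_sets)
probability, label = determine_outcome(metrics, weights)
print(metrics, round(probability, 4), label)
```

In this toy sample the two genes of `signature_1` sit at the top of the ranking and those of `signature_2` at the bottom, so the metrics take the extreme values 1.0 and -1.0. Real data would use thousands of genes and produce intermediate scores.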
[0158] Embodiment 26. A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
[0159] Embodiment 27. The method of embodiment 26, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
[0160] Embodiment 28. The method of embodiment 26, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
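The training method of Embodiments 26 to 28 can likewise be sketched under stated assumptions. The sketch below assumes that enrichment metrics have already been computed for each pre-treatment sample and that each subject carries a binary post-treatment classification (1 for response, 0 for no response). Logistic regression fitted by batch gradient descent stands in for the unspecified "machine learning algorithm"; the metric values, the feature names, the class `OutcomePredictor`, and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs=500, learning_rate=0.5):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n_samples = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of the log loss with respect to the logit
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


class OutcomePredictor:
    """Hypothetical software module wrapping the trained model."""

    def __init__(self, feature_names, weights, bias):
        self.feature_names = list(feature_names)
        self.weights = list(weights)
        self.bias = bias

    def predict(self, metrics):
        """Return the predicted probability of response for one sample."""
        x = [metrics[name] for name in self.feature_names]
        return sigmoid(self.bias + sum(w * v for w, v in zip(self.weights, x)))


# Hypothetical training set: one row of enrichment metrics per pre-treatment
# sample, with the outcome classification assigned after treatment.
feature_names = ["signature_1", "signature_2"]
training_rows = [[0.8, -0.6], [0.7, -0.4], [-0.5, 0.6], [-0.7, 0.3]]
training_labels = [1, 1, 0, 0]

weights, bias = train_logistic(training_rows, training_labels)
predictor = OutcomePredictor(feature_names, weights, bias)

new_sample = {"signature_1": 0.9, "signature_2": -0.5}
print(round(predictor.predict(new_sample), 3))
```

Keeping the fitted parameters in a small wrapper object mirrors the idea in Embodiment 28 of configuring a computing system with a software module that holds the trained model. A practical implementation would also hold out data for validation, which this sketch omits.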
[0161] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

WHAT IS CLAIMED IS:
1. A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
2. The method of claim 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
3. The method of claim 1, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
4. The method of claim 1, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
5. The method of claim 1, wherein the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
6. The method of claim 5, wherein the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
7. The method of claim 1, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
8. The method of claim 7, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
9. The method of any one of claims 1 to 8, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
10. The method of any one of claims 1 to 9, further comprising obtaining the biological sample of said subject.
11. The method of claim 10, wherein said biological sample is a solid tumor or liquid biopsy.
12. The method of claim 10, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
13. The method of claim 10, wherein said biological sample comprises cancer tissue.
14. The method of claim 13, wherein said cancer tissue comprises tumor-infiltrating immune cells.
15. The method of claim 13, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
16. The method of any one of claims 1 to 15, further comprising processing said biological sample to prevent or inhibit tissue degradation.
17. The method of claim 16, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
18. The method of any one of claims 1 to 17, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
19. The method of claim 18, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
20. The method of claim 1, wherein said disease or condition is cancer.
21. The method of claim 20, wherein said cancer is a solid cancer or a hematopoietic cancer.
22. The method of claim 20, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
23. The method of claim 22, further comprising selecting said subject for prediction of said treatment outcome based on said status.
24. The method of claim 23, wherein said treatment outcome corresponds to one or more cancer treatments.
25. The method of claim 24, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
26. The method of claim 24, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
27. The method of claim 26, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
28. The method of claim 27, further comprising selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
29. The method of any one of claims 1 to 28, wherein said subject is treated based at least on said determination indicative of said treatment outcome.
30. The method of any one of claims 1 to 29, wherein said subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
31. A computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising a processor and non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generate a determination indicative of the treatment outcome based on the output.
32. The system of claim 31, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
33. The system of claim 31 or 32, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
34. The system of any one of claims 31 to 33, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
35. The system of any one of claims 31 to 34, wherein the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
36. The system of any one of claims 31 to 35, wherein the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
37. The system of any one of claims 31 to 36, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
38. The system of any one of claims 31 to 37, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
39. The system of any one of claims 31 to 38, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
40. The system of any one of claims 31 to 39, wherein the processor is configured to obtain the gene expression data for the biological sample of said subject from a database.
41. The system of any one of claims 31 to 40, wherein said biological sample is a solid tumor or liquid biopsy.
42. The system of any one of claims 31 to 41, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
43. The system of any one of claims 31 to 42, wherein said biological sample comprises cancer tissue.
44. The system of claim 43, wherein said cancer tissue comprises tumor-infiltrating immune cells.
45. The system of claim 43, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
46. The system of any one of claims 43 to 45, wherein said biological sample is processed to prevent or inhibit tissue degradation.
47. The system of any one of claims 31 to 46, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
48. The system of any one of claims 31 to 47, wherein the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data.
49. The system of claim 48, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
50. The system of any one of claims 31 to 49, wherein said disease or condition is cancer.
51. The system of claim 50, wherein said cancer is a solid cancer or a hematopoietic cancer.
52. The system of claim 50, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
53. The system of claim 52, wherein said subject is selected for prediction of said treatment outcome based on said status.
54. The system of claim 53, wherein said treatment outcome corresponds to one or more cancer treatments.
55. The system of claim 54, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
56. The system of any one of claims 31 to 55, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
57. The system of claim 56, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
58. The system of any one of claims 31 to 57, wherein said subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
59. The system of any one of claims 31 to 58, wherein said subject is treated based at least on said determination indicative of said treatment outcome.
60. The system of any one of claims 31 to 59, wherein said subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
61. A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
62. The method of claim 61, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
63. The method of claim 61 or 62, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
PCT/US2023/023681 2022-05-27 2023-05-26 Machine learning systems and methods for gene set enrichment analysis and scoring WO2023230321A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263346718P 2022-05-27 2022-05-27
US63/346,718 2022-05-27

Publications (1)

Publication Number Publication Date
WO2023230321A1 (en) 2023-11-30

Family

ID=88919965

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/023681 WO2023230321A1 (en) 2022-05-27 2023-05-26 Machine learning systems and methods for gene set enrichment analysis and scoring

Country Status (1)

Country Link
WO (1) WO2023230321A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006103442A2 (en) * 2005-04-01 2006-10-05 Ncc Technology Ventures Pte. Ltd. Materials and methods relating to breast cancer classification
WO2019109089A1 (en) * 2017-12-01 2019-06-06 Illumina, Inc. Systems and methods for assessing drug efficacy
JP2020178667A (en) * 2019-04-26 2020-11-05 国立大学法人 東京大学 Prediction method of effect and prognosis of cancer treatment, and selection method of treatment means
WO2021092224A1 (en) * 2019-11-05 2021-05-14 Cofactor Genomics, Inc. Methods and systems of processing complex data sets using artificial intelligence and deconvolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZYLA, J. ET AL.: "Ranking metrics in gene set enrichment analysis: do they matter?", BMC BIOINFORMATICS, vol. 18, 2017, pages 1 - 12, XP021244998, DOI: 10.1186/s12859-017-1674-0 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23812625

Country of ref document: EP

Kind code of ref document: A1