WO2023230321A1 - Machine learning systems and methods for gene set enrichment analysis and scoring
- Publication number
- WO2023230321A1 (PCT/US2023/023681)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gene
- cancer
- gene sets
- treatment
- cells
- 238000000034 method Methods 0.000 title claims abstract description 129
- 238000010801 machine learning Methods 0.000 title claims abstract description 68
- 238000010199 gene set enrichment analysis Methods 0.000 title claims abstract description 34
- 238000011282 treatment Methods 0.000 claims abstract description 139
- 206010028980 Neoplasm Diseases 0.000 claims abstract description 117
- 201000011510 cancer Diseases 0.000 claims abstract description 97
- 230000014509 gene expression Effects 0.000 claims abstract description 59
- 108090000623 proteins and genes Proteins 0.000 claims description 202
- 239000000523 sample Substances 0.000 claims description 91
- 239000012472 biological sample Substances 0.000 claims description 71
- 210000004027 cell Anatomy 0.000 claims description 68
- 238000004422 calculation algorithm Methods 0.000 claims description 56
- 238000012163 sequencing technique Methods 0.000 claims description 44
- 230000004044 response Effects 0.000 claims description 33
- 210000001519 tissue Anatomy 0.000 claims description 30
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims description 28
- 201000010099 disease Diseases 0.000 claims description 26
- 238000012549 training Methods 0.000 claims description 23
- 238000012545 processing Methods 0.000 claims description 20
- 210000002865 immune cell Anatomy 0.000 claims description 19
- 108091032973 (ribonucleotides)n+m Proteins 0.000 claims description 16
- 238000009169 immunotherapy Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 14
- 210000001185 bone marrow Anatomy 0.000 claims description 13
- 210000000130 stem cell Anatomy 0.000 claims description 12
- 238000002560 therapeutic procedure Methods 0.000 claims description 11
- 210000004369 blood Anatomy 0.000 claims description 10
- 239000008280 blood Substances 0.000 claims description 10
- 238000002512 chemotherapy Methods 0.000 claims description 10
- 238000001794 hormone therapy Methods 0.000 claims description 10
- 108020004999 messenger RNA Proteins 0.000 claims description 10
- 238000001959 radiotherapy Methods 0.000 claims description 10
- 208000016691 refractory malignant neoplasm Diseases 0.000 claims description 10
- 238000001356 surgical procedure Methods 0.000 claims description 10
- 238000002626 targeted therapy Methods 0.000 claims description 10
- 239000012530 fluid Substances 0.000 claims description 9
- 210000002381 plasma Anatomy 0.000 claims description 9
- 238000007481 next generation sequencing Methods 0.000 claims description 8
- 230000036961 partial effect Effects 0.000 claims description 8
- 210000002966 serum Anatomy 0.000 claims description 8
- 210000002700 urine Anatomy 0.000 claims description 8
- 210000004381 amniotic fluid Anatomy 0.000 claims description 7
- 210000001175 cerebrospinal fluid Anatomy 0.000 claims description 7
- 238000011156 evaluation Methods 0.000 claims description 7
- 210000002751 lymph Anatomy 0.000 claims description 7
- 210000003296 saliva Anatomy 0.000 claims description 7
- 210000001138 tear Anatomy 0.000 claims description 7
- 101150084750 1 gene Proteins 0.000 claims description 6
- 101150028074 2 gene Proteins 0.000 claims description 6
- 101150094083 24 gene Proteins 0.000 claims description 6
- 101150090724 3 gene Proteins 0.000 claims description 6
- 101150033839 4 gene Proteins 0.000 claims description 6
- 101150096316 5 gene Proteins 0.000 claims description 6
- 101150039504 6 gene Proteins 0.000 claims description 6
- 206010003445 Ascites Diseases 0.000 claims description 6
- 238000003559 RNA-seq method Methods 0.000 claims description 6
- 210000000941 bile Anatomy 0.000 claims description 6
- 230000001900 immune effect Effects 0.000 claims description 6
- 230000004879 molecular function Effects 0.000 claims description 6
- 230000003990 molecular pathway Effects 0.000 claims description 6
- 231100000590 oncogenic Toxicity 0.000 claims description 6
- 230000002246 oncogenic effect Effects 0.000 claims description 6
- 210000000056 organ Anatomy 0.000 claims description 6
- 230000001105 regulatory effect Effects 0.000 claims description 6
- 239000007787 solid Substances 0.000 claims description 6
- 101150072531 10 gene Proteins 0.000 claims description 5
- 101150000874 11 gene Proteins 0.000 claims description 5
- 101150066838 12 gene Proteins 0.000 claims description 5
- 101150025032 13 gene Proteins 0.000 claims description 5
- 101150082072 14 gene Proteins 0.000 claims description 5
- 101150029062 15 gene Proteins 0.000 claims description 5
- 101150076401 16 gene Proteins 0.000 claims description 5
- 101150016096 17 gene Proteins 0.000 claims description 5
- 101150078635 18 gene Proteins 0.000 claims description 5
- 101150040471 19 gene Proteins 0.000 claims description 5
- 101150098072 20 gene Proteins 0.000 claims description 5
- 101150042997 21 gene Proteins 0.000 claims description 5
- 101150092328 22 gene Proteins 0.000 claims description 5
- 101150029857 23 gene Proteins 0.000 claims description 5
- 101150101112 7 gene Proteins 0.000 claims description 5
- 101150044182 8 gene Proteins 0.000 claims description 5
- 101150106774 9 gene Proteins 0.000 claims description 5
- 206010067484 Adverse reaction Diseases 0.000 claims description 5
- 206010061218 Inflammation Diseases 0.000 claims description 5
- 102000008070 Interferon-gamma Human genes 0.000 claims description 5
- 108010074328 Interferon-gamma Proteins 0.000 claims description 5
- 206010070308 Refractory cancer Diseases 0.000 claims description 5
- 210000001744 T-lymphocyte Anatomy 0.000 claims description 5
- 230000006838 adverse reaction Effects 0.000 claims description 5
- 230000030741 antigen processing and presentation Effects 0.000 claims description 5
- 230000031018 biological processes and functions Effects 0.000 claims description 5
- 230000003013 cytotoxicity Effects 0.000 claims description 5
- 231100000135 cytotoxicity Toxicity 0.000 claims description 5
- 230000004547 gene signature Effects 0.000 claims description 5
- 201000005787 hematologic cancer Diseases 0.000 claims description 5
- 208000024200 hematopoietic and lymphoid system neoplasm Diseases 0.000 claims description 5
- 230000004054 inflammatory process Effects 0.000 claims description 5
- 229960003130 interferon gamma Drugs 0.000 claims description 5
- 238000011528 liquid biopsy Methods 0.000 claims description 5
- 238000002493 microarray Methods 0.000 claims description 5
- 238000003753 real-time PCR Methods 0.000 claims description 5
- 230000008354 tissue degradation Effects 0.000 claims description 5
- 230000008595 infiltration Effects 0.000 claims description 4
- 238000001764 infiltration Methods 0.000 claims description 4
- 230000011664 signaling Effects 0.000 claims description 4
- 229920002477 rna polymer Polymers 0.000 description 58
- 102000039446 nucleic acids Human genes 0.000 description 24
- 108020004707 nucleic acids Proteins 0.000 description 24
- 150000007523 nucleic acids Chemical class 0.000 description 24
- 239000002773 nucleotide Substances 0.000 description 22
- 125000003729 nucleotide group Chemical group 0.000 description 22
- 239000002609 medium Substances 0.000 description 16
- 230000000670 limiting effect Effects 0.000 description 15
- 239000000090 biomarker Substances 0.000 description 14
- 238000013459 approach Methods 0.000 description 12
- 238000004590 computer program Methods 0.000 description 12
- 208000000102 Squamous Cell Carcinoma of Head and Neck Diseases 0.000 description 11
- 201000000459 head and neck squamous cell carcinoma Diseases 0.000 description 11
- 238000012706 support-vector machine Methods 0.000 description 11
- 230000006870 function Effects 0.000 description 9
- 238000000605 extraction Methods 0.000 description 8
- 238000012360 testing method Methods 0.000 description 8
- 102100024216 Programmed cell death 1 ligand 1 Human genes 0.000 description 7
- 238000004891 communication Methods 0.000 description 7
- 230000035945 sensitivity Effects 0.000 description 7
- 108010074708 B7-H1 Antigen Proteins 0.000 description 6
- 108020004414 DNA Proteins 0.000 description 6
- 102000053602 DNA Human genes 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 6
- 210000003734 kidney Anatomy 0.000 description 6
- 238000000513 principal component analysis Methods 0.000 description 6
- 102000003960 Ligases Human genes 0.000 description 5
- 108090000364 Ligases Proteins 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 5
- 210000000254 ciliated cell Anatomy 0.000 description 5
- 210000002919 epithelial cell Anatomy 0.000 description 5
- 210000002950 fibroblast Anatomy 0.000 description 5
- 230000037361 pathway Effects 0.000 description 5
- 210000002363 skeletal muscle cell Anatomy 0.000 description 5
- 102000040650 (ribonucleotides)n+m Human genes 0.000 description 4
- 102000008158 DNA Ligase ATP Human genes 0.000 description 4
- 108010060248 DNA Ligase ATP Proteins 0.000 description 4
- LFQSCWFLJHTTHZ-UHFFFAOYSA-N Ethanol Chemical compound CCO LFQSCWFLJHTTHZ-UHFFFAOYSA-N 0.000 description 4
- 241001465754 Metazoa Species 0.000 description 4
- 238000007792 addition Methods 0.000 description 4
- 230000004075 alteration Effects 0.000 description 4
- 210000003719 b-lymphocyte Anatomy 0.000 description 4
- 230000008901 benefit Effects 0.000 description 4
- 238000012217 deletion Methods 0.000 description 4
- 230000037430 deletion Effects 0.000 description 4
- 230000002068 genetic effect Effects 0.000 description 4
- 239000007788 liquid Substances 0.000 description 4
- 210000001806 memory b lymphocyte Anatomy 0.000 description 4
- 210000004296 naive t lymphocyte Anatomy 0.000 description 4
- 210000000822 natural killer cell Anatomy 0.000 description 4
- 230000002093 peripheral effect Effects 0.000 description 4
- 238000007637 random forest analysis Methods 0.000 description 4
- 210000003289 regulatory T cell Anatomy 0.000 description 4
- PEDCQBHIVMGVHV-UHFFFAOYSA-N Glycerine Chemical compound OCC(O)CO PEDCQBHIVMGVHV-UHFFFAOYSA-N 0.000 description 3
- KFZMGEQAYNKOFK-UHFFFAOYSA-N Isopropanol Chemical compound CC(C)O KFZMGEQAYNKOFK-UHFFFAOYSA-N 0.000 description 3
- OKKJLVBELUTLKV-UHFFFAOYSA-N Methanol Chemical compound OC OKKJLVBELUTLKV-UHFFFAOYSA-N 0.000 description 3
- DZBUGLKDJFMEHC-UHFFFAOYSA-N acridine Chemical compound C1=CC=CC2=CC3=CC=CC=C3N=C21 DZBUGLKDJFMEHC-UHFFFAOYSA-N 0.000 description 3
- 210000004413 cardiac myocyte Anatomy 0.000 description 3
- ZYGHJZDHTFUPRJ-UHFFFAOYSA-N coumarin Chemical compound C1=CC=C2OC(=O)C=CC2=C1 ZYGHJZDHTFUPRJ-UHFFFAOYSA-N 0.000 description 3
- 239000000975 dye Substances 0.000 description 3
- 230000007705 epithelial mesenchymal transition Effects 0.000 description 3
- 230000002519 immunomodulatory effect Effects 0.000 description 3
- 238000003780 insertion Methods 0.000 description 3
- 230000037431 insertion Effects 0.000 description 3
- 239000004973 liquid crystal related substance Substances 0.000 description 3
- 210000002540 macrophage Anatomy 0.000 description 3
- 210000000110 microvilli Anatomy 0.000 description 3
- 230000000869 mutational effect Effects 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 102000054765 polymorphisms of proteins Human genes 0.000 description 3
- 210000002248 primary sensory neuron Anatomy 0.000 description 3
- 210000002345 respiratory system Anatomy 0.000 description 3
- 210000002536 stromal cell Anatomy 0.000 description 3
- 238000006467 substitution reaction Methods 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- QCPFFGGFHNZBEP-UHFFFAOYSA-N 4,5,6,7-tetrachloro-3',6'-dihydroxyspiro[2-benzofuran-3,9'-xanthene]-1-one Chemical compound O1C(=O)C(C(=C(Cl)C(Cl)=C2Cl)Cl)=C2C21C1=CC=C(O)C=C1OC1=CC(O)=CC=C21 QCPFFGGFHNZBEP-UHFFFAOYSA-N 0.000 description 2
- SJQRQOKXQKVJGJ-UHFFFAOYSA-N 5-(2-aminoethylamino)naphthalene-1-sulfonic acid Chemical compound C1=CC=C2C(NCCN)=CC=CC2=C1S(O)(=O)=O SJQRQOKXQKVJGJ-UHFFFAOYSA-N 0.000 description 2
- 229920001621 AMOLED Polymers 0.000 description 2
- 210000004366 CD4-positive T-lymphocyte Anatomy 0.000 description 2
- HEDRZPFGACZZDS-UHFFFAOYSA-N Chloroform Chemical compound ClC(Cl)Cl HEDRZPFGACZZDS-UHFFFAOYSA-N 0.000 description 2
- IAZDPXIOMUYVGZ-UHFFFAOYSA-N Dimethylsulphoxide Chemical compound CS(C)=O IAZDPXIOMUYVGZ-UHFFFAOYSA-N 0.000 description 2
- 102000004190 Enzymes Human genes 0.000 description 2
- 108090000790 Enzymes Proteins 0.000 description 2
- LYCAIKOWRPUZTN-UHFFFAOYSA-N Ethylene glycol Chemical compound OCCO LYCAIKOWRPUZTN-UHFFFAOYSA-N 0.000 description 2
- WSFSSNUMVMOOMR-UHFFFAOYSA-N Formaldehyde Chemical compound O=C WSFSSNUMVMOOMR-UHFFFAOYSA-N 0.000 description 2
- SIKJAQJRHWYJAI-UHFFFAOYSA-N Indole Chemical compound C1=CC=C2NC=CC2=C1 SIKJAQJRHWYJAI-UHFFFAOYSA-N 0.000 description 2
- UFWIBTONFRDIAS-UHFFFAOYSA-N Naphthalene Chemical compound C1=CC=CC2=CC=CC=C21 UFWIBTONFRDIAS-UHFFFAOYSA-N 0.000 description 2
- 210000004241 Th2 cell Anatomy 0.000 description 2
- 210000001789 adipocyte Anatomy 0.000 description 2
- 210000004100 adrenal gland Anatomy 0.000 description 2
- 230000003321 amplification Effects 0.000 description 2
- MWPLVEDNUUSJAV-UHFFFAOYSA-N anthracene Chemical compound C1=CC=CC2=CC3=CC=CC=C3C=C21 MWPLVEDNUUSJAV-UHFFFAOYSA-N 0.000 description 2
- 210000001130 astrocyte Anatomy 0.000 description 2
- IOJUPLGTWVMSFF-UHFFFAOYSA-N benzothiazole Chemical compound C1=CC=C2SC=NC2=C1 IOJUPLGTWVMSFF-UHFFFAOYSA-N 0.000 description 2
- 230000015572 biosynthetic process Effects 0.000 description 2
- 230000008777 canonical pathway Effects 0.000 description 2
- 239000000298 carbocyanine Substances 0.000 description 2
- 210000002777 columnar cell Anatomy 0.000 description 2
- 238000004883 computer application Methods 0.000 description 2
- 235000001671 coumarin Nutrition 0.000 description 2
- 210000001151 cytotoxic T lymphocyte Anatomy 0.000 description 2
- 238000003066 decision tree Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 210000004443 dendritic cell Anatomy 0.000 description 2
- 230000001419 dependent effect Effects 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 208000035475 disorder Diseases 0.000 description 2
- 239000012636 effector Substances 0.000 description 2
- 210000001062 endolymphatic sac Anatomy 0.000 description 2
- 210000004696 endometrium Anatomy 0.000 description 2
- 230000003511 endothelial effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 210000003743 erythrocyte Anatomy 0.000 description 2
- VYXSBFYARXAAKO-UHFFFAOYSA-N ethyl 2-[3-(ethylamino)-6-ethylimino-2,7-dimethylxanthen-9-yl]benzoate;hydron;chloride Chemical compound [Cl-].C1=2C=C(C)C(NCC)=CC=2OC2=CC(=[NH+]CC)C(C)=CC2=C1C1=CC=CC=C1C(=O)OCC VYXSBFYARXAAKO-UHFFFAOYSA-N 0.000 description 2
- GNBHRKFJIUUOQI-UHFFFAOYSA-N fluorescein Chemical compound O1C(=O)C2=CC=CC=C2C21C1=CC=C(O)C=C1OC1=CC(O)=CC=C21 GNBHRKFJIUUOQI-UHFFFAOYSA-N 0.000 description 2
- 238000007672 fourth generation sequencing Methods 0.000 description 2
- 210000001035 gastrointestinal tract Anatomy 0.000 description 2
- 210000004907 gland Anatomy 0.000 description 2
- 210000003494 hepatocyte Anatomy 0.000 description 2
- 210000004185 liver Anatomy 0.000 description 2
- 230000033001 locomotion Effects 0.000 description 2
- 210000004698 lymphocyte Anatomy 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 210000002752 melanocyte Anatomy 0.000 description 2
- 210000003071 memory t lymphocyte Anatomy 0.000 description 2
- 210000003584 mesangial cell Anatomy 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 210000001616 monocyte Anatomy 0.000 description 2
- 210000004699 muscle spindle Anatomy 0.000 description 2
- 108091008709 muscle spindles Proteins 0.000 description 2
- 210000000066 myeloid cell Anatomy 0.000 description 2
- 210000004160 naive b lymphocyte Anatomy 0.000 description 2
- 238000003199 nucleic acid amplification method Methods 0.000 description 2
- 210000003101 oviduct Anatomy 0.000 description 2
- 239000013610 patient sample Substances 0.000 description 2
- 210000003668 pericyte Anatomy 0.000 description 2
- QWYZFXLSWMXLDM-UHFFFAOYSA-M pinacyanol iodide Chemical compound [I-].C1=CC2=CC=CC=C2N(CC)C1=CC=CC1=CC=C(C=CC=C2)C2=[N+]1CC QWYZFXLSWMXLDM-UHFFFAOYSA-M 0.000 description 2
- 210000004180 plasmocyte Anatomy 0.000 description 2
- 108091007428 primary miRNA Proteins 0.000 description 2
- 210000002307 prostate Anatomy 0.000 description 2
- APTZNLHMIGJTEW-UHFFFAOYSA-N pyraflufen-ethyl Chemical compound C1=C(Cl)C(OCC(=O)OCC)=CC(C=2C(=C(OC(F)F)N(C)N=2)Cl)=C1F APTZNLHMIGJTEW-UHFFFAOYSA-N 0.000 description 2
- BBEAQIROQSPTKN-UHFFFAOYSA-N pyrene Chemical compound C1=CC=C2C=CC3=CC=CC4=CC=C1C2=C43 BBEAQIROQSPTKN-UHFFFAOYSA-N 0.000 description 2
- 238000012175 pyrosequencing Methods 0.000 description 2
- 230000002787 reinforcement Effects 0.000 description 2
- 230000004043 responsiveness Effects 0.000 description 2
- 230000002441 reversible effect Effects 0.000 description 2
- 239000001022 rhodamine dye Substances 0.000 description 2
- 125000002652 ribonucleotide group Chemical group 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 230000003248 secreting effect Effects 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 238000007841 sequencing by ligation Methods 0.000 description 2
- 210000001057 smooth muscle myoblast Anatomy 0.000 description 2
- 210000000329 smooth muscle myocyte Anatomy 0.000 description 2
- 241000894007 species Species 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 210000002784 stomach Anatomy 0.000 description 2
- 210000000645 stria vascularis Anatomy 0.000 description 2
- 230000004083 survival effect Effects 0.000 description 2
- 210000001685 thyroid gland Anatomy 0.000 description 2
- 238000010361 transduction Methods 0.000 description 2
- 230000026683 transduction Effects 0.000 description 2
- 210000003932 urinary bladder Anatomy 0.000 description 2
- 210000004291 uterus Anatomy 0.000 description 2
- BCMCBBGGLRIHSE-UHFFFAOYSA-N 1,3-benzoxazole Chemical compound C1=CC=C2OC=NC2=C1 BCMCBBGGLRIHSE-UHFFFAOYSA-N 0.000 description 1
- BGGCPIFVRJFAKF-UHFFFAOYSA-N 1-[4-(1,3-benzoxazol-2-yl)phenyl]pyrrole-2,5-dione Chemical compound O=C1C=CC(=O)N1C1=CC=C(C=2OC3=CC=CC=C3N=2)C=C1 BGGCPIFVRJFAKF-UHFFFAOYSA-N 0.000 description 1
- RUFPHBVGCFYCNW-UHFFFAOYSA-N 1-naphthylamine Chemical compound C1=CC=C2C(N)=CC=CC2=C1 RUFPHBVGCFYCNW-UHFFFAOYSA-N 0.000 description 1
- HIYWOHBEPVGIQN-UHFFFAOYSA-N 1h-benzo[g]indole Chemical compound C1=CC=CC2=C(NC=C3)C3=CC=C21 HIYWOHBEPVGIQN-UHFFFAOYSA-N 0.000 description 1
- 101150055869 25 gene Proteins 0.000 description 1
- 101150112497 26 gene Proteins 0.000 description 1
- 101150057657 27 gene Proteins 0.000 description 1
- 101150106899 28 gene Proteins 0.000 description 1
- 101150051922 29 gene Proteins 0.000 description 1
- 101150110188 30 gene Proteins 0.000 description 1
- VTRBOZNMGVDGHY-UHFFFAOYSA-N 6-(4-methylanilino)naphthalene-2-sulfonic acid Chemical compound C1=CC(C)=CC=C1NC1=CC=C(C=C(C=C2)S(O)(=O)=O)C2=C1 VTRBOZNMGVDGHY-UHFFFAOYSA-N 0.000 description 1
- WQZIDRAQTRIQDX-UHFFFAOYSA-N 6-carboxy-x-rhodamine Chemical compound OC(=O)C1=CC=C(C([O-])=O)C=C1C(C1=CC=2CCCN3CCCC(C=23)=C1O1)=C2C1=C(CCC1)C3=[N+]1CCCC3=C2 WQZIDRAQTRIQDX-UHFFFAOYSA-N 0.000 description 1
- BZTDTCNHAFUJOG-UHFFFAOYSA-N 6-carboxyfluorescein Chemical compound C12=CC=C(O)C=C2OC2=CC(O)=CC=C2C11OC(=O)C2=CC=C(C(=O)O)C=C21 BZTDTCNHAFUJOG-UHFFFAOYSA-N 0.000 description 1
- UKLNSYRWDXRTER-UHFFFAOYSA-N 7-isocyanato-3-phenylchromen-2-one Chemical compound O=C1OC2=CC(N=C=O)=CC=C2C=C1C1=CC=CC=C1 UKLNSYRWDXRTER-UHFFFAOYSA-N 0.000 description 1
- NLSUMBWPPJUVST-UHFFFAOYSA-N 9-isothiocyanatoacridine Chemical compound C1=CC=C2C(N=C=S)=C(C=CC=C3)C3=NC2=C1 NLSUMBWPPJUVST-UHFFFAOYSA-N 0.000 description 1
- 241000271566 Aves Species 0.000 description 1
- 206010005003 Bladder cancer Diseases 0.000 description 1
- 241000282465 Canis Species 0.000 description 1
- 108091026890 Coding region Proteins 0.000 description 1
- 108091035707 Consensus sequence Proteins 0.000 description 1
- 241000588724 Escherichia coli Species 0.000 description 1
- SXRSQZLOMIGNAQ-UHFFFAOYSA-N Glutaraldehyde Chemical compound O=CCCCC=O SXRSQZLOMIGNAQ-UHFFFAOYSA-N 0.000 description 1
- 241000124008 Mammalia Species 0.000 description 1
- 210000002361 Megakaryocyte Progenitor Cell Anatomy 0.000 description 1
- 108700011259 MicroRNAs Proteins 0.000 description 1
- KWYHDKDOAIKMQN-UHFFFAOYSA-N N,N,N',N'-tetramethylethylenediamine Chemical compound CN(C)CCN(C)C KWYHDKDOAIKMQN-UHFFFAOYSA-N 0.000 description 1
- 108700019961 Neoplasm Genes Proteins 0.000 description 1
- 102000048850 Neoplasm Genes Human genes 0.000 description 1
- 108091034117 Oligonucleotide Proteins 0.000 description 1
- ZCQWOFVYLHDMMC-UHFFFAOYSA-N Oxazole Chemical compound C1=COC=N1 ZCQWOFVYLHDMMC-UHFFFAOYSA-N 0.000 description 1
- ISWSIDIOOBJBQZ-UHFFFAOYSA-N Phenol Chemical compound OC1=CC=CC=C1 ISWSIDIOOBJBQZ-UHFFFAOYSA-N 0.000 description 1
- 241000288906 Primates Species 0.000 description 1
- 206010036790 Productive cough Diseases 0.000 description 1
- 238000002123 RNA extraction Methods 0.000 description 1
- 108091028664 Ribonucleotide Proteins 0.000 description 1
- 108020004459 Small interfering RNA Proteins 0.000 description 1
- PJANXHGTPQOBST-VAWYXSNFSA-N Stilbene Natural products C=1C=CC=CC=1/C=C/C1=CC=CC=C1 PJANXHGTPQOBST-VAWYXSNFSA-N 0.000 description 1
- 108091046869 Telomeric non-coding RNA Proteins 0.000 description 1
- FZWLAAWBMGSTSO-UHFFFAOYSA-N Thiazole Chemical compound C1=CSC=N1 FZWLAAWBMGSTSO-UHFFFAOYSA-N 0.000 description 1
- 108091023040 Transcription factor Proteins 0.000 description 1
- 102000040945 Transcription factor Human genes 0.000 description 1
- 108020004566 Transfer RNA Proteins 0.000 description 1
- 239000007984 Tris EDTA buffer Substances 0.000 description 1
- 239000007983 Tris buffer Substances 0.000 description 1
- 208000007097 Urinary Bladder Neoplasms Diseases 0.000 description 1
- 241000251539 Vertebrata <Metazoa> Species 0.000 description 1
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- DPKHZNPWBDQZCN-UHFFFAOYSA-N acridine orange free base Chemical compound C1=CC(N(C)C)=CC2=NC3=CC(N(C)C)=CC=C3C=C21 DPKHZNPWBDQZCN-UHFFFAOYSA-N 0.000 description 1
- 150000001251 acridines Chemical class 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000033289 adaptive immune response Effects 0.000 description 1
- 125000003277 amino group Chemical group 0.000 description 1
- 238000013103 analytical ultracentrifugation Methods 0.000 description 1
- RWZYAGGXGHYGMB-UHFFFAOYSA-N anthranilic acid Chemical compound NC1=CC=CC=C1C(O)=O RWZYAGGXGHYGMB-UHFFFAOYSA-N 0.000 description 1
- 230000003466 anti-cipated effect Effects 0.000 description 1
- 150000001491 aromatic compounds Chemical class 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 210000002453 autonomic neuron Anatomy 0.000 description 1
- 210000000270 basal cell Anatomy 0.000 description 1
- 210000003651 basophil Anatomy 0.000 description 1
- HMFHBZSHGGEWLO-TXICZTDVSA-N beta-D-ribose Chemical group OC[C@H]1O[C@@H](O)[C@H](O)[C@@H]1O HMFHBZSHGGEWLO-TXICZTDVSA-N 0.000 description 1
- 238000001574 biopsy Methods 0.000 description 1
- 210000001772 blood platelet Anatomy 0.000 description 1
- 210000000988 bone and bone Anatomy 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 210000000481 breast Anatomy 0.000 description 1
- 210000001593 brown adipocyte Anatomy 0.000 description 1
- 210000000465 brunner gland Anatomy 0.000 description 1
- 210000002533 bulbourethral gland Anatomy 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000002619 cancer immunotherapy Methods 0.000 description 1
- 230000010261 cell growth Effects 0.000 description 1
- 230000001413 cellular effect Effects 0.000 description 1
- 210000003169 central nervous system Anatomy 0.000 description 1
- 210000003679 cervix uteri Anatomy 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 210000001612 chondrocyte Anatomy 0.000 description 1
- 210000002987 choroid plexus Anatomy 0.000 description 1
- 210000003737 chromaffin cell Anatomy 0.000 description 1
- 238000010224 classification analysis Methods 0.000 description 1
- 230000001427 coherent effect Effects 0.000 description 1
- 210000001072 colon Anatomy 0.000 description 1
- 239000002299 complementary DNA Substances 0.000 description 1
- 239000002131 composite material Substances 0.000 description 1
- 238000000205 computational method Methods 0.000 description 1
- 210000002808 connective tissue Anatomy 0.000 description 1
- 210000003239 corneal fibroblast Anatomy 0.000 description 1
- 229960000956 coumarin Drugs 0.000 description 1
- 150000004775 coumarins Chemical class 0.000 description 1
- 239000012531 culture fluid Substances 0.000 description 1
- 238000013434 data augmentation Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 210000005232 distal tubule cell Anatomy 0.000 description 1
- 229940079593 drug Drugs 0.000 description 1
- 239000003814 drug Substances 0.000 description 1
- 230000002357 endometrial effect Effects 0.000 description 1
- 210000002889 endothelial cell Anatomy 0.000 description 1
- 210000003979 eosinophil Anatomy 0.000 description 1
- 210000003426 epidermal langerhans cell Anatomy 0.000 description 1
- 210000003238 esophagus Anatomy 0.000 description 1
- 230000029142 excretion Effects 0.000 description 1
- 210000003499 exocrine gland Anatomy 0.000 description 1
- 238000010195 expression analysis Methods 0.000 description 1
- GVEPBJHOBDJJJI-UHFFFAOYSA-N fluoranthrene Natural products C1=CC(C2=CC=CC=C22)=C3C2=CC=CC3=C1 GVEPBJHOBDJJJI-UHFFFAOYSA-N 0.000 description 1
- 239000012634 fragment Substances 0.000 description 1
- 210000000232 gallbladder Anatomy 0.000 description 1
- 230000007614 genetic variation Effects 0.000 description 1
- 210000004602 germ cell Anatomy 0.000 description 1
- 210000002175 goblet cell Anatomy 0.000 description 1
- 208000035474 group of disease Diseases 0.000 description 1
- 239000001963 growth medium Substances 0.000 description 1
- 210000004919 hair shaft Anatomy 0.000 description 1
- 210000003128 head Anatomy 0.000 description 1
- 210000003958 hematopoietic stem cell Anatomy 0.000 description 1
- 150000002390 heteroarenes Chemical class 0.000 description 1
- 208000002557 hidradenitis Diseases 0.000 description 1
- 238000012165 high-throughput sequencing Methods 0.000 description 1
- 210000003630 histaminocyte Anatomy 0.000 description 1
- 239000005556 hormone Substances 0.000 description 1
- 229940088597 hormone Drugs 0.000 description 1
- 238000009396 hybridization Methods 0.000 description 1
- 125000002887 hydroxy group Chemical group [H]O* 0.000 description 1
- WGCNASOHLSPBMP-UHFFFAOYSA-N hydroxyacetaldehyde Natural products OCC=O WGCNASOHLSPBMP-UHFFFAOYSA-N 0.000 description 1
- PZOUSPYUWWUPPK-UHFFFAOYSA-N indole Natural products CC1=CC=CC2=C1C=CN2 PZOUSPYUWWUPPK-UHFFFAOYSA-N 0.000 description 1
- RKJUIXBNRJVNHR-UHFFFAOYSA-N indolenine Natural products C1=CC=C2CC=NC2=C1 RKJUIXBNRJVNHR-UHFFFAOYSA-N 0.000 description 1
- 230000001939 inductive effect Effects 0.000 description 1
- 210000002570 interstitial cell Anatomy 0.000 description 1
- 150000002500 ions Chemical class 0.000 description 1
- 210000002510 keratinocyte Anatomy 0.000 description 1
- 210000003292 kidney cell Anatomy 0.000 description 1
- 210000001039 kidney glomerulus Anatomy 0.000 description 1
- 210000004561 lacrimal apparatus Anatomy 0.000 description 1
- 210000002332 leydig cell Anatomy 0.000 description 1
- 210000000210 loop of henle Anatomy 0.000 description 1
- 235000019689 luncheon sausage Nutrition 0.000 description 1
- 210000004072 lung Anatomy 0.000 description 1
- 210000001165 lymph node Anatomy 0.000 description 1
- 210000005073 lymphatic endothelial cell Anatomy 0.000 description 1
- 210000003738 lymphoid progenitor cell Anatomy 0.000 description 1
- 210000001730 macula densa epithelial cell Anatomy 0.000 description 1
- 210000005075 mammary gland Anatomy 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 210000003593 megakaryocyte Anatomy 0.000 description 1
- 210000000135 megakaryocyte-erythroid progenitor cell Anatomy 0.000 description 1
- 201000001441 melanoma Diseases 0.000 description 1
- 210000004379 membrane Anatomy 0.000 description 1
- 239000012528 membrane Substances 0.000 description 1
- 210000002901 mesenchymal stem cell Anatomy 0.000 description 1
- WSFSSNUMVMOOMR-NJFSPNSNSA-N methanone Chemical compound O=[14CH2] WSFSSNUMVMOOMR-NJFSPNSNSA-N 0.000 description 1
- 239000002679 microRNA Substances 0.000 description 1
- 210000004925 microvascular endothelial cell Anatomy 0.000 description 1
- 210000004080 milk Anatomy 0.000 description 1
- 239000008267 milk Substances 0.000 description 1
- 235000013336 milk Nutrition 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000001823 molecular biology technique Methods 0.000 description 1
- 210000003550 mucous cell Anatomy 0.000 description 1
- 210000000107 myocyte Anatomy 0.000 description 1
- 125000005184 naphthylamino group Chemical group C1(=CC=CC2=CC=CC=C12)N* 0.000 description 1
- 210000003739 neck Anatomy 0.000 description 1
- 210000004498 neuroglial cell Anatomy 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 210000000440 neutrophil Anatomy 0.000 description 1
- 238000001821 nucleic acid purification Methods 0.000 description 1
- 210000001915 nurse cell Anatomy 0.000 description 1
- 235000015097 nutrients Nutrition 0.000 description 1
- 210000001517 olfactory receptor neuron Anatomy 0.000 description 1
- 210000004248 oligodendroglia Anatomy 0.000 description 1
- 210000002985 organ of corti Anatomy 0.000 description 1
- 210000000963 osteoblast Anatomy 0.000 description 1
- 210000002997 osteoclast Anatomy 0.000 description 1
- 210000001672 ovary Anatomy 0.000 description 1
- 210000001711 oxyntic cell Anatomy 0.000 description 1
- 210000000496 pancreas Anatomy 0.000 description 1
- 210000000277 pancreatic duct Anatomy 0.000 description 1
- 210000000608 photoreceptor cell Anatomy 0.000 description 1
- 108091008695 photoreceptors Proteins 0.000 description 1
- 230000001817 pituitary effect Effects 0.000 description 1
- 210000005134 plasmacytoid dendritic cell Anatomy 0.000 description 1
- 210000000557 podocyte Anatomy 0.000 description 1
- 229920001690 polydopamine Polymers 0.000 description 1
- 210000000229 preadipocyte Anatomy 0.000 description 1
- 238000001556 precipitation Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 125000002924 primary amino group Chemical group [H]N([H])* 0.000 description 1
- 208000037821 progressive disease Diseases 0.000 description 1
- 230000000272 proprioceptive effect Effects 0.000 description 1
- 210000000512 proximal kidney tubule Anatomy 0.000 description 1
- 238000000746 purification Methods 0.000 description 1
- 210000003742 purkinje fiber Anatomy 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000002285 radioactive effect Effects 0.000 description 1
- 230000008707 rearrangement Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000000611 regression analysis Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000011347 resin Substances 0.000 description 1
- 229920005989 resin Polymers 0.000 description 1
- 230000000241 respiratory effect Effects 0.000 description 1
- 210000002830 rete testis Anatomy 0.000 description 1
- 210000001525 retina Anatomy 0.000 description 1
- 239000002336 ribonucleotide Substances 0.000 description 1
- 108020004418 ribosomal RNA Proteins 0.000 description 1
- YGSDEFSMJLZEOE-UHFFFAOYSA-M salicylate Chemical compound OC1=CC=CC=C1C([O-])=O YGSDEFSMJLZEOE-UHFFFAOYSA-M 0.000 description 1
- 229960001860 salicylate Drugs 0.000 description 1
- 210000003079 salivary gland Anatomy 0.000 description 1
- 150000003839 salts Chemical class 0.000 description 1
- 210000001732 sebaceous gland Anatomy 0.000 description 1
- 210000004378 sebocyte Anatomy 0.000 description 1
- 210000000582 semen Anatomy 0.000 description 1
- 210000001625 seminal vesicle Anatomy 0.000 description 1
- 210000001044 sensory neuron Anatomy 0.000 description 1
- 210000003491 skin Anatomy 0.000 description 1
- 239000010454 slate Substances 0.000 description 1
- 239000000243 solution Substances 0.000 description 1
- 210000003802 sputum Anatomy 0.000 description 1
- 208000024794 sputum Diseases 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- PJANXHGTPQOBST-UHFFFAOYSA-N stilbene Chemical compound C=1C=CC=CC=1C=CC1=CC=CC=C1 PJANXHGTPQOBST-UHFFFAOYSA-N 0.000 description 1
- 235000021286 stilbenes Nutrition 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 125000000472 sulfonyl group Chemical group *S(*)(=O)=O 0.000 description 1
- 210000000106 sweat gland Anatomy 0.000 description 1
- 210000001179 synovial fluid Anatomy 0.000 description 1
- 210000002437 synoviocyte Anatomy 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
- 210000000108 taste bud cell Anatomy 0.000 description 1
- 210000002435 tendon Anatomy 0.000 description 1
- 210000001550 testis Anatomy 0.000 description 1
- ABZLKHKQJHEPAX-UHFFFAOYSA-N tetramethylrhodamine Chemical compound C=12C=CC(N(C)C)=CC2=[O+]C2=CC(N(C)C)=CC=C2C=1C1=CC=CC=C1C([O-])=O ABZLKHKQJHEPAX-UHFFFAOYSA-N 0.000 description 1
- MPLHNVLQVRSVEE-UHFFFAOYSA-N texas red Chemical compound [O-]S(=O)(=O)C1=CC(S(Cl)(=O)=O)=CC=C1C(C1=CC=2CCCN3CCCC(C=23)=C1O1)=C2C1=C(CCC1)C3=[N+]1CCCC3=C2 MPLHNVLQVRSVEE-UHFFFAOYSA-N 0.000 description 1
- 210000003684 theca cell Anatomy 0.000 description 1
- 230000001225 therapeutic effect Effects 0.000 description 1
- 239000010409 thin film Substances 0.000 description 1
- 210000001541 thymus gland Anatomy 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- LENZDBCJOHFCAS-UHFFFAOYSA-N tris Chemical compound OCC(N)(CO)CO LENZDBCJOHFCAS-UHFFFAOYSA-N 0.000 description 1
- 201000005112 urinary bladder cancer Diseases 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
- 230000002792 vascular Effects 0.000 description 1
- 201000010653 vesiculitis Diseases 0.000 description 1
- 230000001720 vestibular Effects 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
- 210000000636 white adipocyte Anatomy 0.000 description 1
- 239000001018 xanthene dye Substances 0.000 description 1
- 150000003732 xanthenes Chemical class 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Definitions
- Cancer is a complex group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Millions of new cases of cancer occur globally each year. Understanding the immune and tumor profile may help with diagnosis and treatment.
- the present disclosure discloses a method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition comprising obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
- the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
- the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
- the plurality of gene sets comprises 1, 2, 3, 4, 5, or 6 gene sets listed in Table 1.
- the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
- the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
- the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
- the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
- the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
- the method disclosed herein further comprises obtaining the biological sample of said subject.
- the biological sample is a solid tumor or liquid biopsy.
- the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
- the biological sample comprises cancer tissue.
- the cancer tissue comprises tumor-infiltrating immune cells.
- the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
- the method disclosed herein further comprises processing said biological sample to prevent or inhibit tissue degradation.
- the biological sample is processed into a formalin-fixed paraffin-embedded sample.
- the method disclosed herein further comprises extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
- the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
- the disease or condition is cancer.
- the cancer is a solid cancer or a hematopoietic cancer.
- the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
- the method disclosed herein further comprises selecting said subject for prediction of said treatment outcome based on said status.
- the treatment outcome corresponds to one or more cancer treatments.
- the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
- the method disclosed herein further comprises selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
- the subject is treated based at least on said determination indicative of said treatment outcome.
- the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
- a computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition comprising a processor and non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: (i) obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; (ii) conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; (iii) process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and (iv) generate a determination indicative of the treatment outcome based on the output.
- the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
- the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
- the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
- the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
- the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
- the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
- the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
- the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
- the processor is configured to obtain the gene expression data for the biological sample of said subject from a database.
- the biological sample is a solid tumor or liquid biopsy.
- the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
- the biological sample comprises cancer tissue.
- the cancer tissue comprises tumor-infiltrating immune cells.
- the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
- the biological sample is processed to prevent or inhibit tissue degradation.
- the biological sample is processed into a formalin-fixed paraffin-embedded sample.
- the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data.
- the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
- the disease or condition is cancer.
- the cancer is a solid cancer or a hematopoietic cancer.
- the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
- the subject is selected for prediction of said treatment outcome based on said status.
- the treatment outcome corresponds to one or more cancer treatments.
- the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
- the subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
- the subject is treated based at least on said determination indicative of said treatment outcome.
- the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
- a method for generating a trained machine learning model configured to generate a prediction of treatment outcome comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
- the plurality of biological samples is obtained from said subjects prior to receiving said treatment and said subjects
- the method further comprises configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
- FIG. 1 shows a receiver operating characteristic (ROC) curve of false positive rate (FPR) vs. true positive rate (TPR) of a machine learning model trained on a gene set enrichment analysis (GSEA) training set for clinical outcome according to one or more embodiments herein;
- ROC receiver operating characteristic
- FPR false positive rate
- TPR true positive rate
- GSEA gene set enrichment analysis
- FIG. 2 shows a graph of training samples across out of bag (OOB) samplings of a GSEA training set for clinical outcome according to one or more embodiments herein;
- FIG. 3 shows a graph of a percentage of patients that had a response to treatment (disease control rate, DCR) per score division (quartile) of a GSEA training set for clinical outcome according to one or more embodiments herein;
- FIG. 4 shows a non-limiting example of a computing device; in this case, a device with one or more processors, memory, storage, and a network interface; and
- FIG. 5 shows a non-limiting example of a workflow for processing a biological sample and using gene set enrichment analysis and machine learning model to predict a response to therapy or a treatment outcome.
- FIG. 6 shows ROC curves of false positive rate (1 - specificity) vs. true positive rate (sensitivity) for two models: a single-sample GSEA (ssGSEA) biomarker model and a PD-L1 biomarker model.
- the ssGSEA model shows better performance on a Head and Neck Squamous Cell Carcinoma (HNSCC) dataset than does the clinically used PD-L1 biomarker model. Shown are the mean OOB prediction scores of each of the samples used to train the model to build a single ROC curve, with a single value for AUC.
- FIG. 1 and FIG. 6 use slightly different forms of the GSEA biomarker model based on the same dataset.
- Machine learning models can be trained and used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.
- biomarker genes may be quantified directly and combined with immune cell information to make up a feature set for statistical analysis.
- the instant disclosure includes the discovery that a computationally simpler and more coherent approach using gene set enrichment analysis can provide accurate predictions without relying on such algorithms for quantifying immune cells within a sample.
- gene set enrichment scores may be directly used as features in a machine learning model to predict treatment outcome (e.g., response to immunotherapy) without going through an unnecessary intermediate step of deconvolving gene expression data to quantify immune cells and then using the quantified numbers as input features.
- the systems and methods disclosed herein can provide highly accurate evaluations or determinations indicative of an outcome.
- performance metrics include accuracy, specificity, sensitivity, positive predictive value, negative predictive value, and receiver operating characteristic/ area under receiver operating characteristic (ROC/AUROC). Any combination of these metrics may be determined for a machine learning model or classifier by testing it against a set of independent samples.
- True positive (TP) is a positive test result that detects the condition when the condition is present (e.g., positive response to cancer treatment).
- True negative (TN) is a negative test result that does not detect the condition when the condition is absent.
- False positive (FP) is a test result that detects the condition when the condition is absent.
- False negative (FN) is a test result that does not detect the condition when the condition is present.
- the performance metrics of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value can then be defined according to the following formulas:
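In terms of the four counts defined above (TP, TN, FP, FN), the standard textbook definitions of these metrics can be written as follows. PPV and NPV denote positive and negative predictive value; the notation is the conventional one and is given here as a reference sketch rather than as the exact typography of the original formulas.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{PPV}         &= \frac{TP}{TP + FP} \\
\text{NPV}         &= \frac{TN}{TN + FN}
\end{align}
```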
- the AUROC can be determined by creating the ROC curve, which entails plotting the true positive rate (TPR) against the false positive rate (FPR) across classification thresholds; the resulting area under the curve varies between 0 and 1.
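The ROC construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `roc_points` and `auroc`, the convention that label 1 means a positive outcome, and the toy labels and scores are all hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    labels: list of 0/1 values (1 = condition present, an assumed convention)
    scores: list of model scores, higher meaning more likely positive
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]
    # Each distinct score is used as a threshold, from strictest to most lenient.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(labels, scores):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: three responders (1) and three non-responders (0).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auroc(toy_labels, toy_scores)
```

In the toy data, eight of the nine responder/non-responder pairs are ranked correctly, so the area comes out to 8/9.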
- a sample may be evaluated according to the systems and methods disclosed herein to generate an evaluation or determination such as a prediction of treatment outcome that provide a minimum threshold of performance.
- the analytical algorithm or module (e.g., comprising a machine learning model)
- the analytical algorithm or module has an accuracy of at least about 50%, 60%, 70%, 80%, 90%, or 95%.
- the analytical algorithm or module has a specificity of at least about 50%, 60%, 70%, 80%, 90%, or 95%.
- the analytical algorithm or module has a sensitivity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a PPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an NPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an AUROC of at least about 0.6, 0.7, 0.8, 0.85, 0.9, or 0.95.
- In some embodiments, the methods disclosed herein comprise processing a biological sample to obtain gene expression data and performing gene set enrichment analysis on the gene expression data to generate an evaluation or prediction of outcome.
- a biological sample is processed to extract RNA 501.
- the biological sample may be a formalin-fixed paraffin-embedded (FFPE) sample.
- FFPE formalin-fixed paraffin-embedded
- the extracted RNA is used to generate an mRNA-Seq library 502.
- suitable methods may be used for library generation including commercial kits such as, for example, the QuantSeq 3’ mRNA-Seq library prep kit.
- Next Generation Sequencing is then performed on the library 503.
- Various suitable platforms may be used for the sequencing, for example, the NextSeq platform by Illumina.
- Gene Set Enrichment Analysis is performed on the gene expression data generated from the sequencing 504.
- Various gene sets may be used including independently curated gene sets as well as from public databases such as, for example, gene sets obtained from MSigDB.
- the gene sets can be derived from various collections such as hallmark gene sets, positional gene sets, curated gene sets, chemical and genetic perturbations, canonical pathways, regulatory target gene sets, microRNA targets, transcription factor targets, computational gene sets, cancer gene neighborhoods, cancer modules, ontology gene sets, Gene Ontology derived gene sets, oncogenic signature gene sets, immunologic signature gene sets, cell type signature gene sets, or any combination thereof.
- Subsets of the canonical pathways gene sets include gene sets derived from BioCarta pathway database, KEGG pathway database, PID pathway database, Reactome pathway database, and WikiPathways pathway database.
- the ssGSEA can be used to generate an output corresponding to the gene sets that have been evaluated using the gene expression data.
- the output can be a metric or a score, for example, an enrichment score for each gene set.
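As a rough illustration of how a per-sample enrichment score might be computed from gene expression data, the sketch below implements a simplified rank-weighted running-sum statistic in the spirit of ssGSEA. It is not the published ssGSEA implementation and is not taken from the disclosure; the function name `enrichment_score`, the exponent `alpha=0.25`, and the toy gene symbols G1 to G6 are assumptions made for illustration.

```python
def enrichment_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample, rank-based enrichment score (illustrative only).

    expression: dict mapping gene symbol -> expression value for ONE sample
    gene_set:   iterable of gene symbols
    alpha:      exponent applied to the rank weights (assumed value)
    """
    # Order genes from highest to lowest expression in this sample.
    genes = sorted(expression, key=expression.get, reverse=True)
    n = len(genes)
    members = set(gene_set) & set(genes)
    if not members or len(members) == n:
        raise ValueError("gene set must overlap the sample without covering it")

    # The most highly expressed gene gets rank n, the least expressed gets rank 1.
    weights = {g: (n - i) ** alpha for i, g in enumerate(genes)}
    hit_total = sum(weights[g] for g in members)
    miss_step = 1.0 / (n - len(members))

    hit_cdf = 0.0   # weighted cumulative fraction of gene-set members seen so far
    miss_cdf = 0.0  # cumulative fraction of non-members seen so far
    score = 0.0
    for g in genes:
        if g in members:
            hit_cdf += weights[g] / hit_total
        else:
            miss_cdf += miss_step
        # Accumulate the gap between the two running distributions.
        score += hit_cdf - miss_cdf
    return score


# Hypothetical sample with six genes, G1 most expressed and G6 least expressed.
sample = {"G1": 6.0, "G2": 5.0, "G3": 4.0, "G4": 3.0, "G5": 2.0, "G6": 1.0}
score_top = enrichment_score(sample, ["G1", "G2"])      # set at the top of the ranking
score_bottom = enrichment_score(sample, ["G5", "G6"])   # set at the bottom of the ranking
```

A gene set whose members sit near the top of the sample's ranking receives a positive score and one whose members sit near the bottom receives a negative score; one such score per gene set would then form the feature vector for that sample.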
- the machine learning model analyzes the enrichment scores corresponding to the gene sets to predict a response to therapy 505.
- Non-limiting examples of gene sets suitable for use according to the systems and methods disclosed herein are provided in Table 2.
- each enrichment score for a gene set forms a feature that makes up part of the input to the trained machine learning model.
- the response to therapy can be any suitable metric, indicator, or classification.
- a regression model may output a number between 0 and 1 indicative of responsiveness to therapy.
- a classifier may generate a classification between two or more categories such as, for example, response to treatment, partial response to treatment, no response to treatment, survival, etc.
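To make the idea of a model that maps enrichment scores to a number between 0 and 1 concrete, the following sketch trains a small logistic regression by batch gradient descent using only the Python standard library. Logistic regression is used here purely as a convenient stand-in and is not stated in the source to be the model used; the function names, the learning rate, the epoch count, and the toy feature values are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    features: list of per-sample feature vectors (e.g., enrichment scores)
    labels:   list of 0/1 outcomes (1 = responded, an assumed convention)
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_response(model, x):
    """Return a score in (0, 1); higher means a more likely response."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical training data: two enrichment-score features per sample.
train_x = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.0], [0.1, 0.3], [0.2, 0.1], [0.0, 0.2]]
train_y = [1, 1, 1, 0, 0, 0]
model = train_logistic(train_x, train_y)
p_high = predict_response(model, [0.85, 0.1])
p_low = predict_response(model, [0.05, 0.2])
```

Thresholding the returned score (for example at 0.5) turns the same model into a two-category classifier; a multi-category output such as response, partial response, or no response would need a different formulation.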
- Various suitable machine learning models can be used.
- Support vector machine (SVM) is suitable for both regression and classification analysis and can provide a high level of accuracy without requiring significant computing power.
- PCA principal component analysis
- the machine learning model is configured to process at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 gene set metrics (e.g., enrichment scores). In some cases, the machine learning model is configured to process no more than 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, or 500 gene sets. In some cases, each gene set independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
- each gene set independently comprises no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
- gene set enrichment analysis may be performed on 10 different gene sets, one of which has 10 genes and one of which has 200 genes.
- the systems and methods disclosed herein utilize 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 gene sets, each of which independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes and/or no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes.
- one or more of the gene sets used as features in the predictive model do not utilize the full list of genes within a known gene set. For example, fewer than 100% of the genes in a given gene set may be used for calculating an output or metric for that gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set.
- gene set 1 (“HALLMARK EPITHELIAL MESENCHYMAL TRANSITION”) from Table 2 includes 200 genes associated with epithelial to mesenchymal transition.
- the gene set enrichment analysis performed according to the systems and methods disclosed herein may utilize 50% of the genes in this gene set with respect to epithelial to mesenchymal transition in combination with certain independently determined percentages of other gene sets in Table 1.
- any combination of genes within each gene set may be used for gene set enrichment analysis to generate a corresponding output metric such as an enrichment score. Then the output for a plurality of gene sets can be used as input features provided to a machine learning algorithm or model to generate a composite score indicative of a prediction such as an outcome or treatment outcome.
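Using only a fraction of a gene set's members can be sketched as below. The source says only that any combination of genes may be used, so the seeded random selection here is an illustrative assumption, as are the function name `subsample_gene_set` and the placeholder gene identifiers.

```python
import random


def subsample_gene_set(genes, fraction, seed=0):
    """Return a reproducible subset holding roughly `fraction` of the genes.

    genes:    iterable of gene identifiers making up the full gene set
    fraction: value in (0, 1], e.g. 0.5 to keep half of the genes
    seed:     seed for the random generator so the subset is repeatable
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    genes = list(genes)
    # Keep at least one gene so the subset can still be scored.
    k = max(1, round(fraction * len(genes)))
    rng = random.Random(seed)
    return sorted(rng.sample(genes, k))


# Placeholder identifiers standing in for a 200-gene set.
full_set = [f"GENE{i}" for i in range(200)]
half_set = subsample_gene_set(full_set, 0.5)
```

Each gene set can be given its own independently chosen fraction before the enrichment analysis is run on the reduced sets.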
- the identities of the genes making up each gene set listed in Table 2 can be found on the publicly accessible database MSigDB and are also listed in Table 3, which shows the gene member identification used by MSigDB alongside the corresponding NCBI Gene ID and Gene Symbol.
- the output can make up the features that are processed using an algorithm such as a trained model generated using machine learning to generate an evaluation such as a predicted treatment outcome.
- a predicted treatment outcome from a sample of a subject.
- the subject has or is suspected of having a disease or disorder.
- the disease or disorder can be a cancer.
- the predicted treatment outcome is for an immunotherapy targeting a cancer.
- the methods disclosed herein comprise obtaining a sample from a subject.
- the sample is any fluid or other material derived from the body of a normal or disease subject including, but not limited to, blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, milk, amniotic fluid, bile, ascites fluid, organ or tissue extract, and culture fluid in which any cells or tissue preparation from a subject has been incubated.
- the sample is obtained from skin, blood, brain, bladder, bone, bone marrow, breast, colon, stomach, esophagus, ovary, uterus, gallbladder, fallopian tube, testicle, kidney, liver, pancreas, adrenal gland, cervix, endometrium, head or neck, lung, prostate, thymus, thyroid, lymph node, or urinary bladder.
- the sample is a cancer sample or biopsy.
- the cancer sample is typically a solid tumor sample or a liquid tumor sample.
- the cancer sample can be obtained from excised tissue.
- the sample is fresh, frozen, or fixed.
- a fixed sample is paraffin-embedded or fixed with formalin, formaldehyde, or glutaraldehyde.
- the sample is formalin-fixed paraffin-embedded.
- the sample is stored after it has been collected, but before additional steps are to be performed. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. In some instances, the sample is stored in a solution comprising glycerol, glycol, dimethyl sulfoxide, growth media, nutrient broth or any combination thereof. The sample may be stored for any suitable period of time. In some instances, the sample is stored for any period of time and remains suitable for downstream applications.
- the sample is stored for any period of time before nucleic acid (e.g., ribonucleic acid (RNA) or deoxyribonucleic acid (DNA)) extraction.
- the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months.
- the sample is stored for at least 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, 11 years, 12 years, or more than 12 years.
- Methods and systems as described herein comprise generating an immune-oncology profile from a sample of a subject, wherein the sample comprises a nucleic acid molecule.
- the nucleic acid molecule is RNA, DNA, fragments, or combinations thereof.
- the sample is processed further before analysis.
- the sample is processed to extract the nucleic acid molecule from the sample.
- no extraction or processing procedures are performed on the sample.
- the nucleic acid is extracted using any technique that does not interfere with subsequent analysis. Extraction techniques include, for example, alcohol precipitation using ethanol, methanol or isopropyl alcohol. In some instances, extraction techniques use phenol, chloroform, or any combination thereof.
- extraction techniques use a column or resin based nucleic acid purification scheme such as those commonly sold commercially.
- the nucleic acid molecule is purified.
- the nucleic acid molecule is further processed.
- RNA is further reverse transcribed to cDNA.
- processing of the nucleic acid comprises amplification.
- the nucleic acid is stored in water, Tris buffer, or Tris-EDTA buffer before subsequent analysis.
- the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C.
- a nucleic acid molecule obtained from a sample may be characterized by factors such as integrity of the nucleic acid molecule or size of the nucleic acid molecule. In some instances, the nucleic acid molecule is DNA.
- the nucleic acid molecule is RNA.
- the RNA or DNA comprises a specific integrity.
- the RNA integrity number (RIN) of the RNA is no more than about 2.
- the RNA molecules in a sample have a RIN of about 2 to about 10.
- the RNA molecules in a sample have a RIN of at least about 2.
- the RNA molecules in a sample have a RIN of at most about 10.
- the RNA molecules in a sample have a RIN of about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 7, about 2 to about 8, about 2 to about 9, about 2 to about 10, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 7, about 3 to about 8, about 3 to about 9, about 3 to about 10, about 4 to about 5, about 4 to about 6, about 4 to about 7, about 4 to about 8, about 4 to about 9, about 4 to about 10, about 5 to about 6, about 5 to about 7, about 5 to about 8, about 5 to about 9, about 5 to about 10, about 6 to about 7, about 6 to about 8, about 6 to about 9, about 6 to about 10, about 7 to about 8, about 7 to about 9, about 7 to about 10, about 8 to about 9, about 8 to about 10, or about 9 to about 10.
- the RNA molecule in a sample may be characterized by size. In some instances, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%, or more of the RNA molecules in a sample are at least 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, or more than 400 nucleotides in size. In some instances, the RNA molecules in the sample are at least 200 nucleotides in size. In some instances, the RNA molecules of at least 200 nucleotides in size comprise a percentage of the sample (DV200).
- the percentage is at least or about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95%.
- the RNA molecules in a sample have a DV200 value of about 10% to about 90%. In some instances, the RNA molecules in a sample have a DV200 value of at least about 10%. In some instances, the RNA molecules in a sample have a DV200 value of at most about 90%.
- the RNA molecules in a sample have a DV200 value of about 10% to about 20%, about 10% to about 30%, about 10% to about 40%, about 10% to about 50%, about 10% to about 60%, about 10% to about 70%, about 10% to about 80%, about 10% to about 90%, about 20% to about 30%, about 20% to about 40%, about 20% to about 50%, about 20% to about 60%, about 20% to about 70%, about 20% to about 80%, about 20% to about 90%, about 30% to about 40%, about 30% to about 50%, about 30% to about 60%, about 30% to about 70%, about 30% to about 80%, about 30% to about 90%, about 40% to about 50%, about 40% to about 60%, about 40% to about 70%, about 40% to about 80%, about 40% to about 90%, about 50% to about 60%, about 50% to about 70%, about 50% to about 80%, about 50% to about 90%, about 60% to about 70%, about 60% to about 80%, about 60% to about 90%, about 70% to about 80%, about 70% to about 90%, or about 80% to about 90%.
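As a rough illustration of the DV200 metric described above, the following sketch computes the percentage of RNA fragments that are at least 200 nucleotides long. It assumes a simple per-molecule count over a list of fragment lengths; laboratory instruments typically derive DV200 from an electropherogram signal instead, so this is a simplification. The function names and the 40% default cutoff are hypothetical, chosen only because 40% is one of the values listed above.

```python
from typing import Iterable


def dv200(fragment_lengths: Iterable[int], threshold: int = 200) -> float:
    """Percentage of fragments that are at least `threshold` nucleotides long."""
    lengths = list(fragment_lengths)
    if not lengths:
        raise ValueError("no fragments supplied")
    long_enough = sum(1 for n in lengths if n >= threshold)
    return 100.0 * long_enough / len(lengths)


def passes_dv200(fragment_lengths: Iterable[int], minimum_percent: float = 40.0) -> bool:
    """True when the sample's DV200 meets a chosen minimum (assumed cutoff)."""
    return dv200(fragment_lengths) >= minimum_percent
```

For example, a sample with fragment lengths of 100, 150, 200, and 300 nucleotides has two of four fragments at or above 200 nucleotides, giving a DV200 of 50%.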
- the nucleic acid molecule is prepared for sequencing.
- a sequencing library is prepared. Numerous library generation methods have been described.
- methods for library generation comprise addition of a sequencing adapter. Sequencing adapters may be added to the nucleic acid molecule by ligation.
- library generation comprises an end-repair reaction.
- library generation for sequencing comprises an enrichment step. For example, coding regions of the mRNA are enriched. In some instances, the enrichment step is for a subset of genes. In some instances, the enrichment step comprises using a bait set.
- the bait set may be used to enrich for genes used for specific downstream applications.
- a bait set generally refers to a set of baits targeted toward a selected set of genomic regions of interest. For example, a bait set may be selected for genomic regions relating to at least one of immune modulatory molecule expression, cell type and ratio, or mutational burden. In some instances, one bait set is used for determining immune modulatory molecule expression, a second bait set is used for determining cell type and ratio, and a third bait set is used for determining mutational burden.
- a bait set comprises at least one unique molecular identifier (UMI).
- UMI refers to nucleic acid having a sequence which can be used to identify and/or distinguish one or more first molecules to which the UMI is conjugated from one or more second molecules.
- the UMI is conjugated to one or more target molecules of interest or amplification products thereof.
- UMIs may be single or double stranded.
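One common use of UMIs is to distinguish original molecules from amplification duplicates. The sketch below counts distinct UMIs per gene from `(gene, umi)` pairs. It is a minimal illustration under stated assumptions: UMIs are compared by exact match with no sequencing-error correction, and the input format, gene names, and function name `count_unique_molecules` are hypothetical rather than taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_unique_molecules(reads: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UMIs observed for each gene.

    Reads that share both gene and UMI are treated as copies of one molecule.
    """
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}
```

Two reads of the same gene with the same UMI collapse to a single count, while the same UMI seen on a different gene is counted separately.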
- the systems and methods disclosed herein provide for the sequencing of a number of genes.
- the number of genes is at least about 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, or more than 10000 genes.
- the number of genes to be sequenced is in a range of about 500 to about 1000 genes.
- the number of genes to be sequenced is at least about 200.
- the number of genes to be sequenced is at most about 10,000.
- the number of genes to be sequenced is in a range of about 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 6,000, 200 to 8,000, 200 to 10,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 6,000, 500 to 8,000, 500 to 10,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 6,000, 1,000 to 8,000, 1,000 to 10,000, 2,000 to 4,000, 2,000 to 6,000, 2,000 to 8,000, 2,000 to 10,000, 4,000 to 6,000, 4,000 to 8,000, 4,000 to 10,000, 6,000 to 8,000, 6,000 to 10,000, or 8,000 to 10,000.
- Sequencing may be performed with any appropriate sequencing technology.
- sequencing methods include, but are not limited to single molecule real-time sequencing, Polony sequencing, sequencing by ligation, reversible terminator sequencing, proton detection sequencing, ion semiconductor sequencing, nanopore sequencing, electronic sequencing, pyrosequencing, Maxam-Gilbert sequencing, chain termination (e.g., Sanger) sequencing, +S sequencing, or sequencing by synthesis.
- Sequencing methods may include, but are not limited to, one or more of: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, and primer walking. Sequencing may generate sequencing reads (“reads”), which may be processed (e.g., alignment) to yield longer sequences, such as consensus sequences.
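To illustrate how aligned reads can be reduced to a consensus sequence, the sketch below takes a per-position majority vote. It is deliberately simplified: it assumes the reads have already been aligned and padded to equal length, and it does not assemble overlapping reads into longer sequences. The function name `consensus` is hypothetical, and ties are resolved by whichever base `collections.Counter` returns first.

```python
from collections import Counter
from typing import Sequence


def consensus(aligned_reads: Sequence[str]) -> str:
    """Majority-vote consensus across equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("no reads supplied")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must already be aligned to equal length")
    # zip(*reads) yields one column of bases per alignment position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )
```

With three reads `ACGT`, `ACGA`, and `ACGT`, the final position is decided two to one in favor of `T`.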
- An average read length from sequencing may vary.
- the average read length is at least about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, or more than 80000 base pairs.
- the average read length is in a range of about 100 to 80,000.
- the average read length is at least about 100.
- the average read length is at most about 80,000.
- the average read length is in a range of about 100 to 200, 100 to 300, 100 to 500, 100 to 1,000, 100 to 2,000, 100 to 4,000, 100 to 8,000, 100 to 10,000, 100 to 20,000, 100 to 40,000, 100 to 80,000, 200 to 300, 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 8,000, 200 to 10,000, 200 to 20,000, 200 to 40,000, 200 to 80,000, 300 to 500, 300 to 1,000, 300 to 2,000, 300 to 4,000, 300 to 8,000, 300 to 10,000, 300 to 20,000, 300 to 40,000, 300 to 80,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 8,000, 500 to 10,000, 500 to 20,000, 500 to 40,000, 500 to 80,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 8,000, 1,000 to 10,000, 1,000 to 20,000, 1,000 to 40,000, 1,000 to 80,000, 2,000 to 4,000, 2,000 to 8,000, 2,000 to 10,000, 2,000 to 20,000, 40,000, 1,000 to 80,000, 2,000
- a number of nucleotides that are sequenced is at least or about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 2000, 2500, 3000, or more than 3000 nucleotides. In some instances, the number of nucleotides that are sequenced is about 5 to about 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is at least 5 nucleotides. In some instances, the number of nucleotides that are sequenced is at most 3,000 nucleotides.
- the number of nucleotides that are sequenced are 5 to 50, 5 to 100, 5 to 200, 5 to 400, 5 to 600, 5 to 800, 5 to 1,000, 5 to 1,500, 5 to 2,000, 5 to 2,500, 5 to 3,000, 50 to 100, 50 to 200, 50 to 400, 50 to 600, 50 to 800, 50 to 1,000, 50 to 1,500, 50 to 2,000, 50 to 2,500, 50 to 3,000, 100 to 200, 100 to 400, 100 to 600, 100 to 800, 100 to 1,000, 100 to 1,500, 100 to 2,000, 100 to 2,500, 100 to 3,000, 200 to 400, 200 to 600, 200 to 800, 200 to 1,000, 200 to 1,500, 200 to 2,000, 200 to 2,500, 200 to 3,000, 400 to 600, 400 to 800, 400 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 400 to 1,500,
- Sequencing methods may include a barcoding or “tagging” step.
- barcoding (or “tagging”) can allow for generation of a population of samples of nucleic acids, wherein each nucleic acid can be identified from which sample the nucleic acid originated.
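The idea that each nucleic acid can be traced back to its sample of origin can be sketched as a demultiplexing step. This minimal example assumes every read begins with a fixed-length barcode and that barcodes are matched exactly; real pipelines usually tolerate mismatches and may place barcodes elsewhere. The function name `demultiplex`, the `"unassigned"` bucket, and the sample labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(
    reads: Iterable[str], barcode_to_sample: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group reads by sample using an exact-match barcode at the start of each read."""
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of one fixed length")
    size = lengths.pop()
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:size], "unassigned")
        # Strip the barcode from assigned reads; keep unassigned reads intact.
        by_sample[sample].append(read[size:] if sample != "unassigned" else read)
    return dict(by_sample)
```

Reads whose prefix matches no known barcode are kept whole in the `"unassigned"` bucket so they can be inspected later.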
- the barcode comprises oligonucleotides that are ligated to the nucleic acids.
- the barcode is ligated using an enzyme, including but not limited to, E. coli ligase, T4 ligase, mammalian ligases (e.g., DNA ligase I, DNA ligase II, DNA ligase III, DNA ligase IV), thermostable ligases, and fast ligases.
- Barcoding or tagging may occur using various types of barcodes or tags.
- barcodes or tags include, but are not limited to, a radioactive barcode or tag, a fluorescent barcode or tag, an enzyme, a chemiluminescent barcode or tag, and a colorimetric barcode or tag.
- the barcode or tag is a fluorescent barcode or tag.
- the fluorescent barcode or tag comprises a fluorophore.
- the fluorophore is an aromatic or heteroaromatic compound.
- the fluorophore is a pyrene, anthracene, naphthalene, acridine, stilbene, benzoxazole, indole, benzindole, oxazole, thiazole, benzothiazole, cyanine, carbocyanine, salicylate, anthranilate, xanthene dye, or coumarin.
- xanthene dyes include, e.g., fluorescein and rhodamine dyes.
- Fluorescein and rhodamine dyes include, but are not limited to, 6-carboxyfluorescein (FAM), 2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein (JOE), tetrachlorofluorescein (TET), 6-carboxyrhodamine (R6G), N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA), and 6-carboxy-X-rhodamine (ROX).
- the fluorescent barcode or tag also includes the naphthylamine dyes that have an amino group in the alpha or beta position.
- naphthylamino compounds include 1-dimethylaminonaphthyl-5-sulfonate, 1-anilino-8-naphthalene sulfonate, 2-p-toluidinyl-6-naphthalene sulfonate, and 5-(2'-aminoethyl)aminonaphthalene-1-sulfonic acid (EDANS).
- Examples of coumarins include, e.g., 3-phenyl-7-isocyanatocoumarin; acridines, such as 9- isothiocyanatoacridine and acridine orange; N-(p-(2-benzoxazolyl)phenyl) maleimide; cyanines, such as, e.g., indodi carbocyanine 3 (Cy3), indodicarbocyanine 5 (Cy5), indodicarbocyanine 5.5 (Cy5.5), 3-(-carboxy-pentyl)-3'-ethyl-5,5'-dimethyloxacarbocyanine (CyA); 1H, 5H, 11H, 15H- Xantheno[2,3, 4-ij : 5,6,7-i'j ']diquinolizin-l 8-ium, 9-[2 (or 4)-[[[6-[2,5-dioxo-l- pyrroli
- barcode lengths include barcode sequences comprising, without limitation, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more bases in length.
- barcode lengths include barcode sequences comprising, without limitation, from 1-5, 1-10, 5-20, or 1-25 bases in length. Barcode systems may be in base 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or a similar coding scheme.
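Barcode length and coding base together bound how many distinct barcodes are available: a barcode of a given length in a given base can take at most base raised to the length distinct values. The sketch below shows that arithmetic under the assumption that every position is used freely, with no error-correcting constraints. The function names are hypothetical, and the default base of 4 reflects the four nucleotides.

```python
def barcode_capacity(length: int, base: int = 4) -> int:
    """Maximum number of distinct barcodes of `length` symbols in `base`."""
    if length < 1 or base < 1:
        raise ValueError("length and base must be positive")
    return base ** length


def min_barcode_length(n_barcodes: int, base: int = 4) -> int:
    """Shortest barcode length able to distinguish `n_barcodes` barcodes."""
    if n_barcodes < 1:
        raise ValueError("n_barcodes must be positive")
    if base < 2 and n_barcodes > 1:
        raise ValueError("a base of at least 2 is needed to distinguish barcodes")
    length = 1
    while base ** length < n_barcodes:
        length += 1
    return length
```

Under these assumptions, a 6-base nucleotide barcode allows up to 4,096 distinct barcodes, and one million barcodes need at least 10 bases.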
- a number of barcodes is at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000, 25000, 30000, 40000, 50000, 100000, 500000, 1000000, or more than 1000000 barcodes. In some instances, a number of barcodes is in a range of 1-1000000 barcodes.
- the number of barcodes is in a range of about 1-10, 1-50, 1-100, 1-500, 1-1000, 1-5000, 1-10000, 1-50000, 1-100000, 1-500000, 1-1000000, 10-50, 10-100, 10-500, 10-1000, 10-5000, 10-10000, 10-50000, 10-100000, 10-500000, 10-1000000, 50-100, 50-500, 50-1000, 50-5000, 50-10000, 50-50000, 50-100000, 50-500000, 50-1000000, 100-500, 100-1000, 100-5000, 100-10000, 100-50000, 100-100000, 100-500000, 100-1000000, 500-1000, 500-5000, 500-10000, 500-50000, 500-100000, 500-500000, 500-1000000, 1000-5000, 1000-10000, 1000-50000, 1000-100000, 1000-500000, 1000-1000000, 5000-10000, 5000-50000, 5000-100000, 5000-500000, 5000-1000000, 10000-50000, 10000-100000, 10000-500000, 10000-1000000, 50000-100000, 50000-500000, 50000-1000000, 100000-500000, 100000-1000000, or 500000-1000000 barcodes.
- GSEA Gene Set Enrichment Analysis
- a predefined set of genes may be evaluated to produce an output or metric such as, for example, a score corresponding to the difference between two or more categories or biological states.
- Multiple sets of genes can be evaluated to generate multiple such outputs or metrics.
- These outputs or metrics may comprise the features of a model such as a trained machine learning model configured to generate predictions with respect to the categories or biological state.
- the model may be a regression that generates an output along a continuum (e.g., any value between 0 and 1) or a classifier which generates a classification for a data set.
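The distinction between a regression output on a continuum and a classifier output can be shown with a small sketch. It assumes a logistic model over enrichment-score features, which the source does not specify; the weights, bias, 0.5 threshold, and the labels `"responder"` and `"non-responder"` are all hypothetical placeholders for values a trained model would supply.

```python
import math
from typing import Sequence


def regression_score(
    features: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Map features to a value between 0 and 1 with a logistic function."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(x * w for x, w in zip(features, weights))
    # Numerically stable logistic: avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(score: float, threshold: float = 0.5) -> str:
    """Turn a continuous score into a class label (labels are illustrative)."""
    return "responder" if score >= threshold else "non-responder"
```

The same underlying score can therefore be reported directly as a continuous value or thresholded into a classification, depending on how the prediction is to be used.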
- the sample often comprises a heterogeneous composition of different cell types and/or subtypes.
- the sample is a tumor sample.
- the cell types and/or subtypes that make up the sample include one or more of cancer cells, non-cancer cells, and/or immune cells.
- non-immune cells examples include salivary gland cells, mammary gland cells, lacrimal gland cells, ceruminous gland cells, eccrine sweat gland cells, apocrine sweat gland cells, sebaceous gland cells, Bowman's gland cells, Brunner's gland cells, prostate gland cells, seminal vesicle cells, bulbourethral gland cells, keratinizing epithelial cells, hair shaft cells, epithelial cells, exocrine secretory epithelial cells, uterus endometrium cells, isolated goblet cells of respiratory and digestive tracts, stomach lining mucous cells, hormone secreting cells, pituitary cells, gut and respiratory tract cells, thyroid gland cells, adrenal gland cells, chromaffin cells, Leydig cells, theca interna cells, macula densa cells of kidney, peripolar cells of kidney, mesangial cells of kidney, hepatocytes, white fat cells, brown fat cells, liver lipocytes, kidney cells, kidney glomerulus parietal
- lymphoid cells include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
- the cells are stromal cells, for example, mesenchymal stem cells, adipocytes, preadipocytes, stromal cells, fibroblasts, pericytes, endothelial cells, microvascular endothelial cells, lymphatic endothelial cells, smooth muscle cells, chondrocytes, osteoblasts, skeletal muscle cells, myocytes.
- stem cells include, but are not limited to, hematopoietic stem cells, common lymphoid progenitor cells, common myeloid progenitor cells, granulocyte-macrophage progenitor cells, megakaryocyte-erythroid progenitor cells, multipotent progenitor cells, megakaryocytes, erythrocytes, and platelets.
- myeloid cells include, but are not limited to, monocytes, macrophages, macrophages M1, macrophages M2, dendritic cells, conventional dendritic cells, plasmacytoid dendritic cells, immature dendritic cells, neutrophils, eosinophils, mast cells, and basophils.
- the sequencing data comprises genes that are differentially expressed by various immune cell types.
- immune cells to be detected by methods described herein include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
- the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.
- the term “about” in reference to a percentage refers to an amount that is greater or less than the stated percentage by 10%, 5%, or 1%, including increments therein.
- each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
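The seven cases enumerated in this definition are exactly the non-empty combinations of the listed items. As a small worked illustration (the function name `at_least_one_of` is hypothetical), they can be generated as follows.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def at_least_one_of(items: Sequence[str]) -> List[Tuple[str, ...]]:
    """All non-empty combinations of `items`, smallest combinations first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
```

For three items the function returns seven combinations, in the same order as the definition: each item alone, each pair, then all three together.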
- RNA refers to a molecule comprising at least one ribonucleotide residue.
- RNA may include transcripts.
- by “ribonucleotide” is meant a nucleotide with a hydroxyl group at the 2’ position of a beta-D-ribofuranose moiety.
- RNA includes, but is not limited to, mRNA, ribosomal RNA, tRNA, non-protein-coding RNA (npcRNA), non-messenger RNA, functional RNA (fRNA), long non-coding RNA (lncRNA), pre-mRNAs, and primary miRNAs (pri-miRNAs).
- RNA includes, for example, double-stranded (ds) RNAs; single-stranded RNAs; and isolated RNAs such as partially purified RNA, essentially pure RNA, synthetic RNA, recombinant RNA, as well as altered RNA that differ from naturally-occurring RNA by the addition, deletion, substitution and/or alteration of one or more nucleotides.
- alterations can include addition of non-nucleotide material, such as to the end(s) of the siRNA or internally, for example at one or more nucleotides of the RNA.
- Nucleotides in the RNA molecules described herein can also comprise non-standard nucleotides, such as non-naturally occurring nucleotides or chemically synthesized nucleotides or deoxynucleotides. These altered RNAs can be referred to as analogs or analogs of naturally-occurring RNA.
- Unless specifically stated or obvious from context, as used herein, the term “about” in reference to a number or range of numbers is understood to mean the stated number and numbers +/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.
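The numeric meaning of “about” given here can be expressed as a short check. The sketch below assumes non-negative values and treats the tolerance as a fraction of the stated amount, defaulting to the 10% used in this definition; passing 0.05 or 0.01 would model the 5% and 1% alternatives mentioned elsewhere. The function names `is_about` and `about_range` are hypothetical.

```python
from typing import Tuple


def is_about(value: float, stated: float, tolerance: float = 0.10) -> bool:
    """True when `value` is within +/- `tolerance` (as a fraction) of `stated`."""
    return abs(value - stated) <= tolerance * abs(stated)


def about_range(low: float, high: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Widen a range: `tolerance` below the lower limit and above the upper limit."""
    return low * (1.0 - tolerance), high * (1.0 + tolerance)
```

Under this reading, a range of about 2 to about 10 spans 1.8 to 11.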
- sample generally refers to a biological sample of a subject.
- the biological sample may be a tissue or fluid of the subject, such as blood (e.g., whole blood), plasma, serum, urine, saliva, mucosal excretions, sputum, stool and tears.
- the biological sample may be derived from a tissue or fluid of the subject.
- the biological sample may be a tumor sample or heterogeneous tissue sample.
- the biological sample may have or be suspected of having disease tissue.
- the tissue may be processed to obtain the biological sample.
- the biological sample may be a cellular sample.
- the biological sample may be a cell-free (or cell free) sample, such as cell-free DNA or RNA.
- the biological sample may comprise cancer cells, non-cancer cells, immune cells, non-immune cells, or any combination thereof.
- the biological sample may be a tissue sample.
- the biological sample may be a liquid sample.
- the liquid sample can be a cancer or non-cancer sample.
- Non-limiting examples of liquid biological samples include synovial fluid, whole blood, blood plasma, lymph, bone marrow, cerebrospinal fluid, serum, seminal fluid, urine, and amniotic fluid.
- variant generally refers to a genetic variant, such as an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual.
- Single nucleotide polymorphisms are a form of polymorphisms.
- one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences.
- Copy number variants (CNVs), transversions and other rearrangements are also forms of genetic variation.
- a genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
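As a simplified illustration of identifying variants with respect to a reference, the sketch below reports single-base differences between a reference sequence and a sample sequence, and flags whether a substitution is a transversion (a purine exchanged for a pyrimidine or the reverse). It assumes the two sequences are already aligned and of equal length, so insertions, deletions, and copy number changes are out of scope. The function names are hypothetical, and positions are 0-based.

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def find_snvs(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


def is_transversion(ref: str, alt: str) -> bool:
    """True when a substitution swaps a purine and a pyrimidine."""
    return (ref in PURINES and alt in PYRIMIDINES) or (
        ref in PYRIMIDINES and alt in PURINES
    )
```

Comparing `ACGT` with `ACTT` yields one variant at position 2, a G to T change, which is a transversion.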
- the term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets.
- the subject can be a healthy individual, an individual that has or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy.
- the subject can be a patient.
- the subject may have or be suspected of having a disease.
- Referring to FIG. 4, a block diagram is shown depicting an exemplary machine that includes a computer system 400 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure.
- the components in FIG. 4 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
- Computer system 400 may include one or more processors 401, a memory 403, and a storage 408 that communicate with each other, and with other components, via a bus 440.
- the bus 440 may also link a display 432, one or more input devices 433 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 434, one or more storage devices 435, and various tangible storage media 436. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 440.
- the various tangible storage media 436 can interface with the bus 440 via storage medium interface 426.
- Computer system 400 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
- Computer system 400 includes one or more processor(s) 401 (e.g., central processing units (CPUs) or general purpose graphics processing units (GPGPUs)) that carry out functions.
- processor(s) 401 optionally contains a cache memory unit 402 for temporary local storage of instructions, data, or computer addresses.
- Processor(s) 401 are configured to assist in execution of computer readable instructions.
- Computer system 400 may provide functionality for the components depicted in FIG. 4 as a result of the processor(s) 401 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 403, storage 408, storage devices 435, and/or storage medium 436.
- the computer-readable media may store software that implements particular embodiments, and processor(s) 401 may execute the software.
- Memory 403 may read the software from one or more other computer-readable media (such as mass storage device(s) 435, 436) or from one or more other sources through a suitable interface, such as network interface 420.
- the software may cause processor(s) 401 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 403 and modifying the data structures as directed by the software.
- the memory 403 may include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 404) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random access memory (FRAM), phase-change random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 405), and any combinations thereof.
- ROM 405 may act to communicate data and instructions unidirectionally to processor(s) 401.
- RAM 404 may act to communicate data and instructions bidirectionally with processor(s) 401.
- ROM 405 and RAM 404 may include any suitable tangible computer-readable media described below.
- a basic input/output system 406 (BIOS) including basic routines that help to transfer information between elements within computer system 400, such as during start-up, may be stored in the memory 403.
- Fixed storage 408 is connected bidirectionally to processor(s) 401, optionally through storage control unit 407.
- Fixed storage 408 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein.
- Storage 408 may be used to store operating system 409, executable(s) 410, data 411, applications 412 (application programs), and the like.
- Storage 408 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above.
- Information in storage 408 may, in appropriate cases, be incorporated as virtual memory in memory 403.
- storage device(s) 435 may be removably interfaced with computer system 400 (e.g., via an external port connector (not shown)) via a storage device interface 425.
- storage device(s) 435 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 400.
- software may reside, completely or partially, within a machine-readable medium on storage device(s) 435.
- software may reside, completely or partially, within processor(s) 401.
- Bus 440 connects a wide variety of subsystems.
- reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate.
- Bus 440 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures.
- such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, a HyperTransport (HTX) bus, a serial advanced technology attachment (SATA) bus, and any combinations thereof.
- Computer system 400 may also include an input device 433.
- a user of computer system 400 may enter commands and/or other information into computer system 400 via input device(s) 433.
- Examples of input device(s) 433 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof.
- the input device is a Kinect, Leap Motion, or the like.
- Input device(s) 433 may be interfaced to bus 440 via any of a variety of input interfaces 423 including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
- when computer system 400 is connected to network 430, computer system 400 may communicate with other devices connected to network 430, specifically mobile devices, enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like. Communications to and from computer system 400 may be sent through network interface 420.
- network interface 420 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 430, and computer system 400 may store the incoming communications in memory 403 for processing.
- Computer system 400 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 403 and communicate them to network 430 via network interface 420.
- Processor(s) 401 may access these communication packets stored in memory 403 for processing.
- Examples of the network interface 420 include, but are not limited to, a network interface card, a modem, and any combination thereof.
- Examples of a network 430 or network segment 430 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof.
- a network, such as network 430, may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
- Information and data can be displayed through a display 432.
- Examples of a display 432 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light-emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof.
- the display 432 can interface to the processor(s) 401, memory 403, and fixed storage 408, as well as other devices, such as input device(s) 433, via the bus 440.
- the display 432 is linked to the bus 440 via a video interface 422, and transport of data between the display 432 and the bus 440 can be controlled via the graphics control 421.
- the display is a video projector.
- the display is a head-mounted display (HMD) such as a VR headset.
- suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like.
- the display is a combination of devices such as those disclosed herein.
- computer system 400 may include one or more other peripheral output devices 434 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof.
- peripheral output devices may be connected to the bus 440 via an output interface 424.
- Examples of an output interface 424 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
- computer system 400 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein.
- Reference to software in this disclosure may encompass logic, and reference to logic may encompass software.
- reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate.
- the present disclosure encompasses any suitable combination of hardware, software, or both.
- DSP digital signal processor
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an ASIC.
- the ASIC may reside in a user terminal.
- the processor and the storage medium may reside as discrete components in a user terminal.
- suitable computing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles.
- Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of skill in the art.
- the computing device includes an operating system configured to perform executable instructions.
- the operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
- server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
- suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
- the operating system is provided by cloud computing.
- suitable mobile smartphone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
- suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®.
- video game console operating systems include, by way of non-limiting examples, Sony® PS3®, Sony® PS4®, Microsoft® Xbox 360®, Microsoft® Xbox One, Nintendo® Wii®, Nintendo® Wii U®, and Ouya®.
- Non-transitory computer readable storage medium
- the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device.
- a computer readable storage medium is a tangible component of a computing device.
- a computer readable storage medium is optionally removable from a computing device.
- a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like.
- the program and instructions are permanently, substantially permanently, semipermanently, or non-transitorily encoded on the media.
- the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same.
- a computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task.
- Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
- the functionality of the computer readable instructions may be combined or distributed as desired in various environments.
- a computer program comprises one sequence of instructions.
- a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
- the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same.
- software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art.
- the software modules disclosed herein are implemented in a multitude of ways.
- a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof.
- a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof.
- the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application.
- software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
- Databases
- the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same.
- suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase.
- a database is internet-based.
- a database is web-based.
- a database is cloud computing-based.
- a database is a distributed database.
- a database is based on one or more local computer storage devices.
- machine learning algorithms are utilized to generate a trained model or classifier configured to process input data comprising a plurality of features and generate an output indicative of a predicted outcome or classification.
- the plurality of features may include scores based on gene sets, for example, GSEA gene set enrichment scores, although metrics calculated based on gene sets are also contemplated.
- the machine learning algorithms herein employ one or more forms of labels including but not limited to human annotated labels and semi-supervised labels.
- the labels can be indicative of treatment outcomes for cancer patients.
- the labels may be indicative of response to immunotherapies.
- Examples of labels include complete response, partial response, stable disease, and progressive disease as measures of efficacy of a therapeutic intervention for a disease such as cancer.
- the machine learning algorithm utilizes regression modeling, wherein relationships between predictor variables and dependent variables are determined and weighted.
- the predicted outcome (e.g., responsiveness to an immunotherapy) is a dependent variable and is derived from a plurality of biological features such as GSEA enrichment scores.
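The regression modeling described above can be sketched as a logistic regression fitted by gradient descent, in which each predictor (for example, a gene set enrichment score) receives a learned weight and the dependent variable is a binary response label. This is only an illustration: the feature values, labels, learning rate, epoch count, and function names are assumptions and do not come from this disclosure.

```python
import math

def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical enrichment scores for two gene sets (rows = samples).
X = [[-1.0, 0.2], [-0.5, -0.1], [0.6, 0.1], [1.2, -0.2]]
# Hypothetical labels: 1 = responder, 0 = non-responder.
y = [0, 0, 1, 1]

weights, bias = fit_logistic(X, y)
```

In this toy data the first feature separates the two classes, so its fitted weight is positive and larger in magnitude than the second.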
- Examples of machine learning algorithms can include a support vector machine (SVM), a naive Bayes classification, a random forest, a neural network, deep learning, principal component analysis (PCA), or other supervised learning algorithm or unsupervised learning algorithm for classification and regression.
- the machine learning algorithms can be trained using one or more training datasets.
- a machine learning algorithm uses a supervised learning approach. In supervised learning, the algorithm generates a function from labeled training data. Each training example is a pair consisting of an input object and a desired output value. In some embodiments, an optimal scenario allows for the algorithm to correctly determine the class labels for unseen instances. In some embodiments, a supervised learning algorithm requires the user to determine one or more control parameters.
- supervised learning allows for a model or classifier to be generated or trained with training data in which the expected output is known in advance such as when the ground truth location for a communication is known.
- a machine learning algorithm uses an unsupervised learning approach.
- In unsupervised learning, the algorithm generates a function to describe hidden structures from unlabeled data (e.g., a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm.
- Approaches to unsupervised learning include: clustering, anomaly detection, and neural networks.
- a machine learning algorithm uses a semi-supervised learning approach.
- Semi-supervised learning combines both labeled and unlabeled data to generate an appropriate function or classifier.
- Semi-supervised learning is usually used in data augmentation.
- a machine learning algorithm uses a reinforcement learning approach. In reinforcement learning, the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
- a machine learning algorithm learns in batches based on the training dataset and other inputs for that batch. In other embodiments, the machine learning algorithm performs on-line learning where the weights and error calculations are constantly updated.
- a machine learning algorithm uses a transduction approach. Transduction is similar to supervised learning but does not explicitly construct a function. Instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs. In some embodiments, a machine learning algorithm uses a “learning to learn” approach. In learning to learn, the algorithm learns its own inductive bias based on previous experience. In some embodiments, a machine learning algorithm is applied to new or updated emergency data to be re-trained to generate a new prediction model. In some embodiments, a machine learning algorithm or model is re-trained periodically. In some embodiments, a machine learning algorithm or model is re-trained non-periodically.
- a machine learning algorithm or model is re-trained at least once a day, a week, a month, or a year or more. In some embodiments, a machine learning algorithm or model is re-trained at least once every 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 days or more.
- a machine learning algorithm is provided with unlabeled or unclassified data for unsupervised learning, which leaves the algorithm to identify hidden structure amongst the cases (e.g., clustering).
- unsupervised learning is used to identify the representations that are most useful for classifying raw data (e.g., identifying features that help separate subjects into separate cohorts that may be analyzed using different models and/or evaluated with different thresholds or rules).
- unsupervised learning is capable of identifying hidden patterns such as relationships between certain features from the data in the knowledge base that would not be readily apparent to a human.
- one or more sets of training data are generated and provided to a computer-implemented system comprising one or more algorithms for making predictions.
- an algorithm utilizes a predictive model such as a neural network, a decision tree, a support vector machine, or other applicable model.
- an algorithm is able to form a classifier for generating a classification or prediction according to relevant features.
- the features selected for classification can be classified using a variety of viable methods.
- the trained algorithm comprises a machine learning algorithm.
- the machine learning algorithm is selected from at least one of supervised, semi-supervised, and unsupervised learning algorithms, such as, for example, a support vector machine (SVM), a Naive Bayes classification, a random forest, an artificial neural network, a decision tree, K-means, learning vector quantization (LVQ), a regression algorithm (e.g., linear, logistic, multivariate), association rule learning, deep learning, dimensionality reduction, and ensemble selection algorithms.
- the machine learning algorithm is a support vector machine (SVM), a Naive Bayes classification, a random forest, or an artificial neural network.
- Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
- Tumor samples were obtained from subjects having HNSCC (Adkins), bladder cancer, and melanoma. RNA extraction was performed on the tumor samples and used for subsequent library generation using the Lexogen QuantSeq 3’ mRNA-Seq library Prep Kit FWD for Illumina. The mRNA library was subjected to next generation sequencing using the Illumina NextSeq sequencing platform to generate gene expression data.
- Single-sample gene set enrichment analysis (ssGSEA) was conducted according to gene sets derived from MSigDB, including KEGG and BioCarta. The 24 gene sets listed in Table 2 were subjected to GSEA to determine scores for each of the gene sets.
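As a rough illustration of how a single-sample enrichment score can be computed for one gene set, the sketch below ranks genes by expression within a sample and accumulates the difference between a weighted running sum over genes in the set and an unweighted running sum over genes outside it. It is a simplified score in the spirit of ssGSEA, not necessarily the procedure used in this example; the weighting exponent `alpha=0.25`, the gene names, and the expression values are assumptions.

```python
def ssgsea_like_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score (illustrative only).

    expression: dict mapping gene name -> expression value for one sample.
    gene_set:   set of gene names belonging to the gene set.
    """
    # Rank genes from highest to lowest expression within this one sample.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    # Rank-based weights: the top gene gets n**alpha, the last gene gets 1.
    weights = {g: (n - i) ** alpha for i, g in enumerate(ordered)}
    hits = [g for g in ordered if g in gene_set]
    if not hits or len(hits) == n:
        raise ValueError("gene set must overlap the sample without covering it")
    hit_total = sum(weights[g] for g in hits)
    miss_step = 1.0 / (n - len(hits))
    p_hit = p_miss = score = 0.0
    for g in ordered:
        if g in gene_set:
            p_hit += weights[g] / hit_total   # weighted step for set members
        else:
            p_miss += miss_step               # uniform step for non-members
        score += p_hit - p_miss               # integrate the running difference
    return score

# Hypothetical sample: six genes with decreasing expression.
sample = {"GENE_A": 6.0, "GENE_B": 5.0, "GENE_C": 4.0,
          "GENE_D": 3.0, "GENE_E": 2.0, "GENE_F": 1.0}

high = ssgsea_like_score(sample, {"GENE_A", "GENE_B"})  # set at top of ranking
low = ssgsea_like_score(sample, {"GENE_E", "GENE_F"})   # set at bottom of ranking
```

A gene set concentrated among the most highly expressed genes yields a positive score, and one concentrated among the least expressed genes yields a negative score. Repeating the calculation for each gene set gives one feature vector per sample.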
- the ssGSEA analysis produced a set of 24 enrichment scores for the 24 corresponding gene sets for the HNSCC tumor samples. These 24 enrichment scores of the tumor samples were used to train a machine learning model using linear principal component analysis (PCA) and support vector machine (SVM) methods in order to predict objective response and survival.
- the trained model (the “ssGSEA biomarker model”) was then evaluated for ability to predict treatment outcome. As shown in FIG. 1, the model was evaluated using an Out Of Bag Receiver Operating Characteristic (OOB ROC) analysis, which is a way to estimate model performance on data not used for training.
- AUC Area Under the Curve
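The out-of-bag idea can be sketched as follows: a classifier is repeatedly fitted on a bootstrap resample of the training set, each sample is scored only in rounds where it was left out of the resample, and the mean out-of-bag score per sample is then summarized by the area under the ROC curve. For brevity this sketch substitutes a nearest-centroid classifier for the PCA and SVM model described above; the data, round count, seed, and function names are assumptions.

```python
import random

def auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def oob_mean_scores(X, y, n_rounds=200, seed=0):
    """Mean out-of-bag score per sample across bootstrap rounds."""
    rng = random.Random(seed)
    n = len(X)
    collected = [[] for _ in range(n)]
    for _ in range(n_rounds):
        bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(bag)
        pos = [X[i] for i in bag if y[i] == 1]
        neg = [X[i] for i in bag if y[i] == 0]
        if not pos or not neg:
            continue  # both classes are needed to fit the stand-in classifier
        c_pos, c_neg = centroid(pos), centroid(neg)
        for i in range(n):
            if i not in in_bag:
                # Higher score means closer to the responder centroid.
                collected[i].append(dist2(X[i], c_neg) - dist2(X[i], c_pos))
    return [sum(s) / len(s) if s else None for s in collected]

# Hypothetical two-feature data: four non-responders, four responders.
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0],
     [0.9, 0.8], [1.0, 0.9], [0.8, 1.0], [1.1, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

mean_scores = oob_mean_scores(X, y)
# Drop any sample that was never out of bag before computing the AUC.
kept = [(s, lab) for s, lab in zip(mean_scores, y) if s is not None]
oob_auc = auc([s for s, _ in kept], [lab for _, lab in kept])
```

Because every score comes from a model that did not see that sample, the resulting AUC approximates performance on held-out data without setting aside a separate test set.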
- FIG. 2 is a plot showing the mean scores of individual samples in the training set (on average across OOB samplings). These data show a 96% negative predictive value (NPV) and 93% sensitivity (SN).
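For reference, negative predictive value and sensitivity follow directly from the counts of true and false predictions: NPV = TN / (TN + FN) and sensitivity = TP / (TP + FN). The short sketch below computes both from paired predictions and outcomes; the example calls are hypothetical and are not the data behind FIG. 2.

```python
def npv_and_sensitivity(predicted, actual):
    """Return (NPV, sensitivity) for binary predictions (1 = responder)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    npv = tn / (tn + fn)          # share of negative calls that are correct
    sensitivity = tp / (tp + fn)  # share of true responders that are detected
    return npv, sensitivity

# Hypothetical calls for eight samples.
predicted = [1, 1, 1, 0, 0, 0, 0, 0]
actual = [1, 1, 0, 0, 0, 0, 0, 1]

npv, sensitivity = npv_and_sensitivity(predicted, actual)
```

A high NPV means that a patient scored as an unlikely responder rarely turns out to respond, which is the property that matters when a low score is used to argue against a treatment.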
- the ssGSEA biomarker model applied to the training set has the performance shown in Table 4.
- the disease control rate (DCR) is the percentage of patients who had a treatment response (e.g., patients who achieved complete response, partial response, or stable disease in response to treatment) and is similar to “likelihood of response”.
- I/O immune-oncology
- the output scores were grouped into four quartiles Q1, Q2, Q3, and Q4, with Q1 having the lowest 25% of scores and Q4 having the highest 25% of scores. The lower the score, the lower the anticipated benefit of the drug, as evidenced by the correlation between quartile and DCR.
- the Q1 and Q2 divisions show a low DCR (less than 10%), whereas Q3 and Q4 have a high DCR (greater than about 40%).
- the expected DCR in response to I/O treatment for HNSCC patients is about 30%. Therefore, if a patient’s sample has a high score and the score falls into Q3 or Q4, physicians may recommend I/O treatment, as HNSCC patients in these categories have a DCR in response to I/O of greater than about 40%.
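The quartile analysis described above can be sketched as follows: model scores are split at their quartile cut points, and the DCR within each quartile is the percentage of patients whose best response was complete response (CR), partial response (PR), or stable disease (SD). The scores, outcome labels, and function name below are hypothetical and are not the values reported in Table 4.

```python
from statistics import quantiles

def dcr_by_quartile(scores, outcomes):
    """Disease control rate (percent) per score quartile, Q1 = lowest scores."""
    cuts = quantiles(scores, n=4)  # three cut points: Q1/Q2, Q2/Q3, Q3/Q4
    groups = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for score, outcome in zip(scores, outcomes):
        index = sum(score > c for c in cuts)  # number of cut points exceeded
        groups[f"Q{index + 1}"].append(outcome in {"CR", "PR", "SD"})
    return {q: (100.0 * sum(v) / len(v) if v else None)
            for q, v in groups.items()}

# Hypothetical model scores and best responses (PD = progressive disease).
scores = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = ["PD", "PD", "PD", "PD", "SD", "PD", "PR", "CR"]

dcr = dcr_by_quartile(scores, outcomes)
```

With this toy input the two lowest quartiles have a DCR of 0% and the two highest have DCRs of 50% and 100%, mirroring the qualitative pattern in which higher scores correspond to higher disease control rates.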
- The HNSCC ssGSEA biomarker model achieved superior results compared to a model based on the clinically used biomarker PD-L1 (FIG. 6).
- the present disclosure provides a method according to the following embodiments:
- Embodiment 1 A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features, thereby generating an output; and generating a determination indicative of a treatment outcome based on the output.
- Embodiment 2 The method of embodiment 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
- Embodiment 3 The method of embodiment 2, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, interferon gamma, antigen presentation, T-cell exhaustion, or any combination thereof.
- Embodiment 4 The method of embodiment 2, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
- Embodiment 5 The method of embodiment 2, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets from a molecular signature database (MSigDB).
- Embodiment 6 The method of embodiment 5, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
- Embodiment 7. The method of embodiment 1, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
- Embodiment 8 The method of embodiment 1, further comprising obtaining the biological sample of said subject.
- Embodiment 9 The method of embodiment 8, wherein said biological sample is a solid tumor or liquid biopsy.
- Embodiment 10 The method of embodiment 8, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
- Embodiment 11 The method of embodiment 8, wherein said biological sample comprises cancer tissue.
- Embodiment 12 The method of embodiment 11, wherein said cancer tissue comprises tumor-infiltrating immune cells.
- Embodiment 13 The method of embodiment 11, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
- Embodiment 14 The method of embodiment 1, further comprising processing said biological sample to prevent or inhibit tissue degradation.
- Embodiment 15 The method of embodiment 14, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
- Embodiment 16 The method of embodiment 1, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
- Embodiment 17 The method of embodiment 16, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
- Embodiment 18 The method of embodiment 1, wherein said disease or condition is cancer.
- Embodiment 19 The method of embodiment 18, wherein said cancer is a solid cancer or a hematopoietic cancer.
- Embodiment 20 The method of embodiment 18, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
- Embodiment 21 The method of embodiment 20, further comprising selecting said subject for prediction of said treatment outcome based on said status.
- Embodiment 22 The method of embodiment 21, wherein said treatment outcome corresponds to one or more cancer treatments.
- Embodiment 23 The method of embodiment 22, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- Embodiment 24 The method of embodiment 22, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
- Embodiment 25 The method of embodiment 24, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
- Embodiment 26 A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects have a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
- Embodiment 27 The method of embodiment 26, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
- Embodiment 28 The method of embodiment 26, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
Abstract
Disclosed herein are platforms, systems, methods, and media for analyzing gene expression data to predict treatment outcome for cancer patients. Machine learning models can be used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.
Description
MACHINE LEARNING SYSTEMS AND METHODS FOR GENE SET ENRICHMENT ANALYSIS AND SCORING
CROSS REFERENCE
[0001] This application claims the benefit of U.S. Provisional App. No. 63/346,718, filed on May 27, 2022, which is incorporated by reference in its entirety herein.
BACKGROUND
[0002] Cancer is a complex group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Millions of new cases of cancer occur globally each year. Understanding the immune and tumor profile may help with diagnosis and treatment.
SUMMARY
[0003] Disclosed herein, in some embodiments, are systems and methods for analyzing complex data signals using artificial intelligence or machine learning algorithms to determine output pertaining to the state or status of one or more parameters.
[0004] In one aspect, the present disclosure provides a method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output, wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
[0005] In some embodiments, the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway. In some embodiments, the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof. In some embodiments, the plurality of gene sets comprises 1, 2, 3, 4, 5, or 6 gene sets listed in Table 1. In some embodiments, the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
[0006] In some embodiments, the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
[0007] In some embodiments, the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database. In some embodiments, the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0008] In some embodiments, the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0009] In some embodiments, the method disclosed herein further comprises obtaining the biological sample of said subject. In some embodiments, the biological sample is a solid tumor or liquid biopsy. In some embodiments, the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample. In some embodiments, the biological sample comprises cancer tissue. In some embodiments, the cancer tissue comprises tumor-infiltrating immune cells. In some embodiments, the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0010] In some embodiments, the method disclosed herein further comprises processing said biological sample to prevent or inhibit tissue degradation. In some embodiments, the biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0011] In some embodiments, the method disclosed herein further comprises extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data. In some embodiments, the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0012] In some embodiments, the disease or condition is cancer. In some embodiments, the cancer is a solid cancer or a hematopoietic cancer. In some embodiments, the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0013] In some embodiments, the method disclosed herein further comprises selecting said subject for prediction of said treatment outcome based on said status. In some embodiments, the treatment outcome corresponds to one or more cancer treatments. In some embodiments, the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
[0014] In some embodiments, the method disclosed herein further comprises selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
[0015] In some embodiments, the subject is treated based at least on said determination indicative of said treatment outcome.
[0016] In some embodiments, the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
[0017] Also provided herein is a computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising a processor and non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: (i) obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; (ii) conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; (iii) process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and (iv) generate a determination indicative of the treatment outcome based on the output.
[0018] In some embodiments, the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway. In some embodiments, the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
[0019] In some embodiments, the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1. In some embodiments, the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2. In some embodiments, the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
[0020] In some embodiments, the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database. In some embodiments, the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0021] In some embodiments, the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0022] In some embodiments, the processor is configured to obtain the gene expression data for the biological sample of said subject from a database. In some embodiments, the biological sample is a solid tumor or liquid biopsy. In some embodiments, the biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample. In some embodiments, the biological sample comprises cancer tissue. In some embodiments, the cancer tissue comprises tumor-infiltrating immune cells. In some embodiments, the biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0023] In some embodiments, the biological sample is processed to prevent or inhibit tissue degradation. In some embodiments, the biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0024] In some embodiments, the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data. In some embodiments, the RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0025] In some embodiments, the disease or condition is cancer. In some embodiments, the cancer is a solid cancer or a hematopoietic cancer. In some embodiments, the cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0026] In some embodiments, the subject is selected for prediction of said treatment outcome based on said status. In some embodiments, the treatment outcome corresponds to one or more cancer treatments. In some embodiments, the one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy. In some embodiments, the subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
[0027] In some embodiments, the subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
[0028] In some embodiments, the subject is treated based at least on said determination indicative of said treatment outcome.
[0029] In some embodiments, the subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
[0030] Also provided herein is a method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
[0031] In some embodiments, the plurality of biological samples is obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
[0032] In some embodiments, the method further comprises configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
[0033] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0035] The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:
[0036] FIG. 1 shows a receiver operating characteristic (ROC) curve of false positive rate (FPR) vs. true positive rate (TPR) of a machine learning model trained on a gene set enrichment analysis (GSEA) training set for clinical outcome according to one or more embodiments herein. Bootstrapped datasets, each containing a subset of the whole HNSCC dataset and leaving out the remainder for testing, were iteratively generated. On each iteration, a model was generated with the bootstrapped dataset, and an AUC value was calculated from the held-out test set. Shown is the mean of all the AUCs calculated. Shading represents the confidence intervals.
[0037] FIG. 2 shows a graph of training samples across out of bag (OOB) samplings of a GSEA training set for clinical outcome according to one or more embodiments herein;
[0038] FIG. 3 shows a graph of a percentage of patients that had a response to treatment (disease control rate, DCR) per score division (quartile) of a GSEA training set for clinical outcome according to one or more embodiments herein;
[0039] FIG. 4 shows a non-limiting example of a computing device; in this case, a device with one or more processors, memory, storage, and a network interface; and
[0040] FIG. 5 shows a non-limiting example of a workflow for processing a biological sample and using gene set enrichment analysis and machine learning model to predict a response to therapy or a treatment outcome.
[0041] FIG. 6 shows ROC curves of false positive rate (1 - specificity) vs. true positive rate (sensitivity) for two models: a single-sample GSEA (ssGSEA) biomarker model and a PD-L1 biomarker model. The ssGSEA model shows better performance on a Head and Neck Squamous Cell Carcinoma (HNSCC) dataset than does the clinically used PD-L1 biomarker model. For each model, the mean OOB prediction scores of the samples used to train the model are used to build a single ROC curve with a single AUC value. FIG. 1 and FIG. 6 use slightly different forms of the GSEA biomarker model based on the same dataset.
DETAILED DESCRIPTION
[0042] Disclosed herein are platforms, systems, methods, and media for analyzing gene expression data to predict treatment outcome for cancer patients. Machine learning models can be trained and used to evaluate enrichment scores derived from gene set enrichment analysis to provide accurate predictions.
[0043] While some approaches seek to identify the fraction or amount of immune cell types that have infiltrated a tumor sample and leverage this information to make predictions, such approaches tend to rely on deconvolution algorithms. The expression level of one or more biomarker genes may be quantified directly and combined with immune cell information to make up a feature set for statistical analysis. By contrast, the instant disclosure includes the discovery that a computationally simpler and more coherent approach using gene set enrichment analysis can provide accurate predictions without relying on such algorithms for quantifying immune cells within a sample. Thus, gene set enrichment scores may be directly used as features in a machine learning model to predict treatment outcome (e.g., response to immunotherapy) without going through an unnecessary intermediate step of deconvolving gene expression data to quantify immune cells and then using the quantified numbers as input features. Instead of using features corresponding to individual genetic biomarkers, this approach of using gene sets has demonstrated surprisingly accurate performance across multiple cancer types such as HNSCC, in which it has achieved superior results compared to the clinically used biomarker programmed death-ligand 1 (PD-L1, also known as CD274) (FIG. 6). PD-L1 inhibits the adaptive immune response and is often expressed at high levels in cancer cells, and therefore was proposed as a potential target for cancer immunotherapy in the clinic.
[0044] The systems and methods disclosed herein can provide highly accurate evaluations or determinations indicative of an outcome. Examples of performance metrics include accuracy, specificity, sensitivity, positive predictive value, negative predictive value, and receiver operating characteristic/ area under receiver operating characteristic (ROC/AUROC). Any combination of these metrics may be determined for a machine learning model or classifier by testing it against a set of independent samples. True positive (TP) is a positive test result that detects the condition when the condition is present (e.g., positive response to cancer treatment). True negative (TN) is a negative test result that does not detect the condition when the condition is absent. False positive (FP) is a test result that detects the condition when the condition is absent. False negative (FN) is a test result that does not detect the condition when the condition is present. The performance metrics of accuracy, specificity, sensitivity, positive predictive value, and negative predictive value can then be defined according to the following formulas:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Specificity (“true negative rate”) = TN / (TN + FP)
Sensitivity (“true positive rate”) = TP / (TP + FN)
Positive predictive value (PPV or “precision”) = TP / (TP + FP)
Negative predictive value (NPV) = TN / (TN + FN)
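As a non-limiting illustration, the formulas above can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the five performance metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "specificity": tn / (tn + fp),   # true negative rate
        "sensitivity": tp / (tp + fn),   # true positive rate
        "ppv": tp / (tp + fp),           # precision
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for illustration only (not data from this disclosure)
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```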
[0045] The AUROC can be determined by creating the ROC curve, which entails plotting the true positive rate (TPR, i.e., sensitivity) against the false positive rate (FPR, i.e., 1 - specificity) across classification thresholds, and then computing the area under that curve; the AUROC varies between 0 and 1. A sample may be evaluated according to the systems and methods disclosed herein to generate an evaluation or determination, such as a prediction of treatment outcome, that meets a minimum threshold of performance. In some cases, the analytical algorithm or module (e.g., comprising a machine learning model) has an accuracy of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a specificity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a sensitivity of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has a PPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an NPV of at least about 50%, 60%, 70%, 80%, 90%, or 95%. In some cases, the analytical algorithm or module has an AUROC of at least about 0.6, 0.7, 0.8, 0.85, 0.9, or 0.95.
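As a non-limiting illustration of the AUROC determination described above, the following Python sketch sweeps a decision threshold over predicted scores to obtain ROC points and integrates them with the trapezoidal rule. The function names, labels, and scores are hypothetical; labels are assumed to be 1 for the positive class (e.g., responder) and 0 otherwise.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ordered = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume every sample tied at this score before emitting a point
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auroc(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and model scores for illustration only
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_value = auroc(roc_points(labels, scores))
print(f"AUROC = {auc_value:.3f}")
```

In this toy input, eight of the nine responder/non-responder pairs are ranked correctly, so the area equals 8/9.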
[0046] In some embodiments, the methods disclosed herein comprise processing a biological sample to obtain gene expression data and performing gene set enrichment analysis on the gene expression data to generate an evaluation or prediction of outcome.
[0047] An illustrative and non-limiting embodiment of a workflow process is depicted in FIG. 5. In a first step a biological sample is processed to extract RNA 501. The biological sample may be a formalin-fixed paraffin-embedded (FFPE) sample. Next, the extracted RNA is used to generate an mRNA-Seq library 502. Various suitable methods may be used for library generation including commercial kits such as, for example, the QuantSeq 3’ mRNA-Seq library prep kit. Next Generation Sequencing is then performed on the library 503. Various suitable platforms may be used for the sequencing, for example, the NextSeq platform by Illumina. Next, single-sample Gene Set Enrichment Analysis (ssGSEA) is performed on the gene expression data generated from the sequencing 504. Various gene sets may be used including independently curated gene sets as well as from public databases such as, for example, gene sets obtained from MSigDB. The gene sets can be derived from various collections such as hallmark gene sets, positional gene sets, curated gene sets, chemical and genetic perturbations, canonical pathways, regulatory target, microRNA targets, transcription factor targets, computational gene sets, cancer gene neighborhoods, cancer modules, ontology gene sets, Gene Ontology derived gene sets, oncogenic signature gene sets, immunologic signature gene sets, cell type signature gene sets, or any combination thereof. Subsets of the canonical pathways gene sets include gene sets derived from BioCarta pathway database, KEGG pathway database, PID pathway database, Reactome pathway database, and WikiPathways pathway database. The ssGSEA can be used to generate an output corresponding to the gene sets that have been evaluated using the gene expression data. The output can be a metric or a score, for example, an enrichment score for each gene set. Accordingly, the machine learning model analyzes the enrichment scores corresponding to the gene sets to predict a response to therapy 505. 
Non-limiting examples of gene sets suitable for use according to the systems and methods disclosed herein are provided in Table 2.
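As a non-limiting illustration of how a per-sample enrichment score may be obtained for one gene set, the following Python sketch follows the commonly described ssGSEA formulation: genes are ranked by expression within the sample, and the score sums the difference between a rank-weighted running distribution of genes in the set and a uniform running distribution of genes outside it. This is a simplified sketch under stated assumptions, not the implementation used in this disclosure. The weighting exponent `alpha=0.25`, the function name, the gene names, and the expression values are assumptions for illustration, and normalization steps applied by published tools are omitted.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one gene set.

    expression: dict mapping gene name -> expression value for one sample
    gene_set:   set of gene names
    """
    # Order genes from highest to lowest expression within the sample
    genes = sorted(expression, key=lambda g: expression[g], reverse=True)
    n = len(genes)
    # Highest-expressed gene receives rank n, lowest receives rank 1
    ranks = {g: n - i for i, g in enumerate(genes)}
    in_set = [g in gene_set for g in genes]
    n_hit = sum(in_set)
    if n_hit == 0 or n_hit == n:
        raise ValueError("gene set must overlap the sample without covering every gene")

    hit_total = sum(ranks[g] ** alpha for g, hit in zip(genes, in_set) if hit)
    miss_step = 1.0 / (n - n_hit)

    p_hit = p_miss = 0.0
    score = 0.0
    for g, hit in zip(genes, in_set):
        if hit:
            p_hit += ranks[g] ** alpha / hit_total
        else:
            p_miss += miss_step
        # Accumulate the gap between the two running distributions
        score += p_hit - p_miss
    return score


# Hypothetical expression values and gene sets for illustration only
sample = {"A": 10.0, "B": 8.0, "C": 6.0, "D": 4.0, "E": 2.0, "F": 1.0}
score_high = ssgsea_score(sample, {"A", "B"})  # set of highly expressed genes
score_low = ssgsea_score(sample, {"E", "F"})   # set of lowly expressed genes
print(score_high, score_low)
```

A gene set whose members sit near the top of the ranking yields a positive score, and one whose members sit near the bottom yields a negative score; one such score per gene set per sample forms the feature vector supplied to the model.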
[0048] In this example, each enrichment score for a gene set forms a feature that makes up part of the input to the trained machine learning model. The response to therapy can be any suitable metric, indicator, or classification. For example, a regression model may output a number between 0 and 1 indicative of responsiveness to therapy. Alternatively, a classifier may generate a classification between two or more categories such as, for example, response to treatment, partial response to treatment, no response to treatment, survival, etc. Various suitable machine learning models can be used. Support vector machine (SVM) is suitable for both regression and classification analysis and can provide a high level of accuracy without requiring significant computing power. Alternatively, or in combination, principal component analysis (PCA) can be used to analyze the ssGSEA output. In some cases, the machine learning model is configured to process at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 gene set metrics (e.g., enrichment scores). In some cases, the machine learning model is configured to process no more than 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, or 500 gene sets. In some cases, each gene set independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes. In some cases, each gene set independently comprises no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes. As an illustrative example, gene set enrichment analysis may be performed on 10 different gene sets, one of which has 10 genes and one of which has 200 genes. In some cases, the systems and methods disclosed herein utilize 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 gene sets, each of which independently comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes and/or no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 350, 400, 450, or 500 genes.
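As a non-limiting illustration of using enrichment scores as input features to a classifier, the following Python sketch trains a toy linear support vector machine by subgradient descent on the regularized hinge loss. It is a teaching sketch only and is not the model of this disclosure; in practice an established library implementation would typically be used. The function names, hyperparameters (`lr`, `lam`, `epochs`), and the two-feature enrichment scores are hypothetical.

```python
def train_linear_svm(features, labels, lr=0.1, lam=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    features: list of enrichment-score vectors, one per sample
    labels:   +1 (e.g., responder) or -1 (e.g., non-responder) per sample
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample violates the margin: step toward classifying it correctly
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: apply only the regularization shrinkage
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical enrichment scores for two gene sets per sample (illustration only)
features = [
    [0.8, 0.6], [0.7, 0.9], [0.9, 0.7],        # responders
    [-0.5, -0.4], [-0.6, -0.8], [-0.3, -0.7],  # non-responders
]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(features, labels)
predictions = [predict(w, b, x) for x in features]
print(predictions)
```

Each column of `features` corresponds to one gene set, so adding gene sets widens the feature vector without changing the training procedure.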
[0049] In some embodiments, one or more of the gene sets used as features in the predictive model (e.g., machine learning model) do not utilize the full list of genes within a known gene set. For example, fewer than 100% of the genes in a given gene set may be used for calculating an output or metric for that gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set. In some cases, for a given gene set (such as any one or more of those listed in Table 2), no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the genes listed for the gene set are used to calculate the output or metric for the gene set. For instance, gene set 1 (“HALLMARK EPITHELIAL MESENCHYMAL TRANSITION”) from Table 2 includes 200 genes associated with epithelial to mesenchymal transition. As an illustrative example, the gene set enrichment analysis performed according to the systems and methods disclosed herein may utilize 50% of the genes in this gene set with respect to epithelial to mesenchymal transition in combination with certain independently determined percentages of other gene sets in Table 1.
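As a non-limiting illustration of using only a fraction of a gene set, the following Python sketch selects a given percentage of the genes in a set before enrichment scoring. The disclosure does not specify how the subset is chosen; seeded random selection is an assumption made here purely for illustration, as are the function name and the placeholder gene identifiers.

```python
import random


def subsample_gene_set(genes, fraction, seed=0):
    """Return a reproducible subset containing roughly `fraction` of the genes."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    genes = sorted(genes)                       # fixed order for reproducibility
    k = max(1, round(len(genes) * fraction))    # keep at least one gene
    rng = random.Random(seed)
    return sorted(rng.sample(genes, k))


# Placeholder identifiers standing in for a 200-gene set (illustration only)
full_set = [f"GENE_{i:03d}" for i in range(200)]
half_set = subsample_gene_set(full_set, 0.5)
print(len(half_set))
```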
[0050] As discussed above, any combination of genes within each gene set may be used for gene set enrichment analysis to generate a corresponding output metric such as an enrichment score. Then the output for a plurality of gene sets can be used as input features provided to a machine learning algorithm or model to generate a composite score indicative of a prediction such as an outcome or treatment outcome. The identities of the genes making up each gene set listed in Table 2 can be found on the publicly accessible database MSigDB and are also listed in Table 3, which shows the gene member identification used by MSigDB alongside the corresponding NCBI Gene ID and Gene Symbol.
[0051] The output can make up the features that are processed using an algorithm such as a trained model generated using machine learning to generate an evaluation such as a predicted treatment outcome.
[0052] Provided herein are systems and methods for generating a predicted treatment outcome from a sample of a subject. In some instances, the subject has or is suspected of having a disease or disorder. The disease or disorder can be a cancer. In some instances, the predicted treatment outcome is for an immunotherapy targeting a cancer.
[0053] In some instances, the methods disclosed herein comprise obtaining a sample from a subject. In some instances, the sample is any fluid or other material derived from the body of a normal or diseased subject including, but not limited to, blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, milk, amniotic fluid, bile, ascites fluid, organ or tissue extract, and culture fluid in which any cells or tissue preparation from a subject has been incubated. In some instances, the sample is obtained from skin, blood, brain, bladder, bone, bone marrow, breast, colon, stomach, esophagus, ovary, uterus, gallbladder, fallopian tube, testicle, kidney, liver, pancreas, adrenal gland, cervix, endometrium, head or neck, lung, prostate, thymus, thyroid, lymph node, or urinary bladder. In some instances, the sample is a cancer sample or biopsy. The cancer sample is typically a solid tumor sample or a liquid tumor sample. For example, the cancer sample can be obtained from excised tissue. In some instances, the sample is fresh, frozen, or fixed. In some instances, a fixed sample is paraffin-embedded or fixed with formalin, formaldehyde, or glutaraldehyde. In some instances, the sample is formalin-fixed paraffin-embedded.
[0054] In some instances, the sample is stored after it has been collected, but before additional steps are to be performed. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. In some instances, the sample is stored in a solution comprising glycerol, glycol, dimethyl sulfoxide, growth media, nutrient broth or any combination thereof. The sample may be stored for any suitable period of time. In some instances, the sample is stored for any period of time and remains suitable for downstream applications. For example, the sample is stored for any period of time before nucleic acid (e.g., ribonucleic acid (RNA) or deoxyribonucleic acid (DNA)) extraction. In some instances, the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months. In some instances, the sample is stored for at least 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, 11 years, 12 years, or more than 12 years.
[0055] Methods and systems as described herein comprise generating an immune-oncology profile from a sample of a subject, wherein the sample comprises a nucleic acid molecule. In some instances, the nucleic acid molecule is RNA, DNA, fragments, or combinations thereof. In some instances, after a sample is obtained, the sample is processed further before analysis. In some instances, the sample is processed to extract the nucleic acid molecule from the sample. In some instances, no extraction or processing procedures are performed on the sample. In some instances, the nucleic acid is extracted using any technique that does not interfere with subsequent analysis. Extraction techniques include, for example, alcohol precipitation using ethanol, methanol or isopropyl alcohol. In some instances, extraction techniques use phenol, chloroform, or any combination thereof. In some instances, extraction techniques use a column or resin based nucleic acid purification scheme such as those commonly sold commercially. In some instances, following extractions, the nucleic acid molecule is purified. In some instances, the nucleic acid molecule is further processed. For example, following extraction and purification, RNA is further reverse transcribed to cDNA. In some instances, processing of the nucleic acid comprises amplification. Following extraction or processing, in some instances, the nucleic acid is stored in water, Tris buffer, or Tris-EDTA buffer before subsequent analysis. In some instances, the sample is stored at less than 8° C. In some instances, the sample is stored at less than 4° C. In some instances, the sample is stored at less than 0° C. In some instances, the sample is stored at less than -20° C. In some instances, the sample is stored at less than -70° C. 
In some instances, the sample is stored for at least or about 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 1 week, 2 weeks, 3 weeks, 4 weeks, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months, or more than 12 months.
[0056] A nucleic acid molecule obtained from a sample may be characterized by factors such as integrity of the nucleic acid molecule or size of the nucleic acid molecule. In some instances, the nucleic acid molecule is DNA. In some instances, the nucleic acid molecule is RNA. In some instances, the RNA or DNA comprises a specific integrity. For example, the RNA integrity number (RIN) of the RNA is no more than about 2. In some instances, the RNA molecules in a sample have a RIN of about 2 to about 10. In some instances, the RNA molecules in a sample have a RIN of at least about 2. In some instances, the RNA molecules in a sample have a RIN of at most about 10. In some instances, the RNA molecules in a sample have a RIN of about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 7, about 2 to about 8, about 2 to about 9, about 2 to about 10, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 7, about 3 to about 8, about 3 to about 9, about 3 to about 10, about 4 to about 5, about 4 to about 6, about 4 to about 7, about 4 to about 8, about 4 to about 9, about 4 to about 10, about 5 to about 6, about 5 to about 7, about 5 to about 8, about 5 to about 9, about 5 to about 10, about 6 to about 7, about 6 to about 8, about 6 to about 9, about 6 to about 10, about 7 to about 8, about 7 to about 9, about 7 to about 10, about 8 to about 9, about 8 to about 10, or about 9 to about 10. The RNA molecule in a sample may be characterized by size. In some instances, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%, or more of the RNA molecules in a sample are at least 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, or more than 400 nucleotides in size. In some instances, the RNA molecules in the sample are at least 200 nucleotides in size. In some instances, the RNA molecules of at least 200 nucleotides in size comprise a percentage of the sample (DV200). 
For example, the percentage is at least or about 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more than 95%. In some instances, the RNA molecules in a sample have a DV200 value of about 10% to about 90%. In some instances, the RNA molecules in a sample have a DV200 value of at least about 10%. In some instances, the RNA molecules in a sample have a DV200 value of at most about 90%. In some instances, the RNA molecules in a sample have a DV200 value of about 10% to about 20%, about 10% to about 30%, about 10% to about 40%, about 10% to about 50%, about 10% to about 60%, about 10% to about 70%, about 10% to about 80%, about 10% to about 90%, about 20% to about 30%, about 20% to about 40%, about 20% to about 50%, about 20% to about 60%, about 20% to about 70%, about 20% to about 80%, about 20% to about 90%, about 30% to about 40%, about 30% to about 50%, about 30% to about 60%, about 30% to about 70%, about 30% to about 80%, about 30% to about 90%, about 40% to about 50%, about 40% to about 60%, about 40% to about 70%, about 40% to about 80%, about 40% to about 90%, about 50% to about 60%, about 50% to about 70%,
about 50% to about 80%, about 50% to about 90%, about 60% to about 70%, about 60% to about 80%, about 60% to about 90%, about 70% to about 80%, about 70% to about 90%, or about 80% to about 90%.
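As an illustration of the DV200 metric described above, the following is a minimal sketch, assuming fragment sizes are available as a simple list of lengths. In practice DV200 is typically derived from an electropherogram trace rather than from counted fragments, so the count-based calculation here is a simplification. The function names `dv200` and `passes_quality` and the default cutoffs (taken from the lower bounds recited above) are hypothetical and are not part of the disclosure.

```python
from typing import Sequence


def dv200(fragment_lengths: Sequence[int], threshold: int = 200) -> float:
    """Return the percentage of fragments at least `threshold` nucleotides long.

    Simplifying assumption: each fragment counts equally (count-based,
    not signal-weighted as in an instrument trace).
    """
    if not fragment_lengths:
        raise ValueError("fragment_lengths must not be empty")
    long_enough = sum(1 for n in fragment_lengths if n >= threshold)
    return 100.0 * long_enough / len(fragment_lengths)


def passes_quality(rin: float, dv200_pct: float,
                   min_rin: float = 2.0, min_dv200: float = 10.0) -> bool:
    """Hypothetical gate: accept a sample meeting both illustrative minimums."""
    return rin >= min_rin and dv200_pct >= min_dv200
```

For example, a sample whose fragments measure 100, 150, 200, 250, and 400 nucleotides has three of five fragments at or above 200 nucleotides, which this sketch reports as a DV200 of 60%.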
[0057] In some instances, after the samples have been obtained and nucleic acid molecule isolated, the nucleic acid molecule is prepared for sequencing. In some instances, a sequencing library is prepared. Numerous library generation methods have been described. In some instances, methods for library generation comprise addition of a sequencing adapter. Sequencing adapters may be added to the nucleic acid molecule by ligation. In some instances, library generation comprises an end-repair reaction.
[0058] Sometimes, library generation for sequencing comprises an enrichment step. For example, coding regions of the mRNA are enriched. In some instances, the enrichment step is for a subset of genes. In some instances, the enrichment step comprises using a bait set. The bait set may be used to enrich for genes used for specific downstream applications. A bait set generally refers to a set of baits targeted toward a selected set of genomic regions of interest. For example, a bait set may be selected for genomic regions relating to at least one of immune modulatory molecule expression, cell type and ratio, or mutational burden. In some instances, one bait set is used for determining immune modulatory molecule expression, a second bait set is used for determining cell type and ratio, and a third bait set is used for determining mutational burden. In some instances, the same bait set is used for determining immune modulatory molecule expression, cell type and ratio, mutational burden, or combinations thereof. In some instances, a bait set comprises at least one unique molecular identifier (UMI). The term “unique molecular identifier (UMI)” or “UMI” as used herein refers to nucleic acid having a sequence which can be used to identify and/or distinguish one or more first molecules to which the UMI is conjugated from one or more second molecules. In some instances, the UMI is conjugated to one or more target molecules of interest or amplification products thereof. UMIs may be single or double stranded.
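One way a UMI can distinguish original molecules from amplification duplicates is sketched below. This is a minimal illustration, assuming each read has already been assigned to a gene and carries its UMI sequence; the function name `count_unique_molecules` and the `(gene, umi)` tuple layout are hypothetical. UMIs are compared by exact match only, so sequencing errors within a UMI are not corrected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Read = Tuple[str, str]  # (gene name, UMI sequence); hypothetical layout


def count_unique_molecules(reads: Iterable[Read]) -> Dict[str, int]:
    """Count distinct UMIs per gene so PCR duplicates are counted once."""
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}
```

Reads that share both a gene and a UMI are treated as copies of the same starting molecule, while the same UMI observed on a different gene is counted separately.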
[0059] The systems and methods disclosed herein provide for the sequencing of a number of genes. In some instances, the number of genes is at least about 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, or more than 10000 genes. In some instances, the number of genes to be sequenced is in a range of about 500 to about 1000 genes. In some instances, the number of genes to be sequenced is at least about 200. In some instances, the number of genes to be sequenced is at most about 10,000. In some instances, the number of genes to be sequenced is in a range of about 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 6,000, 200 to 8,000, 200 to 10,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 6,000, 500 to 8,000, 500 to 10,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 6,000, 1,000 to 8,000, 1,000 to 10,000, 2,000 to 4,000, 2,000 to 6,000, 2,000 to 8,000, 2,000 to 10,000, 4,000 to 6,000, 4,000 to 8,000, 4,000 to 10,000, 6,000 to 8,000, 6,000 to 10,000, or 8,000 to 10,000.
[0060] Sequencing may be performed with any appropriate sequencing technology. Examples of sequencing methods include, but are not limited to single molecule real-time sequencing, Polony sequencing, sequencing by ligation, reversible terminator sequencing, proton detection sequencing, ion semiconductor sequencing, nanopore sequencing, electronic sequencing, pyrosequencing, Maxam-Gilbert sequencing, chain termination (e.g., Sanger) sequencing, +S sequencing, or sequencing by synthesis.
[0061] Sequencing methods may include, but are not limited to, one or more of: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, and primer walking. Sequencing may generate sequencing reads (“reads”), which may be processed (e.g., alignment) to yield longer sequences, such as consensus sequences. Such sequences may be compared to references (e.g., a reference genome or control) to identify variants, for example.
[0062] An average read length from sequencing may vary. In some instances, the average read length is at least about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, or more than 80000 base pairs. In some instances, the average read length is in a range of about 100 to 80,000. In some instances, the average read length is at least about 100. In some instances, the average read length is at most about 80,000. 
In some instances, the average read length is in a range of about 100 to 200, 100 to 300, 100 to 500, 100 to 1,000, 100 to 2,000, 100 to 4,000, 100 to 8,000, 100 to 10,000, 100 to 20,000, 100 to 40,000, 100 to 80,000, 200 to 300, 200 to 500, 200 to 1,000, 200 to 2,000, 200 to 4,000, 200 to 8,000, 200 to 10,000, 200 to 20,000, 200 to 40,000, 200 to 80,000, 300 to 500, 300 to 1,000, 300 to 2,000, 300 to 4,000, 300 to 8,000, 300 to 10,000, 300 to 20,000, 300 to 40,000, 300 to 80,000, 500 to 1,000, 500 to 2,000, 500 to 4,000, 500 to 8,000, 500 to 10,000, 500 to 20,000, 500 to 40,000, 500 to 80,000, 1,000 to 2,000, 1,000 to 4,000, 1,000 to 8,000, 1,000 to 10,000, 1,000 to 20,000, 1,000 to 40,000, 1,000 to 80,000, 2,000 to 4,000, 2,000 to 8,000, 2,000 to 10,000, 2,000 to 20,000,
2,000 to 40,000, 2,000 to 80,000, 4,000 to 8,000, 4,000 to 10,000, 4,000 to 20,000, 4,000 to 40,000, 4,000 to 80,000, 8,000 to 10,000, 8,000 to 20,000, 8,000 to 40,000, 8,000 to 80,000, 10,000 to 20,000, 10,000 to 40,000, 10,000 to 80,000, 20,000 to 40,000, 20,000 to 80,000, or 40,000 to 80,000.
[0063] In some instances, a number of nucleotides that are sequenced is at least or about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500, 2000, 2500, 3000, or more than 3000 nucleotides. In some instances, the number of nucleotides that are sequenced is about 5 to about 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is at least 5 nucleotides. In some instances, the number of nucleotides that are sequenced is at most 3,000 nucleotides. In some instances, the number of nucleotides that are sequenced is 5 to 50, 5 to 100, 5 to 200, 5 to 400, 5 to 600, 5 to 800, 5 to 1,000, 5 to 1,500, 5 to 2,000, 5 to 2,500, 5 to 3,000, 50 to 100, 50 to 200, 50 to 400, 50 to 600, 50 to 800, 50 to 1,000, 50 to 1,500, 50 to 2,000, 50 to 2,500, 50 to 3,000, 100 to 200, 100 to 400, 100 to 600, 100 to 800, 100 to 1,000, 100 to 1,500, 100 to 2,000, 100 to 2,500, 100 to 3,000, 200 to 400, 200 to 600, 200 to 800, 200 to 1,000, 200 to 1,500, 200 to 2,000, 200 to 2,500, 200 to 3,000, 400 to 600, 400 to 800, 400 to 1,000, 400 to 1,500, 400 to 2,000, 400 to 2,500, 400 to 3,000, 600 to 800, 600 to 1,000, 600 to 1,500, 600 to 2,000, 600 to 2,500, 600 to 3,000, 800 to 1,000, 800 to 1,500, 800 to 2,000, 800 to 2,500, 800 to 3,000, 1,000 to 1,500, 1,000 to 2,000, 1,000 to 2,500, 1,000 to 3,000, 1,500 to 2,000, 1,500 to 2,500, 1,500 to 3,000, 2,000 to 2,500, 2,000 to 3,000, or 2,500 to 3,000 nucleotides.
[0064] Sequencing methods may include a barcoding or “tagging” step. In some instances, barcoding (or “tagging”) can allow for generation of a population of samples of nucleic acids, wherein each nucleic acid can be identified from which sample the nucleic acid originated. In some instances, the barcode comprises oligonucleotides that are ligated to the nucleic acids. In some instances, the barcode is ligated using an enzyme, including but not limited to, E. coli ligase, T4 ligase, mammalian ligases (e.g., DNA ligase I, DNA ligase II, DNA ligase III, DNA ligase IV), thermostable ligases, and fast ligases.
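The use of barcodes to trace each nucleic acid back to its sample of origin can be sketched as a demultiplexing step. The following is a minimal illustration, assuming the barcode occupies the first bases of each read and that up to one mismatch is tolerated; both choices, along with the names `hamming`, `assign_sample`, and `demultiplex`, are hypothetical and are not specified by the disclosure.

```python
from typing import Dict, Iterable, List, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read: str, barcode_to_sample: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose barcode matches the read prefix, or None.

    Reads matching zero barcodes or more than one barcode are left unassigned.
    """
    hits = []
    for barcode, sample in barcode_to_sample.items():
        prefix = read[:len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatches:
            hits.append(sample)
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads: Iterable[str], barcode_to_sample: Dict[str, str],
                max_mismatches: int = 1) -> Dict[str, List[str]]:
    """Group reads by sample of origin; unassigned reads are dropped."""
    bins: Dict[str, List[str]] = {}
    for read in reads:
        sample = assign_sample(read, barcode_to_sample, max_mismatches)
        if sample is not None:
            bins.setdefault(sample, []).append(read)
    return bins
```

Discarding ambiguous reads rather than guessing keeps cross-sample contamination out of downstream expression estimates, at the cost of losing some reads.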
[0065] Barcoding or tagging may occur using various types of barcodes or tags. Examples of barcodes or tags include, but are not limited to, a radioactive barcode or tag, a fluorescent barcode or tag, an enzyme, a chemiluminescent barcode or tag, and a colorimetric barcode or tag. In some instances, the barcode or tag is a fluorescent barcode or tag. In some instances, the fluorescent barcode or tag comprises a fluorophore. In some instances, the fluorophore is an aromatic or heteroaromatic compound. In some instances, the fluorophore is a pyrene, anthracene, naphthalene, acridine, stilbene, benzoxazole, indole, benzindole, oxazole, thiazole, benzothiazole, cyanine, carbocyanine, salicylate, anthranilate, xanthene dye, or coumarin. Examples of xanthene dyes include, e.g., fluorescein and rhodamine dyes. Fluorescein and rhodamine dyes include, but are not limited to, 6-carboxyfluorescein (FAM), 2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein (JOE), tetrachlorofluorescein (TET), 6-carboxyrhodamine (R6G), N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA), and 6-carboxy-X-rhodamine (ROX). In some instances, the fluorescent barcode or tag also includes the naphthylamine dyes that have an amino group in the alpha or beta position. For example, naphthylamino compounds include 1-dimethylaminonaphthyl-5-sulfonate, 1-anilino-8-naphthalene sulfonate and 2-p-toluidinyl-6-naphthalene sulfonate, 5-(2'-aminoethyl)aminonaphthalene-1-sulfonic acid (EDANS). Examples of coumarins include, e.g., 3-phenyl-7-isocyanatocoumarin; acridines, such as 9-isothiocyanatoacridine and acridine orange; N-(p-(2-benzoxazolyl)phenyl) maleimide; cyanines, such as, e.g., indodicarbocyanine 3 (Cy3), indodicarbocyanine 5 (Cy5), indodicarbocyanine 5.5 (Cy5.5), 3-(-carboxy-pentyl)-3'-ethyl-5,5'-dimethyloxacarbocyanine (CyA); 1H, 5H, 11H, 15H-Xantheno[2,3,4-ij:5,6,7-i'j']diquinolizin-18-ium, 9-[2 (or 4)-[[[6-[2,5-dioxo-1-pyrrolidinyl)oxy]-6-oxohexyl]amino]sulfonyl]-4 (or 2)-sulfophenyl]-2,3,6,7,12,13,16,17-octahydro-inner salt (TR or Texas Red); or BODIPY™ dyes.
[0066] In some instances, a different barcode or tag is supplied to a sample comprising nucleic acids. Examples of barcode lengths include barcode sequences comprising, without limitation, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more bases in length. Examples of barcode lengths include barcode sequences comprising, without limitation, from 1-5, 1-10, 5-20, or 1-25 bases in length. Barcode systems may be in base 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 or a similar coding scheme. In some instances, a number of barcodes is at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000, 25000, 30000, 40000, 50000, 100000, 500000, 1000000, or more than 1000000 barcodes. In some instances, a number of barcodes is in a range of 1-1000000 barcodes. In some instances, the number of barcodes is in a range of about 1-10, 1-50, 1-100, 1-500, 1-1000, 1-5000, 1-10000, 1-50000, 1-100000, 1-500000, 1-1000000, 10-50, 10-100, 10-500, 10-1000, 10-5000, 10-10000, 10-50000, 10-100000, 10-500000, 10-1000000, 50-100, 50-500, 50-1000, 50-5000, 50-10000, 50-50000, 50-100000, 50-500000, 50-1000000, 100-500, 100-1000, 100-5000, 100-10000, 100-50000, 100-100000, 100-500000, 100-1000000, 500-1000, 500-5000, 500-10000, 500-50000, 500-100000, 500-500000, 500-1000000, 1000-5000, 1000-10000, 1000-50000, 1000-100000, 1000-500000, 1000-1000000, 5000-10000, 5000-50000, 5000-100000, 5000-500000, 5000-1000000, 10000-50000, 10000-100000, 10000-500000, 10000-1000000, 50000-100000, 50000-500000, 50000-1000000, 100000-500000, 100000-1000000, or 500000-1000000 barcodes.
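The relationship between barcode length, coding base, and the number of distinguishable barcodes can be illustrated with a short calculation: a barcode of length L in base b admits at most b^L distinct sequences. The sketch below assumes that a base-4 system corresponds to the four nucleotides, which is an interpretation rather than a statement of the disclosure; the function names are hypothetical.

```python
def barcode_capacity(base: int, length: int) -> int:
    """Maximum number of distinct barcodes of a given length in a given base."""
    if base < 1 or length < 1:
        raise ValueError("base and length must be positive")
    return base ** length


def min_barcode_length(num_barcodes: int, base: int = 4) -> int:
    """Shortest barcode length able to encode `num_barcodes` distinct barcodes."""
    if num_barcodes < 1:
        raise ValueError("num_barcodes must be positive")
    if base < 2 and num_barcodes > 1:
        raise ValueError("a base below 2 can encode only one barcode")
    length = 1
    while base ** length < num_barcodes:
        length += 1
    return length
```

Under the base-4 assumption, 1,000,000 barcodes require a length of at least 10 bases, since 4^9 = 262,144 is too few and 4^10 = 1,048,576 is sufficient. Practical designs usually use longer barcodes than this minimum so that barcodes differ at several positions and tolerate sequencing errors.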
[0067] Following sequencing of a sample, sequencing data as described herein can be used for performing gene set enrichment analysis. Gene Set Enrichment Analysis (GSEA) is a computational method that seeks to determine whether a predefined set of genes (typically grouped together according to some biological feature such as molecular pathway or function) demonstrates statistically significant differences between two or more categories or biological states (e.g., treatment outcome status). A predefined set of genes may be evaluated to produce an output or metric such as, for example, a score corresponding to the difference between two or more categories or biological states. Multiple sets of genes can be evaluated to generate multiple such outputs or metrics. These outputs or metrics may comprise the features of a model such as a trained machine learning model configured to generate predictions with respect to the categories or biological state. The model may be a regression that generates an output along a continuum (e.g., any value between 0 and 1) or a classifier which generates a classification for a data set.
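The scoring of a predefined gene set against a ranked list of genes can be sketched as follows. This is a minimal illustration of the widely used running-sum enrichment statistic from the published GSEA method, and it is not asserted to be the scoring used by the systems described herein. The per-gene `scores` (for example, a difference in expression between two biological states), the `weight` exponent, and the function names are assumptions made for the example, and no permutation-based significance test is included.

```python
from typing import Dict, Iterable, List


def enrichment_score(scores: Dict[str, float], gene_set: Iterable[str],
                     weight: float = 1.0) -> float:
    """Signed maximum deviation from zero of a GSEA-style running sum.

    Genes are ranked by descending score. Walking down the ranking, the sum
    increases at gene-set members (weighted by |score| ** weight) and
    decreases at non-members.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    members = set(gene_set) & set(ranked)
    n_miss = len(ranked) - len(members)
    if not members or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    hit_total = sum(abs(scores[g]) ** weight for g in members)
    if hit_total == 0:
        raise ValueError("gene set members carry no weight")
    running, best = 0.0, 0.0
    for gene in ranked:
        if gene in members:
            running += abs(scores[gene]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def score_gene_sets(scores: Dict[str, float],
                    gene_sets: Dict[str, Iterable[str]],
                    weight: float = 1.0) -> List[float]:
    """One enrichment score per gene set, ordered by gene set name."""
    return [enrichment_score(scores, gene_sets[name], weight)
            for name in sorted(gene_sets)]
```

The list returned by `score_gene_sets` has one entry per gene set and could serve as the feature vector supplied to a regression or classifier of the kind described above. A gene set concentrated at the top of the ranking yields a score near +1, and one concentrated at the bottom yields a score near -1.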
[0068] Provided herein are systems and methods for processing a biological sample obtained from a subject. The sample often comprises a heterogeneous composition of different cell types and/or subtypes. Sometimes, the sample is a tumor sample. The cell types and/or subtypes that make up the sample include one or more of cancer cells, non-cancer cells, and/or immune cells. Examples of non-immune cells include salivary gland cells, mammary gland cells, lacrimal gland cells, ceruminous gland cells, eccrine sweat gland cells, apocrine sweat gland cells, sebaceous gland cells, Bowman's gland cells, Brunner's gland cells, prostate gland cells, seminal vesicle cells, bulbourethral gland cells, keratinizing epithelial cells, hair shaft cells, epithelial cells, exocrine secretory epithelial cells, uterus endometrium cells, isolated goblet cells of respiratory and digestive tracts, stomach lining mucous cells, hormone secreting cells, pituitary cells, gut and respiratory tract cells, thyroid gland cells, adrenal gland cells, chromaffin cells, Leydig cells, theca interna cells, macula densa cells of kidney, peripolar cells of kidney, mesangial cells of kidney, hepatocytes, white fat cells, brown fat cells, liver lipocytes, kidney cells, kidney glomerulus parietal cells, kidney glomerulus podocytes, kidney proximal tubule brush border cells, loop of Henle thin segment cells, kidney distal tubule cells, endothelial fenestrated cells, vascular endothelial continuous cells, synovial cells, serosal cells, squamous cells, columnar cells of endolymphatic sac with microvilli, columnar cells of endolymphatic sac without microvilli, vestibular membrane cells, stria vascularis basal cells, stria vascularis marginal cells, choroid plexus cells, respiratory tract ciliated cells, oviduct ciliated cells, uterine endometrial ciliated cells, rete testis ciliated cells, ductulus efferens ciliated cells, ciliated ependymal cells of central nervous system, organ of Corti interdental epithelial cells, loose connective tissue fibroblasts, corneal fibroblasts, tendon fibroblasts, bone marrow reticular tissue fibroblasts, other nonepithelial fibroblasts, pericytes, skeletal muscle cells, red skeletal muscle cells, white skeletal muscle cells, intermediate skeletal muscle cells, nuclear bag cells of muscle spindle, nuclear chain cells of muscle spindle, satellite cells, cardiac muscle cells, ordinary cardiac muscle cells, nodal cardiac muscle cells, Purkinje fiber cells, smooth muscle cells, myoepithelial cells of iris, myoepithelial cells of exocrine glands, erythrocytes, megakaryocytes, monocytes, epidermal Langerhans cells, osteoclasts, sensory neurons, olfactory receptor neurons, pain-sensitive primary sensory neurons, photoreceptor cells of retina in eye, photoreceptor rod cells, proprioceptive primary sensory neurons (various types), touch-sensitive primary sensory neurons, taste bud cells, autonomic neuron cells, Schwann cells, satellite cells, glial cells, astrocytes, oligodendrocytes, melanocytes, germ cells, nurse cells, interstitial cells, and pancreatic duct cells. Various cell types may be evaluated for the sample using methods as described herein including, but not limited to, lymphoid cells, stromal cells, stem cells, and myeloid cells. Examples of lymphoid cells include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells. In some instances, the cells are stromal cells, for example, mesenchymal stem cells, adipocytes, preadipocytes, stromal cells, fibroblasts, pericytes, endothelial cells, microvascular endothelial cells, lymphatic endothelial cells, smooth muscle cells, chondrocytes, osteoblasts, skeletal muscle cells, and myocytes. Examples of stem cells include, but are not limited to, hematopoietic stem cells, common lymphoid progenitor cells, common myeloid progenitor cells, granulocyte-macrophage progenitor cells, megakaryocyte-erythroid progenitor cells, multipotent progenitor cells, megakaryocytes, erythrocytes, and platelets. Examples of myeloid cells include, but are not limited to, monocytes, macrophages, macrophages M1, macrophages M2, dendritic cells, conventional dendritic cells, plasmacytoid dendritic cells, immature dendritic cells, neutrophils, eosinophils, mast cells, and basophils. Other cell types may be evaluated using methods as described herein, for example, epithelial cells, sebocytes, keratinocytes, mesangial cells, hepatocytes, melanocytes, keratocytes, astrocytes, and neurons.
[0069] In some instances, the sequencing data comprises genes that are differentially expressed by various immune cell types. Examples of immune cells to be detected by methods described herein include, but are not limited to, CD4+ memory T-cells, CD4+ naive T-cells, CD4+ T-cells, central memory T (Tcm) cells, effector memory T (Tem) cells, CD4+ Tcm, CD4+ Tem, CD8+ T-cells, CD8+ naive T-cells, CD8+ Tcm, CD8+ Tem, regulatory T cells (Tregs), T helper (Th) 1 cells, Th2 cells, gamma delta T (Tgd) cells, natural killer (NK) cells, natural killer T (NKT) cells, B-cells, naive B-cells, memory B-cells, class-switched memory B-cells, pro B-cells, and plasma cells.
Terms and Definitions
[0070] Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
[0071] As used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
[0072] As used herein, the term “about” in some cases refers to an amount that is approximately the stated amount.
[0073] As used herein, the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.
[0074] As used herein, the term “about” in reference to a percentage refers to an amount that is greater or less than the stated percentage by 10%, 5%, or 1%, including increments therein.
[0075] As used herein, the phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
[0076] The present disclosure employs, unless otherwise indicated, conventional molecular biology techniques, which are within the skill of the art. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art.
[0077] Throughout this disclosure, various embodiments are presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of any embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range to the tenth of
the unit of the lower limit unless the context clearly dictates otherwise. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual values within that range, for example, 1.1, 2, 2.3, 5, and 5.9. This applies regardless of the breadth of the range. The upper and lower limits of these intervening ranges may independently be included in the smaller ranges, and are also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure, unless the context clearly dictates otherwise.
[0078] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of any embodiment. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0079] The term “ribonucleic acid” or “RNA,” as used herein refers to a molecule comprising at least one ribonucleotide residue. RNA may include transcripts. By “ribonucleotide” is meant a nucleotide with a hydroxyl group at the 2’ position of a beta-D-ribo-furanose moiety. The term RNA includes, but is not limited to, mRNA, ribosomal RNA, tRNA, non-protein-coding RNA (npcRNA), non-messenger RNA, functional RNA (fRNA), long non-coding RNA (lncRNA), pre-mRNAs, and primary miRNAs (pri-miRNAs). The term RNA includes, for example, double-stranded (ds) RNAs; single-stranded RNAs; and isolated RNAs such as partially purified RNA, essentially pure RNA, synthetic RNA, recombinant RNA, as well as altered RNA that differ from naturally-occurring RNA by the addition, deletion, substitution and/or alteration of one or more nucleotides. Such alterations can include addition of non-nucleotide material, such as to the end(s) of the siRNA or internally, for example at one or more nucleotides of the RNA. Nucleotides in the RNA molecules described herein can also comprise non-standard nucleotides, such as non-naturally occurring nucleotides or chemically synthesized nucleotides or deoxynucleotides. These altered RNAs can be referred to as analogs or analogs of naturally-occurring RNA.
[0080] Unless specifically stated or obvious from context, as used herein, the term “about” in reference to a number or range of numbers is understood to mean the stated number and numbers +/- 10% thereof, or 10% below the lower listed limit and 10% above the higher listed limit for the values listed for a range.
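The +/- 10% reading of “about” given in this paragraph can be illustrated with a short calculation. The sketch below is only an arithmetic aid, assuming the 10% tolerance stated here rather than the 5% or 1% alternatives mentioned elsewhere; the function names `about_value` and `about_range` are hypothetical.

```python
from typing import Tuple


def about_value(x: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Interval covered by 'about x': x plus or minus tolerance * |x|."""
    delta = abs(x) * tolerance
    return (x - delta, x + delta)


def about_range(lower: float, upper: float,
                tolerance: float = 0.10) -> Tuple[float, float]:
    """Interval covered by 'about lower to about upper'.

    The lower limit is reduced by the tolerance and the upper limit is
    increased by it, each relative to its own magnitude.
    """
    return (lower - abs(lower) * tolerance, upper + abs(upper) * tolerance)
```

Under this reading, “about 200 to about 500” spans roughly 180 to 550, and “about 50” spans roughly 45 to 55.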
[0081] The term “sample,” as used herein, generally refers to a biological sample of a subject. The biological sample may be a tissue or fluid of the subject, such as blood (e.g., whole blood), plasma, serum, urine, saliva, mucosal excretions, sputum, stool and tears. The biological sample may be derived from a tissue or fluid of the subject. The biological sample may be a tumor sample or heterogeneous tissue sample. The biological sample may have or be suspected of having disease tissue. The tissue may be processed to obtain the biological sample. The biological sample may be a cellular sample. The biological sample may be a cell-free (or cell free) sample, such as cell-free DNA or RNA. The biological sample may comprise cancer cells, non-cancer cells, immune cells, non-immune cells, or any combination thereof. The biological sample may be a tissue sample. The biological sample may be a liquid sample. The liquid sample can be a cancer or non-cancer sample. Non-limiting examples of liquid biological samples include synovial fluid, whole blood, blood plasma, lymph, bone marrow, cerebrospinal fluid, serum, seminal fluid, urine, and amniotic fluid.
[0082] The term “variant,” as used herein, generally refers to a genetic variant, such as an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual. Single nucleotide polymorphisms (SNPs) are a form of polymorphisms. In some examples, one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences. Copy number variants (CNVs), transversions and other rearrangements are also forms of genetic variation. A genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.
[0083] The term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets. The subject can be a healthy individual, an individual that has or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. The subject can be a patient. The subject may have or be suspected of having a disease.
Computing system
[0084] Referring to FIG. 4, a block diagram is shown depicting an exemplary machine that includes a computer system 400 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies of the present disclosure. The components in FIG. 4 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
[0085] Computer system 400 may include one or more processors 401, a memory 403, and a storage 408 that communicate with each other, and with other components, via a bus 440. The bus 440 may also link a display 432, one or more input devices 433 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 434, one or more storage devices 435, and various tangible storage media 436. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 440. For instance, the various tangible storage media 436 can interface with the bus 440 via storage medium interface 426. Computer system 400 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
[0086] Computer system 400 includes one or more processor(s) 401 (e.g., central processing units (CPUs) or general purpose graphics processing units (GPGPUs)) that carry out functions. Processor(s) 401 optionally contain a cache memory unit 402 for temporary local storage of instructions, data, or computer addresses. Processor(s) 401 are configured to assist in execution of computer readable instructions. Computer system 400 may provide functionality for the components depicted in FIG. 4 as a result of the processor(s) 401 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 403, storage 408, storage devices 435, and/or storage medium 436. The computer-readable media may store software that implements particular embodiments, and processor(s) 401 may execute the software. Memory 403 may read the software from one or more other computer-readable media (such as mass storage device(s) 435, 436) or from one or more other sources through a suitable interface, such as network interface 420. The software may cause processor(s) 401 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps may include defining data structures stored in memory 403 and modifying the data structures as directed by the software.
[0087] The memory 403 may include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 404) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random access memory (FRAM), phase-change random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 405), and any combinations thereof. ROM 405 may act to communicate data and instructions unidirectionally to processor(s) 401, and RAM 404 may act to communicate data and instructions bidirectionally with processor(s) 401. ROM 405 and RAM 404 may include any suitable tangible computer-readable media described below. In one example, a basic input/output system 406 (BIOS), including basic routines that help to transfer information between elements within computer system 400, such as during start-up, may be stored in the memory 403.
[0088] Fixed storage 408 is connected bidirectionally to processor(s) 401, optionally through storage control unit 407. Fixed storage 408 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein. Storage 408 may be used to store operating system 409, executable(s) 410, data 411, applications 412 (application programs), and the like. Storage 408 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above. Information in storage 408 may, in appropriate cases, be incorporated as virtual memory in memory 403.
[0089] In one example, storage device(s) 435 may be removably interfaced with computer system 400 (e.g., via an external port connector (not shown)) via a storage device interface 425. Particularly, storage device(s) 435 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 400. In one example, software may reside, completely or partially, within a machine-readable medium on storage device(s) 435. In another example, software may reside, completely or partially, within processor(s) 401.
[0090] Bus 440 connects a wide variety of subsystems. Herein, reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate. Bus 440 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures. As an example, and not by way of limitation, such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, a HyperTransport (HTX) bus, a serial advanced technology attachment (SATA) bus, and any combinations thereof.
[0091] Computer system 400 may also include an input device 433. In one example, a user of computer system 400 may enter commands and/or other information into computer system 400 via input device(s) 433. Examples of an input device(s) 433 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof. In some embodiments, the input device is a Kinect, Leap Motion, or the like. Input device(s) 433 may be interfaced to bus 440 via any of a variety of input interfaces 423 including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
[0092] In particular embodiments, when computer system 400 is connected to network 430, computer system 400 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 430. Communications to and from computer system 400 may be sent through network interface 420. For example, network interface 420 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 430, and computer system 400 may store the incoming communications in memory 403 for processing. Computer system 400 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 403, to be communicated to network 430 from network interface 420. Processor(s) 401 may access these communication packets stored in memory 403 for processing.
[0093] Examples of the network interface 420 include, but are not limited to, a network interface card, a modem, and any combination thereof. Examples of a network 430 or network segment 430 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof. A network, such as network 430, may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
[0094] Information and data can be displayed through a display 432. Examples of a display 432 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light-emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof. The display 432 can interface to the processor(s) 401, memory 403, and fixed storage 408, as well as other devices, such as input device(s) 433, via the bus 440. The display 432 is linked to the bus 440 via a video interface 422, and transport of data between the display 432 and the bus 440 can be controlled via the graphics control 421. In some embodiments, the display is a video projector. In some embodiments, the display is a head-mounted display (HMD) such as a VR headset. In further embodiments, suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like. In still further embodiments, the display is a combination of devices such as those disclosed herein.
[0095] In addition to a display 432, computer system 400 may include one or more other peripheral output devices 434 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof. Such peripheral output devices may be connected to the bus 440 via an output interface 424. Examples of an output interface 424 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
[0096] In addition, or as an alternative, computer system 400 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein. Reference to software in this disclosure may encompass logic, and reference to logic may encompass software. Moreover, reference to a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware, software, or both.
[0097] Those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality.
[0098] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
[0099] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by one or more processor(s), or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
[0100] In accordance with the description herein, suitable computing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles. Those of skill in the art will also recognize that select televisions, video players, and digital music players with optional computer network connectivity are suitable for use in the system described herein. Suitable tablet computers, in various embodiments, include those with booklet, slate, and convertible configurations, known to those of skill in the art.
[0101] In some embodiments, the computing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smartphone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®. Those of skill in the art will also recognize that suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®. Those of skill in the art will also recognize that suitable video game console operating systems include, by way of non-limiting examples, Sony® PS3®, Sony® PS4®, Microsoft® Xbox 360®, Microsoft Xbox One, Nintendo® Wii®, Nintendo® Wii U®, and Ouya®.
Non-transitory computer readable storage medium
[0102] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device. In further embodiments, a computer readable storage medium is a tangible component of a computing device. In still further embodiments, a computer readable storage medium is optionally removable from a computing device. In some embodiments, a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semipermanently, or non-transitorily encoded on the media.
Computer program
[0103] In some embodiments, the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same. A computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.
[0104] The functionality of the computer readable instructions may be combined or distributed as desired in various environments. In some embodiments, a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
Software Modules
[0105] In some embodiments, the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same. In view of the disclosure provided herein, software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art. The software modules disclosed herein are implemented in a multitude of ways. In various embodiments, a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof. In various embodiments, the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application. In some embodiments, software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
Databases
[0106] In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for storage and retrieval of information. In various embodiments, suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase. In some embodiments, a database is internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing-based. In a particular embodiment, a database is a distributed database. In other embodiments, a database is based on one or more local computer storage devices.
Machine Learning
[0107] In some embodiments, machine learning algorithms are utilized to generate a trained model or classifier configured to process input data comprising a plurality of features and generate an output indicative of a predicted outcome or classification. The plurality of features may include scores based on gene sets, for example, GSEA gene set enrichment scores, although other metrics calculated based on gene sets are also contemplated.
[0108] In some embodiments, the machine learning algorithms herein employ one or more forms of labels including but not limited to human annotated labels and semi-supervised labels. The labels can be indicative of treatment outcomes for cancer patients. In particular, the labels may be indicative of response to immunotherapies. Examples of labels include complete response, partial response, stable disease, and progressive disease as measures of efficacy of a therapeutic intervention for a disease such as cancer.
[0109] In some embodiments, the machine learning algorithm utilizes regression modeling, wherein relationships between predictor variables and dependent variables are determined and weighted. In one embodiment, for example, the predicted outcome (e.g., responsiveness to an immunotherapy) is a dependent variable and is derived from a plurality of biological features such as GSEA enrichment scores.
[0110] Examples of machine learning algorithms can include a support vector machine (SVM), a naive Bayes classification, a random forest, a neural network, deep learning, principal component analysis (PCA), or other supervised learning algorithm or unsupervised learning algorithm for classification and regression. The machine learning algorithms can be trained using one or more training datasets.
[0111] In some embodiments, a machine learning algorithm uses a supervised learning approach. In supervised learning, the algorithm generates a function from labeled training data. Each training example is a pair consisting of an input object and a desired output value. In some embodiments, an optimal scenario allows for the algorithm to correctly determine the class labels for unseen instances. In some embodiments, a supervised learning algorithm requires the user to determine one or more control parameters. These parameters are optionally adjusted by optimizing performance on a subset, called a validation set, of the training set. After parameter adjustment and learning, the performance of the resulting function is optionally measured on a test set that is separate from the training set. Regression methods are commonly used in supervised learning. Accordingly, supervised learning allows for a model or classifier to be generated or trained with training data in which the expected output is known in advance such as when the ground truth location for a communication is known.
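By way of non-limiting illustration, the division of labeled data into a training set, a validation set used for adjusting control parameters, and a separate test set may be sketched as follows. The function name `split_dataset`, the split fractions, and the example data are hypothetical and are not taken from the disclosure.

```python
import random


def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into train, validation, and test sets."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                    # held out until after tuning
    validation = shuffled[n_test:n_test + n_val]  # used to adjust control parameters
    train = shuffled[n_test + n_val:]           # used to fit the model
    return train, validation, test


# Hypothetical labeled examples: (feature vector, desired output) pairs.
examples = [([i * 0.1, 1.0 - i * 0.1], i % 2) for i in range(10)]
train, validation, test = split_dataset(examples)
```

Keeping the test set untouched during parameter adjustment is what allows it to estimate performance on unseen instances.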
[0112] In some embodiments, a machine learning algorithm uses an unsupervised learning approach. In unsupervised learning, the algorithm generates a function to describe hidden structures from unlabeled data (e.g., a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm. Approaches to unsupervised learning include: clustering, anomaly detection, and neural networks.
[0113] In some embodiments, a machine learning algorithm uses a semi-supervised learning approach. Semi-supervised learning combines both labeled and unlabeled data to generate an appropriate function or classifier. Semi-supervised learning is usually used in data augmentation.
[0114] In some embodiments, a machine learning algorithm uses a reinforcement learning approach. In reinforcement learning, the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
[0115] In some embodiments, a machine learning algorithm learns in batches based on the training dataset and other inputs for that batch. In other embodiments, the machine learning algorithm performs on-line learning where the weights and error calculations are constantly updated.
[0116] In some embodiments, a machine learning algorithm uses a transduction approach. Transduction is similar to supervised learning but does not explicitly construct a function. Instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
[0117] In some embodiments, a machine learning algorithm uses a “learning to learn” approach. In learning to learn, the algorithm learns its own inductive bias based on previous experience.
[0118] In some embodiments, a machine learning algorithm is applied to new or updated emergency data to be re-trained to generate a new prediction model. In some embodiments, a machine learning algorithm or model is re-trained periodically. In some embodiments, a machine learning algorithm or model is re-trained non-periodically. In some embodiments, a machine learning algorithm or model is re-trained at least once a day, a week, a month, or a year or more. In some embodiments, a machine learning algorithm or model is re-trained at least once every 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 days or more.
[0119] In some embodiments, a machine learning algorithm is provided with unlabeled or unclassified data for unsupervised learning, which leaves the algorithm to identify hidden structure amongst the cases (e.g., clustering). In some embodiments, unsupervised learning is used to identify the representations that are most useful for classifying raw data (e.g., identifying features that help separate subjects into separate cohorts that may be analyzed using different models and/or evaluated with different thresholds or rules). For example, unsupervised learning is capable of identifying hidden patterns such as relationships between certain features from the data in the knowledge base that would not be readily apparent to a human.
[0120] In some embodiments, one or more sets of training data are generated and provided to a computer-implemented system comprising one or more algorithms for making predictions. In some embodiments, an algorithm utilizes a predictive model such as a neural network, a decision tree, a support vector machine, or other applicable model. Using the training data, an algorithm is able to form a classifier for generating a classification or prediction according to relevant features. The features selected for classification can be classified using a variety of viable methods. In some embodiments, the trained algorithm comprises a machine learning algorithm. In some embodiments, the machine learning algorithm is selected from at least one of a supervised, semi-supervised, and unsupervised learning algorithm, such as, for example, a support vector machine (SVM), a Naive Bayes classification, a random forest, an artificial neural network, a decision tree, a K-means, learning vector quantization (LVQ), regression algorithm (e.g., linear, logistic, multivariate), association rule learning, deep learning, dimensionality reduction and ensemble selection algorithms. In some embodiments, the machine learning algorithm is a support vector machine (SVM), a Naive Bayes classification, a random forest, or an artificial neural network. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
EXAMPLES
[0121] The following illustrative examples are representative of embodiments of the software applications, systems, and methods described herein and are not meant to be limiting in any way.
Example 1
[0122] Tumor samples were obtained from subjects having HNSCC (Adkins), bladder cancer, and melanoma. RNA extraction was performed on the tumor samples and used for subsequent library generation using the Lexogen QuantSeq 3’ mRNA-Seq library Prep Kit FWD for Illumina. The mRNA library was subjected to next generation sequencing using the Illumina NextSeq sequencing platform to generate gene expression data. Single-sample gene set enrichment analysis (ssGSEA) was conducted according to gene sets derived from MSigDB, including KEGG and BioCarta. The 24 gene sets listed in Table 2 were subjected to GSEA to determine scores for each of the gene sets.
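By way of non-limiting illustration, a single-sample enrichment score for one gene set may be sketched as a rank-weighted running-sum statistic. The sketch below is a simplified approximation of the general ssGSEA idea and is not asserted to be the implementation used in this example; the function name `ssgsea_score`, the weighting exponent `alpha=0.25`, and the gene names are assumptions made for illustration only.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one gene set.

    expression: dict mapping gene name -> expression value for one sample.
    gene_set:   iterable of gene names belonging to the set.
    """
    # Order genes from highest to lowest expression in this sample.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    members = set(gene_set) & set(ordered)
    if not members or len(members) == n:
        raise ValueError("gene set must overlap with, but not cover, the measured genes")

    # Rank-based weight: the most highly expressed gene gets the largest weight.
    weights = {gene: (n - i) ** alpha for i, gene in enumerate(ordered)}
    total_in = sum(weights[g] for g in members)
    miss_step = 1.0 / (n - len(members))

    p_in = p_out = score = 0.0
    for gene in ordered:
        if gene in members:
            p_in += weights[gene] / total_in  # weighted running sum for set members
        else:
            p_out += miss_step                # uniform running sum for non-members
        score += p_in - p_out                 # accumulate the gap between the two sums
    return score


# Hypothetical expression profile for one sample (gene names are placeholders).
sample = {"A": 9.0, "B": 7.0, "C": 5.0, "D": 3.0, "E": 1.0, "F": 0.5}
high_set_score = ssgsea_score(sample, ["A", "B"])  # set of highly expressed genes
low_set_score = ssgsea_score(sample, ["E", "F"])   # set of lowly expressed genes
```

A gene set concentrated among the most highly expressed genes yields a positive score, and one concentrated among the least expressed genes yields a negative score. Repeating the calculation over each gene set gives one feature per gene set for the sample.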
[0123] The ssGSEA analysis produced a set of 24 enrichment scores for the 24 corresponding gene sets for the HNSCC tumor samples. These 24 enrichment scores of the tumor samples were used to train a machine learning model using linear principal component analysis (PCA) and support vector machine (SVM) methods in order to predict objective response and survival.
[0124] The trained model (the “ssGSEA biomarker model”) was then evaluated for ability to predict treatment outcome. As shown in FIG. 1, the model was evaluated using an Out Of Bag Receiver Operating Characteristic (OOB ROC) analysis, which is a way to estimate model performance on untrained datasets. The Area Under the Curve (AUC) of the ROC curve for the model was 0.85, indicating that the model performs well (high true positive rate and low false positive rate) at predicting treatment outcome.
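By way of non-limiting illustration, the area under the ROC curve may be computed directly from held-out scores as the fraction of responder/non-responder pairs that the model orders correctly. The sketch below shows only this AUC calculation and does not reproduce the out-of-bag resampling; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative case.

    labels: sequence of 1 (e.g., responder) or 0 (e.g., non-responder).
    scores: sequence of model output scores, same length as labels.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # correctly ordered pair
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels and scores, for illustration only.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = roc_auc(example_labels, example_scores)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the scale on which the reported value of 0.85 is read.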
[0125] FIG. 2 is a plot showing the mean scores of individual samples in the training set (on average across OOB samplings). These data show a 96% negative predictive value (NPV) and 93% sensitivity (SN).
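By way of non-limiting illustration, negative predictive value and sensitivity may be computed from true outcomes and binary predictions as sketched below. The function name `npv_and_sensitivity` and the example values are hypothetical and do not reproduce the data of FIG. 2.

```python
def npv_and_sensitivity(labels, predictions):
    """Return (NPV, sensitivity) for binary labels and predictions (1 = positive)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    if tp + fn == 0 or tn + fn == 0:
        raise ValueError("need at least one actual positive and one predicted negative")
    sensitivity = tp / (tp + fn)  # fraction of actual positives that were predicted positive
    npv = tn / (tn + fn)          # fraction of predicted negatives that were truly negative
    return npv, sensitivity


# Hypothetical outcomes (1 = responder) and model calls, for illustration only.
true_outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
model_calls = [1, 1, 0, 0, 0, 0, 1, 1]
npv, sensitivity = npv_and_sensitivity(true_outcomes, model_calls)
```

A high NPV means that a patient scored as a likely non-responder rarely turns out to respond, which is the property relevant to a no-treat decision.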
[0126] If the treat or no-treat decision was based on the median score being used as the demarcation (e.g., if a patient sample’s score is below the median score, the patient will not receive a treatment, and if a patient sample’s score is above the median score, the patient will receive a treatment), the ssGSEA biomarker model applied to the training set has the performance shown in Table 4.
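By way of non-limiting illustration, the median-based treat or no-treat rule may be sketched as follows. The function name `treat_decisions`, the patient identifiers, and the scores are hypothetical. The handling of a score exactly equal to the median is not specified above; the sketch assumes such a score is assigned to no-treat.

```python
from statistics import median


def treat_decisions(scores):
    """Map each patient to True (treat) or False (no-treat) using the median score.

    scores: dict mapping patient identifier -> model output score.
    Assumption: a score equal to the median is treated as no-treat.
    """
    cutoff = median(scores.values())
    return {patient: score > cutoff for patient, score in scores.items()}


# Hypothetical model scores, for illustration only.
cohort_scores = {"P1": 0.12, "P2": 0.45, "P3": 0.30, "P4": 0.81, "P5": 0.66}
decisions = treat_decisions(cohort_scores)
```
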
Table 4
[0127] Physicians can use the ssGSEA biomarker model in future clinical decision making by considering the disease control rate (DCR). The DCR is the percentage of patients who had a treatment response (e.g., patients who achieved complete response, partial response, or stable disease to treatment) and is similar to “likelihood of response”. Here, the DCR of HNSCC patients in response to immune-oncology (I/O) treatment is considered. As shown in FIG. 3, the output scores were grouped into four quartiles Q1, Q2, Q3, and Q4, with Q1 having the lowest 25% of scores and Q4 having the highest 25% of scores. The lower the score, the lower the anticipated benefit of the drug, as evidenced by the correlation between quartile and DCR. In FIG. 3, the Q1 and Q2 divisions show a low DCR (less than 10%), whereas Q3 and Q4 have a high DCR (greater than about 40%).
[0128] The expected DCR in response to I/O treatment for HNSCC patients is about 30%. Therefore, if a patient’s sample has a high score and the score falls into Q3 or Q4, physicians may recommend I/O treatment, as HNSCC patients in these categories have a DCR in response to I/O of greater than about 40%.
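By way of non-limiting illustration, grouping output scores into quartiles and computing the DCR within each quartile may be sketched as follows. The function name `dcr_by_quartile`, the response codes (CR, PR, SD, PD), and the example data are hypothetical and do not reproduce the data of FIG. 3.

```python
DISEASE_CONTROL = {"CR", "PR", "SD"}  # complete response, partial response, stable disease


def dcr_by_quartile(scores, responses):
    """Return the disease control rate for each score quartile, Q1 (lowest) to Q4.

    scores:    list of model output scores, one per patient.
    responses: list of response codes ("CR", "PR", "SD", or "PD"), same order.
    """
    if len(scores) != len(responses):
        raise ValueError("scores and responses must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # lowest score first
    n = len(order)
    rates = {}
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]
        controlled = sum(1 for i in members if responses[i] in DISEASE_CONTROL)
        rates[f"Q{q + 1}"] = controlled / len(members) if members else None
    return rates


# Hypothetical scores and responses, for illustration only.
cohort = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = ["PD", "PD", "PD", "SD", "PR", "PD", "CR", "SD"]
quartile_dcr = dcr_by_quartile(cohort, outcomes)
```

Comparing each quartile's DCR against the expected DCR for the population indicates which score ranges are associated with a higher than expected likelihood of disease control.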
[0129] As compared to models that use features corresponding to individual genetic biomarkers, this approach of using gene sets has demonstrated surprisingly accurate performance across multiple cancer types such as HNSCC. An HNSCC ssGSEA biomarker model achieved superior results compared to a clinically used biomarker PD-L1 model (FIG. 6).
[0130] When compared to other literature methods, the instant methods perform as well or better, as shown in Table 1. Moreover, this technique is effective across multiple cancer types. The results are shown in Table 5.
Table 5
[0131] While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure.
EMBODIMENTS
[0132] In some cases, the present disclosure provides a method according to the following embodiments:
[0133] Embodiment 1. A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets linked to the plurality of biological features, thereby generating an output; and generating a determination indicative of a treatment outcome based on the output.
[0134] Embodiment 2. The method of embodiment 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
[0135] Embodiment 3. The method of embodiment 2, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, interferon gamma, antigen presentation, T-cell exhaustion, or any combination thereof.
[0136] Embodiment 4. The method of embodiment 2, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
[0137] Embodiment 5. The method of embodiment 2, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets from a molecular signature database (MSigDB).
[0138] Embodiment 6. The method of embodiment 5, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
[0139] Embodiment 7. The method of embodiment 1, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
[0140] Embodiment 8. The method of embodiment 1, further comprising obtaining the biological sample of said subject.
[0141] Embodiment 9. The method of embodiment 8, wherein said biological sample is a solid tumor or liquid biopsy.
[0142] Embodiment 10. The method of embodiment 8, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
[0143] Embodiment 11. The method of embodiment 8, wherein said biological sample comprises cancer tissue.
[0144] Embodiment 12. The method of embodiment 11, wherein said cancer tissue comprises tumor-infiltrating immune cells.
[0145] Embodiment 13. The method of embodiment 11, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
[0146] Embodiment 14. The method of embodiment 1, further comprising processing said biological sample to prevent or inhibit tissue degradation.
[0147] Embodiment 15. The method of embodiment 14, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
[0148] Embodiment 16. The method of embodiment 1, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
[0149] Embodiment 17. The method of embodiment 16, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
[0150] Embodiment 18. The method of embodiment 1, wherein said disease or condition is cancer.
[0151] Embodiment 19. The method of embodiment 18, wherein said cancer is a solid cancer or a hematopoietic cancer.
[0152] Embodiment 20. The method of embodiment 18, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
[0153] Embodiment 21. The method of embodiment 20, further comprising selecting said subject for prediction of said treatment outcome based on said status.
[0154] Embodiment 22. The method of embodiment 21, wherein said treatment outcome corresponds to one or more cancer treatments.
[0155] Embodiment 23. The method of embodiment 22, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
[0156] Embodiment 24. The method of embodiment 22, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
[0157] Embodiment 25. The method of embodiment 24, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
[0158] Embodiment 26. A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
[0159] Embodiment 27. The method of embodiment 26, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
[0160] Embodiment 28. The method of embodiment 26, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
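As a non-limiting illustration of the workflow recited in Embodiments 1 and 26, the following Python sketch computes one metric per gene set for each sample and trains a simple classifier on those metrics. Everything specific in it is an assumption made for illustration only: the gene names (G1 to G6), the two gene sets, the expression values, the mean z-score metric (a simplified stand-in for gene set enrichment analysis, not the analysis described in the disclosure), and the choice of logistic regression as the machine learning algorithm.

```python
import math

# Hypothetical gene sets linked to biological features (illustrative names only).
GENE_SETS = {
    "INFLAMMATION": ["G1", "G2"],
    "T_CELL_EXHAUSTION": ["G3", "G4"],
}

def gene_set_scores(expression, gene_sets):
    """Return one metric per gene set (sorted by set name) for one sample.

    The metric is the mean z-score of the set's genes within the sample,
    a simplified stand-in for a gene set enrichment statistic.
    """
    values = list(expression.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid division by zero for flat profiles
    scores = []
    for name in sorted(gene_sets):
        members = [g for g in gene_sets[name] if g in expression]
        if not members:
            scores.append(0.0)
            continue
        z_sum = sum((expression[g] - mean) / std for g in members)
        scores.append(z_sum / len(members))
    return scores

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict(weights, bias, x):
    """Predicted probability of the positive class (here: responder)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy pre-treatment expression profiles with known outcomes (fabricated for the demo).
samples = [
    {"G1": 9, "G2": 8, "G3": 2, "G4": 1, "G5": 5, "G6": 5},  # responder
    {"G1": 8, "G2": 9, "G3": 1, "G4": 2, "G5": 5, "G6": 5},  # responder
    {"G1": 2, "G2": 1, "G3": 9, "G4": 8, "G5": 5, "G6": 5},  # non-responder
    {"G1": 1, "G2": 2, "G3": 8, "G4": 9, "G5": 5, "G6": 5},  # non-responder
]
labels = [1, 1, 0, 0]

features = [gene_set_scores(s, GENE_SETS) for s in samples]
weights, bias = train_logistic(features, labels)

# Determination for one sample based on the model output.
probability = predict(weights, bias, features[0])
print("likely responder" if probability > 0.5 else "likely non-responder")
```

In this sketch the per-gene-set metrics, rather than individual gene expression values, form the model input, which mirrors the structure of the embodiments; any enrichment statistic or learning algorithm could be substituted for the illustrative ones used here.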
[0161] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims
1. A method for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising: obtaining gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conducting a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; processing, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generating a determination indicative of the treatment outcome based on the output.
2. The method of claim 1, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
3. The method of claim 1, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
4. The method of claim 1, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
5. The method of claim 1, wherein the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
6. The method of claim 5, wherein the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
7. The method of claim 1, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
8. The method of claim 7, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
9. The method of any one of claims 1 to 8, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
10. The method of any one of claims 1 to 9, further comprising obtaining the biological sample of said subject.
11. The method of claim 10, wherein said biological sample is a solid tumor or liquid biopsy.
12. The method of claim 10, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
13. The method of claim 10, wherein said biological sample comprises cancer tissue.
14. The method of claim 13, wherein said cancer tissue comprises tumor-infiltrating immune cells.
15. The method of claim 13, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
16. The method of any one of claims 1 to 15, further comprising processing said biological sample to prevent or inhibit tissue degradation.
17. The method of claim 16, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
18. The method of any one of claims 1 to 17, further comprising extracting RNA from said biological sample, generating an RNA library from said extracted RNA, and performing RNA-Seq on the RNA library to generate said gene expression data.
19. The method of claim 18, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
20. The method of claim 1, wherein said disease or condition is cancer.
21. The method of claim 20, wherein said cancer is a solid cancer or a hematopoietic cancer.
22. The method of claim 20, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
23. The method of claim 22, further comprising selecting said subject for prediction of said treatment outcome based on said status.
24. The method of claim 23, wherein said treatment outcome corresponds to one or more cancer treatments.
25. The method of claim 24, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
26. The method of claim 24, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
27. The method of claim 26, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
28. The method of claim 27, further comprising selecting said subject for generating said determination indicative of said treatment outcome based on a current status of said disease or condition.
29. The method of any one of claims 1 to 28, wherein said subject is treated based at least on said determination indicative of said treatment outcome.
30. The method of any one of claims 1 to 29, wherein said subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
31. A computer-implemented system for analyzing a biological sample obtained from a subject having or suspected of having a disease or condition, comprising a processor and non-transitory computer readable storage medium comprising instructions that, when executed by the processor, cause the processor to: obtain gene expression data corresponding to a plurality of gene sets linked to a plurality of biological features; conduct a differential gene set enrichment analysis on the gene expression data to generate a plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features; process, using a machine learning model, the plurality of metrics corresponding to the plurality of gene sets associated with the plurality of biological features, thereby generating an output; wherein the output comprises a trained machine learning model configured to predict a treatment outcome; and generate a determination indicative of the treatment outcome based on the output.
32. The system of claim 31, wherein the plurality of biological features comprises a biological process, gene ontology (GO), molecular function, or molecular pathway.
33. The system of claim 31 or 32, wherein the plurality of biological features comprises gene signatures corresponding to inflammation, cytotoxicity, immune cell infiltration, interferon gamma signaling, antigen presentation, T-cell exhaustion, or any combination thereof.
34. The system of any one of claims 31 to 33, wherein the plurality of gene sets comprises one, two, three, four, five, or six gene sets listed in Table 1.
35. The system of any one of claims 31 to 34, wherein the plurality of gene sets comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 gene sets listed in Table 2.
36. The system of any one of claims 31 to 35, wherein the gene set enrichment analysis uses no more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of genes listed for one or more of the plurality of gene sets in Table 1.
37. The system of any one of claims 31 to 36, wherein the plurality of gene sets comprises at least five, ten, fifteen, twenty, or twenty-five gene sets obtained or derived from a molecular signature database.
38. The system of any one of claims 31 to 37, wherein the plurality of gene sets comprises one or more gene sets from at least one of the following gene set collections: hallmark gene sets, positional gene sets, curated gene sets, regulatory target gene sets, computational gene sets, ontology gene sets, oncogenic signature gene sets, immunologic signature gene sets, or cell type signature gene sets.
39. The system of any one of claims 31 to 38, wherein the gene expression data comprises next generation sequencing data, microarray expression data, or quantitative PCR data.
40. The system of any one of claims 31 to 39, wherein the processor is configured to obtain the gene expression data for the biological sample of said subject from a database.
41. The system of any one of claims 31 to 40, wherein said biological sample is a solid tumor or liquid biopsy.
42. The system of any one of claims 31 to 41, wherein said biological sample comprises blood, serum, plasma, lymph, urine, saliva, tears, cerebrospinal fluid, amniotic fluid, bile, ascites fluid, or organ or tissue sample.
43. The system of any one of claims 31 to 42, wherein said biological sample comprises cancer tissue.
44. The system of claim 43, wherein said cancer tissue comprises tumor-infiltrating immune cells.
45. The system of claim 43, wherein said biological sample is a mixed sample comprising said cancer tissue and non-cancer cells.
46. The system of any one of claims 43 to 45, wherein said biological sample is processed to prevent or inhibit tissue degradation.
47. The system of any one of claims 31 to 46, wherein said biological sample is processed into a formalin-fixed paraffin-embedded sample.
48. The system of any one of claims 31 to 47, wherein the RNA is extracted from said biological sample, an RNA library is generated from said extracted RNA, and RNA-Seq is performed on the RNA library to generate said gene expression data.
49. The system of claim 48, wherein said RNA library is an mRNA library and said gene expression data comprises transcriptome sequencing data.
50. The system of any one of claims 31 to 49, wherein said disease or condition is cancer.
51. The system of claim 50, wherein said cancer is a solid cancer or a hematopoietic cancer.
52. The system of claim 50, wherein said cancer comprises a status selected from a stage I cancer, stage II cancer, stage III cancer, stage IV cancer, in remission, or a refractory or resistant cancer.
53. The system of claim 52, wherein said subject is selected for prediction of said treatment outcome based on said status.
54. The system of claim 53, wherein said treatment outcome corresponds to one or more cancer treatments.
55. The system of claim 54, wherein said one or more cancer treatments comprises at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
56. The system of any one of claims 31 to 55, wherein said subject has undergone at least one of surgery, chemotherapy, radiation therapy, targeted therapy, immunotherapy, stem cell or bone marrow transplant, or hormone therapy.
57. The system of claim 56, wherein said subject is evaluated for prediction of treatment outcome for an alternative therapy based on cancer recurrence following an earlier treatment, an adverse reaction causing discontinuation of an earlier treatment, or lack of response or partial response to an earlier treatment.
58. The system of any one of claims 31 to 57, wherein said subject is selected for evaluation to generate said determination indicative of said treatment outcome based on a current status of said disease or condition.
59. The system of any one of claims 31 to 58, wherein said subject is treated based at least on said determination indicative of said treatment outcome.
60. The system of any one of claims 31 to 59, wherein said subject undergoes a new treatment, discontinues a current treatment, modifies said current treatment, replaces said current treatment with said new treatment, or undergoes said new treatment in addition to said current treatment based at least on said determination indicative of said treatment outcome.
61. A method for generating a trained machine learning model configured to generate a prediction of treatment outcome, comprising: processing a plurality of biological samples obtained from subjects in order to generate a plurality of RNA libraries, wherein each of said subjects has a classification corresponding to treatment outcome; performing sequencing on said plurality of RNA libraries to generate a plurality of gene expression data sets for said plurality of biological samples; conducting gene set enrichment analysis on each of said plurality of gene expression data sets to generate a plurality of metrics corresponding to a plurality of gene sets linked to a plurality of biological features; and training a model, using a machine learning algorithm, with a training data set comprising said classification corresponding to treatment outcome and said plurality of metrics corresponding to said plurality of gene sets linked to said plurality of biological features for each of said plurality of gene expression data sets, thereby generating a trained machine learning model configured to predict treatment outcome.
62. The method of claim 61, wherein said plurality of biological samples are obtained from said subjects prior to receiving said treatment and said subjects are classified according to said treatment outcome after receiving said treatment.
63. The method of claim 61 or 62, further comprising configuring a computing system with a software module comprising said trained machine learning model, wherein said software module is configured to process input gene expression data using said trained machine learning model to generate predictions of treatment outcome.
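By way of non-limiting illustration only, the enrichment step recited in the claims above could be approximated with the classic unweighted running-sum statistic used in gene set enrichment analysis. The Python sketch below is an assumption about one possible form of that step, not the claimed implementation: the function names, the gene identifiers (A to F), and the ranking by simple difference from a reference profile are all hypothetical.

```python
def rank_by_difference(sample, reference):
    """Rank genes from most up-regulated to most down-regulated.

    The ranking metric here is the plain difference between a sample and a
    reference profile; other metrics (for example a fold change or a test
    statistic) could be used instead.
    """
    return sorted(sample, key=lambda g: sample[g] - reference[g], reverse=True)

def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style running-sum enrichment score.

    Walking down the ranked list, the running sum steps up by 1/N_hit for
    each gene in the set and down by 1/N_miss for each gene outside it.
    The score is the signed value of the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy profiles (fabricated for the demo): genes A and B are elevated, E and F reduced.
sample = {"A": 9, "B": 8, "C": 5, "D": 5, "E": 2, "F": 1}
reference = {gene: 5 for gene in sample}

ranked = rank_by_difference(sample, reference)
score_up = enrichment_score(ranked, ["A", "B"])
score_down = enrichment_score(ranked, ["E", "F"])
print(ranked, score_up, score_down)
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom yields a negative score; a vector of such scores, one per gene set, is the kind of metric set that a machine learning model could then take as input.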
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263346718P | 2022-05-27 | 2022-05-27 | |
US63/346,718 | 2022-05-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023230321A1 (en) | 2023-11-30 |
Family
ID=88919965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/023681 WO2023230321A1 (en) | 2022-05-27 | 2023-05-26 | Machine learning systems and methods for gene set enrichment analysis and scoring |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023230321A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006103442A2 (en) * | 2005-04-01 | 2006-10-05 | Ncc Technology Ventures Pte. Ltd. | Materials and methods relating to breast cancer classification |
WO2019109089A1 (en) * | 2017-12-01 | 2019-06-06 | Illumina, Inc. | Systems and methods for assessing drug efficacy |
JP2020178667A (en) * | 2019-04-26 | 2020-11-05 | 国立大学法人 東京大学 | Prediction method of effect and prognosis of cancer treatment, and selection method of treatment means |
WO2021092224A1 (en) * | 2019-11-05 | 2021-05-14 | Cofactor Genomics, Inc. | Methods and systems of processing complex data sets using artificial intelligence and deconvolution |
Non-Patent Citations (1)
Title |
---|
ZALA, J. ET AL.: "Ranking metrics in gene set enrichment analysis : do they matter", BMC BIOINFORMATICS, vol. 18, 2017, pages 1 - 12, XP021244998, DOI: 10.1186/s12859-017-1674-0 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23812625 Country of ref document: EP Kind code of ref document: A1 |