WO2024081737A1 - Biomarker selection for machine learning enabled prediction of treatment response
- Publication number
- WO2024081737A1 (application PCT/US2023/076606)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- machine learning
- sequence data
- treatment
- feature set
- learning models
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Definitions
- the subject matter described herein relates generally to machine learning and more specifically to biomarker selection for machine learning based techniques for predicting patient response to various treatments.
- a biological marker is a measurable indicator of a biological state or condition.
- a biomarker may be a medical sign that is capable of being measured accurately and reproducibly to provide an objective indication of a medical state of a patient. Accordingly, biomarkers are distinct from medical symptoms, which are patient-perceived and patient-reported indications of health and illness. Biomarkers can be measured and evaluated in a variety of ways including, for example, using blood, urine, soft tissues, and/or the like. Moreover, biomarkers may be used to examine normal biological processes, pathogenic processes, and pharmacologic responses to therapeutic intervention.
- a system that includes at least one processor and at least one memory.
- the at least one memory may include program code that provides operations when executed by the at least one processor.
- the operations may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
- a method for biomarker selection for machine learning enabled prediction of treatment response may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
- a computer program product including a non-transitory computer readable medium storing instructions.
- the instructions may cause operations when executed by at least one data processor.
- the operations may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
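The two-stage selection recited in the operations above can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the applicant's implementation: each "model" is a hypothetical evaluator that takes a list of feature names and returns a performance score plus a per-feature impact, and `SIGNAL`, `toy_model`, `select_features`, and the gene labels `G1` to `G6` are invented for the example. A real evaluator would train and validate a model on the sequence data.

```python
# Hypothetical per-gene signal strength, used only to drive the toy evaluators.
SIGNAL = {"G1": 0.9, "G2": 0.1, "G3": 0.7, "G4": 0.2, "G5": 0.05, "G6": 0.6}

def toy_model(noise):
    """Return a stand-in evaluator: features -> (score, impact per feature)."""
    def evaluate(features):
        impact = {f: SIGNAL[f] * (1 - noise) for f in features}
        return sum(impact.values()), impact
    return evaluate

def best_features(features, models, top_models, keep):
    """Score every model on the features, then keep the highest-impact
    features of the top-performing model(s)."""
    results = sorted((m(features) for m in models), key=lambda r: r[0], reverse=True)
    chosen = []
    for _, impact in results[:top_models]:
        ranked = sorted(impact, key=impact.get, reverse=True)
        chosen += [f for f in ranked[:keep] if f not in chosen]
    return chosen

def select_features(subsets, models, top_models=1, per_subset=1, final_size=2):
    # Stage 1: select within each subset and pool the results (ensembled set).
    ensembled = []
    for subset in subsets:
        picked = best_features(subset, models, top_models, per_subset)
        ensembled += [f for f in picked if f not in ensembled]
    # Stage 2: re-apply the models to the ensembled set and keep the final set.
    final = best_features(ensembled, models, top_models, final_size)
    return ensembled, final

models = [toy_model(0.0), toy_model(0.5)]
subsets = [["G1", "G2"], ["G3", "G4"], ["G5", "G6"]]
ensembled, final = select_features(subsets, models)
print(ensembled, final)
```

The defaults (`top_models=1`, `per_subset=1`, `final_size=2`) are arbitrary; the source does not state how many models or features are kept at each stage.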
- the one or more top performing machine learning models applied to each subset of sequence data may be identified.
- a plurality of features included in each subset of sequence data may be ranked based on a respective impact of each feature on the first performance of the one or more top performing machine learning models applied to each subset of sequence data.
- the one or more features included in the ensembled feature set may be ranked based on a respective impact of each feature on the second performance of the one or more top performing machine learning models applied to the ensembled feature set.
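One way to rank features by their impact on performance is permutation importance: shuffle one feature column at a time and record how much the performance drops. The text here does not say which impact measure is used, so this technique is an assumption chosen for illustration. The names `permutation_impact`, `toy_model`, `GENE_A`, and `GENE_B` are hypothetical, and accuracy stands in for whichever performance metric is chosen.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_impact(model, rows, labels, feature_names, seed=0):
    """Rank features by the accuracy lost when each column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    impact = {}
    for j, name in enumerate(feature_names):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        impact[name] = baseline - accuracy(model, permuted, labels)
    ranking = sorted(impact, key=impact.get, reverse=True)
    return ranking, impact

def toy_model(row):
    # Hypothetical classifier that only looks at the first feature.
    return int(row[0] > 0.5)

rows = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.7], [0.2, 0.2], [0.7, 0.9], [0.3, 0.4]]
labels = [1, 0, 1, 0, 1, 0]
ranking, impact = permutation_impact(toy_model, rows, labels, ["GENE_A", "GENE_B"])
print(ranking, impact)
```

Because the toy classifier ignores the second column, shuffling it cannot change the accuracy, so its impact is exactly zero.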
- the plurality of subsets of sequence data may be generated by a random partitioning of the set of sequence data.
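A random partitioning of the features might look like the following sketch. It assumes the subsets are disjoint and of near-equal size, which the text does not require; `partition_features` and the fixed seed are illustrative choices, and the gene symbols are borrowed from the feature sets listed later in this document purely as sample labels.

```python
import random

def partition_features(feature_names, n_subsets, seed=0):
    """Randomly split feature names into n_subsets disjoint groups."""
    shuffled = list(feature_names)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled names round-robin so group sizes differ by at most one.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

genes = ["ADRM1", "ARL2", "ATG10", "CCDC73", "CDH15", "FABP2", "ST20"]
subsets = partition_features(genes, n_subsets=3)
print(subsets)
```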
- the first performance and the second performance may include a recall, a precision, a specificity, an accuracy, and/or an area under the receiver operating characteristic curve (AUC-ROC).
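The metrics named above have standard definitions for a binary responder (1) versus non-responder (0) label. The sketch below computes them from scratch; it assumes both classes are present and that at least one sample is predicted positive (otherwise a denominator is zero), and the labels, scores, and 0.5 threshold are made-up example values.

```python
def classification_metrics(y_true, y_pred):
    """Recall, precision, specificity, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half), which equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics, auc)
```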
- a multifold generalization test may be performed on the ensembled feature set to generate the final feature set.
- the plurality of machine learning models may include a plurality of k-fold cross-validation models.
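For readers unfamiliar with k-fold cross-validation, the index bookkeeping can be sketched as follows. Each sample lands in exactly one held-out fold, and a model is trained on the remaining folds. The function name `k_fold_indices` and the unshuffled round-robin assignment are simplifying assumptions; in practice samples are usually shuffled or stratified by response label first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training indices are every fold except the held-out one.
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(train_idx), test_idx)
```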
- the plurality of machine learning models may be trained to predict the patient response to the first treatment for rheumatoid arthritis.
- the plurality of machine learning models may be trained to predict the patient response to a conventional disease-modifying antirheumatic drug (cDMARD).
- the conventional disease-modifying antirheumatic drug may include methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and/or mycophenolate.
- the plurality of machine learning models may be trained to predict the patient response to a biologic rheumatoid arthritis agent.
- the biologic rheumatoid arthritis agent may be a tumor necrosis factor-α (TNF-α) inhibitor, a B-cell inhibitor, an interleukin inhibitor, a selective costimulation modulator, and/or a Janus kinase (JAK) inhibitor.
- the biologic rheumatoid arthritis agent may be tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and/or upadacitinib.
- the set of sequence data may include ribonucleic acid (RNA) sequence data.
- the set of sequence data may include a gene expression level of each of a plurality of biomarker genes.
- the set of sequence data may be received in an annotated table.
- a machine learning model trained to determine, based at least on the final feature set, the patient response to the first treatment for the disease may be applied.
- the final feature set may include a gene expression level of a plurality of biomarker genes present in a biological sample obtained from a patient.
- an effective quantity of the first treatment or a second treatment may be administered based at least on the patient response to the first treatment for the disease.
- the first treatment may include a conventional disease-modifying antirheumatic drug (cDMARD) including methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and/or mycophenolate.
- the second treatment may include a biologic rheumatoid arthritis agent including a tumor necrosis factor-α (TNF-α) inhibitor, a B-cell inhibitor, an interleukin inhibitor, a selective costimulation modulator, and/or a Janus kinase (JAK) inhibitor.
- the second treatment may include a biologic rheumatoid arthritis agent including tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and/or upadacitinib.
- the first treatment may be a conventional disease-modifying antirheumatic drug (cDMARD).
- the final feature set may include the biomarker genes ADRM1, ARL2, ATG10, CCDC73, CDH15, FABP2, ST20, WNT9A, ZCCHC9, and IGLC7.
- the first treatment may be tocilizumab.
- the final feature set may include the biomarker genes XCR1, COLGALT2, TCN1, NPIPA3, LCK, HIST2H2AA3, and SHC3.
- the first treatment may be rituximab.
- the final feature set may include the biomarker genes AL355102.2, CARMIL1, CT62, DEFA1B, DSCC1, PAGE2, PPIAL4C, RFC5, RRNAD1, SLC6A3, and VCY.
- the patient response to the first treatment may include a refractory response to the first treatment.
- the first treatment may be tocilizumab.
- the final feature set may include the biomarker genes CDC20, COL11A2, CRYBA2, PXYLP1, REC114, SCH3, TCN1, and TUBB1.
- Implementations of the current subject matter can include, but are not limited to, methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features.
- computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors.
- a memory which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein.
- Computer-implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), a direct connection between one or more of the multiple computing systems, etc.
- FIG. 1 depicts a system diagram illustrating an example of a machine learning enabled treatment analysis system, in accordance with some example embodiments.
- FIG. 2A depicts a flowchart illustrating an example of a biomarker selection process, in accordance with some example embodiments.
- FIG. 2B depicts a flowchart illustrating another example of a biomarker selection process, in accordance with some example embodiments.
- FIG. 3A depicts graphs illustrating a performance comparison of an example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
- FIG. 3B depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
- FIG. 3C depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
- FIG. 3D depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
- FIG. 4 depicts a flowchart illustrating an example of a process for selecting biomarkers for machine learning enabled prediction of treatment response, in accordance with some example embodiments.
- FIG. 5 depicts a block diagram illustrating an example of a computing system, in accordance with some example embodiments.
- similar reference numbers denote similar structures, features, or elements.
- Machine learning models may be trained to perform a variety of cognitive tasks including, for example, object identification, natural language processing, information retrieval, speech recognition, and/or the like.
- a deep learning model such as, for example, a neural network, may be trained to perform a classification task by at least assigning input samples to one or more categories.
- the deep learning model may be trained to perform the classification task based on training data that has been labeled in accordance with the known category membership of each sample included in the training data.
- the deep learning model may be trained to perform a regression task.
- the regression task may require the deep learning model to predict, based at least on variations in one or more independent variables, corresponding changes in one or more dependent variables.
- a machine learning model may be trained to predict patient responses, including the likelihood of a patient exhibiting a response and/or a refractory response, to a treatment for a disease.
- a machine learning model may be trained to predict patient response to various treatments for rheumatoid arthritis such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents.
- the performance of the machine learning model, including the predictive accuracy of the machine learning model, may be contingent upon the features used by the machine learning model to derive its predictions.
- the machine learning model may require a smaller number of highly relevant features in order to produce high accuracy predictions.
- the features used by the machine learning model to predict patient response to a treatment for a disease may be identified by ensembling features, such as biomarker genes, having a largest impact on the performance of one or more top performing machine learning models applied to a variety of different feature sets.
- the features used by the machine learning model to predict patient response to a treatment for a disease may be identified by applying multiple machine learning models, such as an x-quantity of k-fold cross validation models, to a variety of different feature sets.
- Each feature set in this case may be a subset of features generated by a random partitioning of a set of sequence data, such as ribonucleic acid (RNA) sequence data and/or the like. Doing so may minimize the batch effect that may be present in the sequence data as individual features are distributed randomly across the resulting feature sets.
- An ensembled feature set, which may be used by the machine learning model to predict patient response to the treatment for the disease, may be generated to include features having the largest impact on the performance of the top performing machine learning models (e.g., a y-quantity of the top performing machine learning models) applied to the different feature sets.
- the ensembled feature set may undergo further selection to generate a final feature set for use by the machine learning model to predict patient response to the treatment for the disease.
- multiple machine learning models, such as the x-quantity of k-fold cross validation (CV) models, may be applied to predict, based on the ensembled feature set, patient response to the treatment for the disease.
- the final feature set may be generated to include features, such as biomarker genes, having the largest impact on the performance of the top performing machine learning models (e.g., a y-quantity of the top performing machine learning models) applied to the ensembled feature set.
- a multifold generalizability test may be applied to the features identified as having the largest impact on the performance of the top performing machine learning models applied to the ensembled feature set.
- the multifold generalizability test may be performed to ensure that the features included in the final feature set enable the machine learning model to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the model.
- one or more machine learning models may be applied to the ensemble feature set. The resulting best performing machine learning models may be selected for their performance.
- FIG. 1 depicts a system diagram illustrating an example of a machine learning enabled treatment analysis system 100, in accordance with some example embodiments.
- the machine learning enabled treatment analysis system 100 may include a machine learning controller 110, a treatment analysis engine 120, and a client device 130.
- the machine learning controller 110, the treatment analysis engine 120, and the client device 130 may be communicatively coupled via a network 140.
- the client device 130 may be a processor-based device including, for example, a workstation, a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable apparatus, and/or the like.
- the network 140 may be a wired network and/or a wireless network including, for example, a local area network (LAN), a virtual local area network (VLAN), a wide area network (WAN), a public land mobile network (PLMN), the Internet, and/or the like.
- the treatment analysis engine 120 may include a machine learning model 125 trained to predict patient responses, including the likelihood of a patient exhibiting a response and/or a refractory response, to a treatment for a disease.
- the machine learning model 125 may be trained to predict patient responses based on a feature set 115, which may include a set of biomarker genes. Individual patient response to a particular treatment may be determined based on the gene expression level of each biomarker gene included in the feature set 115. Accordingly, the machine learning model 125 may be trained to predict, based at least on the gene expression level of each biomarker gene associated with a patient, the response that the patient is likely to have to a treatment for a disease.
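As a non-limiting illustration, the mapping from gene expression levels to a predicted likelihood of response can be sketched with a logistic model, which is one of the model types the description lists. The function name, the gene labels `GENE_A` and `GENE_B`, the weights, and the bias are hypothetical placeholders chosen for the example; they are not values from the source.

```python
import math

def predict_response_probability(expression, weights, bias=0.0):
    """Return a logistic estimate of the likelihood of response.

    expression: dict mapping biomarker gene name -> expression level
    weights:    dict mapping biomarker gene name -> learned coefficient
    (All names and numbers used below are hypothetical placeholders.)
    """
    # Weighted sum over the biomarker genes in the feature set.
    score = bias + sum(weights[gene] * expression[gene] for gene in weights)
    # Logistic (sigmoid) function maps the score to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical two-gene feature set and one patient's expression levels.
weights = {"GENE_A": 1.2, "GENE_B": -0.7}
patient = {"GENE_A": 2.0, "GENE_B": 1.0}
probability = predict_response_probability(patient, weights)
print(f"predicted likelihood of response: {probability:.3f}")
```

In this sketch only the genes present in `weights` contribute to the prediction, which mirrors the idea that the model operates on the selected feature set rather than on the full sequence data.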
- the output of the machine learning model 125, including the predicted responses of one or more patients, may be displayed as a part of a user interface 135 at the client device 130.
- the machine learning model 125 may be trained to predict patient response to various treatments for rheumatoid arthritis such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents.
- Examples of conventional disease-modifying antirheumatic drugs (cDMARDs) include methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and mycophenolate.
- Biologic rheumatoid arthritis agents may include, for example, tumor necrosis factor-α (TNF-α) inhibitors, B-cell inhibitors, interleukin inhibitors, selective co-stimulation modulators, and Janus kinase (JAK) inhibitors.
- biologic rheumatoid arthritis agents include tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and upadacitinib.
- the performance of the machine learning model 125 may be contingent upon the features used by the machine learning model 125 to derive its predictions.
- the machine learning model 125 may require a smaller number of highly relevant features, such as biomarker genes, in order to produce high accuracy predictions.
- the machine learning controller 110 may be configured to generate the feature set 115 for use by the machine learning model 125 to predict patient responses to a treatment for a disease, such as the likelihood of a rheumatoid arthritis patient exhibiting a response and/or a refractory response to conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
- the machine learning controller 110 may generate the feature set 115 by ensembling features, such as biomarker genes, having a largest impact on the performance of one or more top performing machine learning models applied to a variety of different feature sets, each of which includes a different combination of features.
- FIG. 2A depicts a flowchart illustrating an example of a biomarker selection process 200, in accordance with some example embodiments.
- the machine learning controller 110 may generate multiple feature sets by randomly partitioning a sequence dataset 215, such as a set of ribonucleic acid (RNA) sequence data, into multiple subsets 225.
- the sequence dataset 215 may originate from a sequencing platform 210 performing next generation sequencing (NGS).
- the machine learning controller 110 may partition the sequence dataset 215 randomly in order to distribute individual features randomly across the resulting subsets 225, thereby minimizing the batch effect that may be present in the sequence dataset 215.
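As a non-limiting illustration, the random partitioning described above can be sketched as follows. The sketch assumes the partition is made over feature names (for example, gene identifiers), with each feature assigned to exactly one subset; the function name `partition_features`, the seed, and the placeholder gene names are assumptions made for the example.

```python
import random

def partition_features(feature_names, n_subsets, seed=0):
    """Randomly split feature names into n_subsets disjoint, near-equal subsets."""
    shuffled = list(feature_names)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled features round-robin so subset sizes differ by at most one.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

# Placeholder feature names standing in for genes in the sequence dataset.
genes = [f"gene_{i}" for i in range(23)]
subsets = partition_features(genes, n_subsets=5)
print([len(subset) for subset in subsets])
```

Because the order is shuffled before the features are dealt out, features that were adjacent in the original dataset (and might share a batch artifact) are unlikely to land in the same subset together.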
- the machine learning controller 110 may apply multiple machine learning models, such as an x-quantity of k-fold cross validation (CV) models, to each of the subsets 225. That is, the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2A) may be applied to predict, based on each of the subsets 225, patient responses to the treatment for the disease. For each subset 225, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2A) may be identified.
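As a non-limiting illustration, selecting a y-quantity of top performing models from their cross validation results can be sketched as follows. The model names and per-fold scores are hypothetical, and the sketch assumes a metric for which higher values are better (such as AUC-ROC); for a metric such as log-loss the ranking would be reversed.

```python
from statistics import mean

def top_models(cv_scores, y):
    """Return the names of the y models with the highest mean CV score."""
    # Rank model names by their mean score across folds, best first.
    ranked = sorted(cv_scores, key=lambda name: mean(cv_scores[name]), reverse=True)
    return ranked[:y]

# Hypothetical per-fold scores for three candidate models on one subset.
cv_scores = {
    "model_a": [0.71, 0.69, 0.73],
    "model_b": [0.80, 0.78, 0.82],
    "model_c": [0.65, 0.70, 0.66],
}
best = top_models(cv_scores, y=2)
print(best)
```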
- Model performance in this case may be evaluated based on a variety of performance metrics including, for example, a log-loss metric, a sensitivity, a recall, a precision, a specificity, an accuracy, an area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
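As a non-limiting illustration, several of the performance metrics named above can be computed from binary labels and predictions as follows. The function names and the example labels are assumptions made for the sketch; sensitivity and recall are the same quantity in this binary setting.

```python
import math

def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else 0.0

def classification_metrics(y_true, y_pred):
    """Compute sensitivity (recall), specificity, precision, and accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Hypothetical labels: 1 = responder, 0 = non-responder.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```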
- the features having the most impact on the performance of the top performing models for each subset 225 may be identified and aggregated to form an ensembled feature set 235.
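As a non-limiting illustration, aggregating the most impactful features from each subset into an ensembled feature set can be sketched as a union of per-subset top features. The function name, the `top_n` cutoff, and the importance values are hypothetical; the source does not state how many features are taken from each subset.

```python
def ensemble_features(importances_per_subset, top_n):
    """Union of the top_n most important features from each subset."""
    ensembled = set()
    for importances in importances_per_subset:
        # Rank this subset's features by importance, highest first.
        ranked = sorted(importances, key=importances.get, reverse=True)
        ensembled.update(ranked[:top_n])
    return sorted(ensembled)

# Hypothetical importance scores for two disjoint subsets of features.
importances_per_subset = [
    {"g1": 0.30, "g2": 0.05, "g3": 0.12},
    {"g4": 0.01, "g5": 0.22, "g6": 0.18},
]
ensembled = ensemble_features(importances_per_subset, top_n=2)
print(ensembled)
```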
- the impact of a feature on the performance of a machine learning model may be evaluated based on a variety of importance metrics including, for example, permutation feature importance (e.g., based on the decrease in model performance), SHapley Additive exPlanations (SHAP) feature importance (e.g., based on magnitude of feature contribution), and/or the like.
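As a non-limiting illustration, permutation feature importance can be sketched without external libraries: one feature column is shuffled at a time and the resulting drop in a performance metric (accuracy here) is averaged over several repeats. The toy threshold model and the data are assumptions made for the example; in practice an established library implementation would typically be used.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def threshold_model(row):
    # Toy model: predicts a response from feature 0 only and ignores feature 1.
    return 1 if row[0] > 0.5 else 0

rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3], [0.1, 0.9], [0.7, 0.5], [0.3, 0.2]]
labels = [1, 1, 0, 0, 1, 0]
importances = permutation_importance(threshold_model, rows, labels)
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is zero, whereas shuffling the feature the model relies on degrades accuracy.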
- the ensembled feature set 235 may undergo further selection in order to generate the feature set 115 for use by the machine learning model 125 to predict patient response to a treatment for a disease.
- the machine learning controller 110 may apply the multiple machine learning models, such as the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2B), to predict, based at least on the ensembled feature set 235, patient responses to the treatment for the disease.
- the machine learning controller 110 may again identify, based on various performance metrics such as log-loss, sensitivity, recall, precision, specificity, accuracy, area under a receiver operating characteristics (AUC-ROC) curve, and/or the like, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2B) applied to the ensembled feature set 235. Moreover, the machine learning controller 110 may identify, based on various importance metrics such as permutation feature importance, SHapley Additive exPlanations (SHAP) feature importance, and/or the like, one or more features from the ensembled feature set 235 having the most impact on the performance of the machine learning model 125.
- the features from the ensembled feature set 235 identified as having the most impact on the performance of the machine learning model 125 may form a candidate feature set 245 that undergoes further selection before being added to the feature set 115.
- the machine learning controller 110 may apply, to the candidate feature set 245, a multifold generalizability test to identify features that enable the machine learning model 125 to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the machine learning model 125.
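The description does not spell out how the multifold generalizability test is computed. As a non-limiting illustration of one possible reading, the sketch below keeps a candidate feature only if its importance is positive in at least a minimum fraction of folds. The function name, the `min_fraction` threshold, and the per-fold importance values are all assumptions made for the example.

```python
def generalizable_features(fold_importances, min_fraction=0.8):
    """Keep features whose importance is positive in enough folds.

    fold_importances: dict mapping feature name -> list of per-fold importances
    min_fraction:     hypothetical threshold on the share of folds that must agree
    """
    kept = []
    for feature, scores in fold_importances.items():
        positive_folds = sum(1 for score in scores if score > 0)
        if positive_folds / len(scores) >= min_fraction:
            kept.append(feature)
    return kept

# Hypothetical per-fold importances for three candidate features.
fold_importances = {
    "g1": [0.10, 0.20, 0.15, 0.05, 0.12],
    "g3": [0.20, -0.10, 0.00, 0.30, -0.05],
    "g5": [0.08, 0.10, 0.00, 0.09, 0.11],
}
final_features = generalizable_features(fold_importances)
print(final_features)
```

A feature that helps on some folds and hurts on others (such as `g3` above) is dropped, on the reasoning that its contribution is unlikely to carry over to previously unseen data.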
- the feature set 115 may include fewer but more relevant features for predicting patient response to a treatment for a particular disease, such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents for the treatment of rheumatoid arthritis.
- FIGS. 3A-D depict graphs illustrating performance comparisons of the machine learning model 125 operating on different feature sets, in accordance with some example embodiments.
- the performance of the machine learning model 125 operating on two different feature sets, including a first set of biomarker genes selected by applying conventional techniques (Set A) and a second set of biomarker genes selected in accordance with various implementations of the ensembling techniques described herein (Set B), is evaluated based on performance metrics such as an area under a receiver operating characteristics (AUC-ROC) curve, specificity, and sensitivity.
- the second set of biomarker genes (Set B) includes fewer but more relevant biomarker genes than the first set of biomarker genes (Set A).
- in the example shown in FIG. 3A, the second set of biomarker genes (Set B) includes 10 biomarker genes while the first set of biomarker genes (Set A) includes 11 biomarker genes.
- in the example shown in FIG. 3B, the second set of biomarker genes (Set B) includes 7 biomarker genes whereas the first set of biomarker genes (Set A) includes 39 biomarker genes.
- in the example shown in FIG. 3C, the first set of biomarker genes (Set A) includes 40 biomarker genes while the second set of biomarker genes (Set B) includes 11 biomarker genes.
- in the example shown in FIG. 3D, the first set of biomarker genes (Set A) includes 53 biomarker genes while the second set of biomarker genes (Set B) includes 8 biomarker genes.
- the machine learning model 125 is able to achieve better performance when predicting patient response using the second set of biomarker genes (Set B) than when using the first set of biomarker genes (Set A).
- FIG. 3A shows that the machine learning model 125 achieved a 28.3% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to conventional disease-modifying antirheumatic drugs (cDMARDs) using the second set of biomarker genes (Set B) containing the biomarker genes ADRM1, ARL2, ATG10, CCDC73, CDH15, FABP2, ST20, WNT9A, ZCCHC9, and IGLC7.
- FIG. 3B shows that the machine learning model 125 achieved a 26.4% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to tocilizumab using the second set of biomarker genes (Set B) containing the biomarker genes XCR1, COLGALT2, TCN1, NPIPA3, LCK, HIST2H2AA3, and SHC3.
- FIG. 3C shows that the machine learning model 125 achieved a 29.7% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to rituximab using the second set of biomarker genes (Set B) containing the biomarker genes AL355102 2, CARMIL1, CT62, DEFA1B, DSCC1, PAGE2, PPIAL4C, RFC5, RRNAD1, SLC6A3, and VCY.
- FIG. 3D shows that the machine learning model 125 achieved a 33.3% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient refractory response to tocilizumab using the second set of biomarker genes (Set B) containing the biomarker genes CDC20, COL11A2, CRYBA2, PXYLP1, REC114, SCH3, TCN1, and TUBB1.
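As a non-limiting illustration of the AUC-ROC metric used in these comparisons, the area under the curve can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The labels and scores below are hypothetical and are not taken from the figures.

```python
def auc_roc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = responder) and model scores.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1]
auc = auc_roc(y_true, y_score)
print(auc)
```

This pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; rank-based formulations are usually preferred for larger datasets.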
- FIG. 4 depicts a flowchart illustrating an example of a process 400 for selecting biomarkers for machine learning enabled prediction of treatment response, in accordance with some example embodiments.
- the process 400 may be performed by the machine learning controller 110, for example, to generate the feature set 115 for use by the machine learning model 125 to predict patient responses to a treatment for a disease, such as the likelihood of a rheumatoid arthritis patient exhibiting a response and/or a refractory response to conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
- the machine learning controller 110 may partition a set of sequence data into a plurality of subsets of sequence data.
- the machine learning controller 110 may receive the sequence dataset 215, for example, from the sequencing platform 210 performing next generation sequencing (NGS).
- the machine learning controller 110 may partition the sequence dataset 215 into multiple subsets 225, which in the example shown in FIG. 2A includes the first subset 225a, the second subset 225b, the third subset 225c, the fourth subset 225d, and the fifth subset 225e.
- the machine learning controller 110 may partition the sequence dataset 215 randomly, which distributes individual features randomly across the resulting subsets 225.
- the machine learning controller 110 may apply, to each subset of sequence data, a plurality of machine learning models trained to predict a patient response to a treatment for a disease.
- the machine learning controller 110 may apply an x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2A) to predict, based on each of the subsets 225, patient responses to the treatment for the disease.
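As a non-limiting illustration of k-fold cross validation, the sketch below generates the train and test index splits for k folds, so that each sample is held out exactly once. The function name and the sample count are assumptions made for the example, and the sketch assumes the samples have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

# Ten folds (k = 10) over a hypothetical cohort of 25 patient samples.
folds = list(k_fold_indices(n_samples=25, k=10))
print([len(test) for _, test in folds])
```

A model is trained on each `train` split and scored on the matching `test` split, and the k scores are then averaged to give the cross validation performance for that model.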
- the machine learning controller 110 may generate an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models predicting patient response based on each subset of sequence data.
- the machine learning controller 110 may identify, for each subset 225, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2A).
- model performance may be evaluated based on a variety of performance metrics including, for example, a recall, a precision, a specificity, an accuracy, an area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
- the features having the most impact on the performance of the top performing models for each subset 225 may be identified and aggregated to form the ensembled feature set 235.
- the impact of a feature on the performance of a machine learning model may be evaluated based on a variety of importance metrics including, for example, permutation feature importance (e.g., based on the decrease in model performance), SHapley Additive exPlanations (SHAP) feature importance (e.g., based on magnitude of feature contribution), and/or the like.
- the machine learning controller 110 may apply the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the treatment for the disease.
- the machine learning controller 110 may subject the ensembled feature set 235 to further selection in order to generate the feature set 115 for use by the machine learning model 125 to predict patient response to a treatment for a disease.
- the machine learning controller 110 may apply the multiple machine learning models, such as the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2B), to predict, based at least on the ensembled feature set 235, patient responses to the treatment for the disease.
- a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2B) applied to the ensembled feature set 235 may be identified based on various performance metrics such as recall, precision, specificity, accuracy, area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
- one or more features from the ensembled feature set 235 having the most impact on the performance of the machine learning model 125 may be identified based on various importance metrics such as permutation feature importance, SHapley Additive exPlanations (SHAP) feature importance, and/or the like.
- the machine learning controller 110 may generate a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models predicting patient response based on the ensembled feature set.
- the features from the ensembled feature set 235 identified as having the most impact on the performance of the machine learning model 125 may form a candidate feature set 245 that undergoes further selection before being added to the feature set 115. For instance, in the example shown in FIG.
- the machine learning controller 110 may apply, to the candidate feature set 245, a multifold generalizability test to identify features that enable the machine learning model 125 to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the machine learning model 125.
- the resulting feature set 115 may include fewer but more relevant features for predicting patient response to a treatment for a particular disease.
- the feature set 115 may include a panel of biomarker genes that the machine learning model 125 may use to predict whether a rheumatoid arthritis patient will exhibit a response and/or a refractory response to rheumatoid arthritis treatments such as conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
- one or more machine learning models may be used.
- the one or more machine learning models 125 may include one or more linear regression models, logistic regression models, gradient boosting models, random forest models, keras neural networks, neural networks, deep learning models, generalized linear models, light gradient boosting classifiers, extreme gradient boosting classifiers, elastic net classifiers, or the like.
- the plurality of machine learning models may include machine learning models with similar architectures but different parameters.
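As a non-limiting illustration, a plurality of model configurations that share an architecture but differ in their parameters can be enumerated from a parameter grid. The parameter names and values below are hypothetical; the grid is sized to yield 32 configurations only to echo the 32-model example mentioned earlier, and the source does not state that its models are generated this way.

```python
from itertools import product

def expand_grid(grid):
    """Return one parameter dict per combination of the grid's values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

# Hypothetical parameters for a gradient boosting style model.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [100, 300],
    "subsample": [0.8, 1.0],
    "min_child_weight": [1, 5],
}
configs = expand_grid(param_grid)
print(len(configs))
```

Each resulting dictionary could then be passed to a model constructor and evaluated with k-fold cross validation, after which the top performing configurations are retained.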
- the process illustrated in FIG. 4 can be applied to applications such as rheumatoid arthritis, systemic lupus, systemic lupus erythematosus (SLE), osteoarthritis, and/or connective tissue diseases.
- sequence data may include proteomics data and/or RNA data.
- the process illustrated in FIG. 4 can be used for prediction, diagnosis, and/or monitoring applications.
- FIG. 5 depicts a block diagram illustrating an example of computing system 500, in accordance with some example embodiments.
- the computing system 500 may be used to implement the machine learning controller 110, the treatment analysis engine 120, the client device 130, and/or any components therein.
- the computing system 500 can include a processor 510, a memory 520, a storage device 530, and input/output devices 540.
- the processor 510, the memory 520, the storage device 530, and the input/output devices 540 can be interconnected via a system bus 550.
- the processor 510 is capable of processing instructions for execution within the computing system 500. Such executed instructions can implement one or more components of, for example, the machine learning controller 110, the treatment analysis engine 120, the client device 130, and/or the like.
- the processor 510 can be a single-threaded processor. Alternately, the processor 510 can be a multi-threaded processor.
- the processor 510 is capable of processing instructions stored in the memory 520 and/or on the storage device 530 to display graphical information for a user interface provided via the input/output device 540.
- the memory 520 is a computer-readable medium, such as volatile or non-volatile memory, that stores information within the computing system 500.
- the memory 520 can store data structures representing configuration object databases, for example.
- the storage device 530 is capable of providing persistent storage for the computing system 500.
- the storage device 530 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means.
- the input/output device 540 provides input/output operations for the computing system 500.
- the input/output device 540 includes a keyboard and/or pointing device.
- the input/output device 540 includes a display unit for displaying graphical user interfaces.
- the input/output device 540 can provide input/output operations for a network device.
- the input/output device 540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
- the computing system 500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats.
- the computing system 500 can be used to execute any type of software applications.
- These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc.
- the applications can include various add-in functionalities or can be standalone computing products and/or functionalities.
- the functionalities can be used to generate the user interface provided via the input/output device 540.
- the user interface can be generated and presented to a user by the computing system 500 (e.g., on a computer screen monitor, etc.).
- One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof.
- These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- the programmable system or computing system may include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- the term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
- the machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium.
- the machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
- one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer.
- Feedback provided to the user can be any form of sensory feedback, for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
- Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
- Phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features.
- The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features.
- The phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.”
- A similar interpretation is also intended for lists including three or more items.
- The phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.”
- Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
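The reading of “at least one of A, B, and C” given above amounts to every non-empty combination of the listed items. A minimal sketch of that enumeration follows; the helper name `readings` is hypothetical and is not taken from the application.

```python
from itertools import combinations

def readings(items):
    """Return every non-empty combination of the listed items, smallest first."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]

# Print each permitted reading of "at least one of A, B, and C".
for combo in readings(["A", "B", "C"]):
    print(" and ".join(combo))
```

For a list of n items this yields 2^n - 1 readings, which gives three readings for “A and/or B” and seven for “A, B, and/or C”.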
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Theoretical Computer Science (AREA)
- Primary Health Care (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Pathology (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Molecular Biology (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
A method may include partitioning a sequence dataset into a plurality of sequence data subsets, each including a different set of features comprising biomarker genes. A plurality of machine learning models may be applied to each sequence data subset. Each machine learning model may be trained to predict, based on the sequence data subset, a patient response to a treatment for a disease. A pooled feature set may be generated to include the features having the greatest impact on a first performance of the best-performing machine learning models applied to each sequence data subset. The plurality of machine learning models may be applied to predict, based on the pooled feature set, the patient response to the first treatment for the disease.
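The flow in the abstract (partition, apply several models per subset, keep the most impactful features of the best model, pool, re-apply) can be sketched as below. This is only an illustration under stated assumptions, not the applicants' implementation: the two toy classifiers, the leave-one-feature-out measure of "impact", the 3-fold split, the synthetic values, and all names (`DATA`, `SUBSETS`, `pooled_feature_set`, the gene labels `g0` to `g5`) are hypothetical.

```python
from statistics import mean

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(X, y):
    """Toy model 1: predict the label whose class centroid is closest."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return lambda row: min(centroids, key=lambda c: sq_dist(row, centroids[c]))

def one_nearest_neighbor(X, y):
    """Toy model 2: predict the label of the closest training sample."""
    return lambda row: y[min(range(len(X)), key=lambda j: sq_dist(row, X[j]))]

MODELS = {"nearest_centroid": nearest_centroid, "one_nn": one_nearest_neighbor}

def cv_accuracy(fit, data, y, features, k=3):
    """k-fold cross-validated accuracy of one model on the chosen columns."""
    n = len(y)
    rows = [[data[f][i] for f in features] for i in range(n)]
    correct = 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        predict = fit([rows[i] for i in train], [y[i] for i in train])
        correct += sum(predict(rows[i]) == y[i] for i in test)
    return correct / n

def top_features(fit, data, y, features, top_n):
    """Assumed impact measure: drop in accuracy when a feature is left out."""
    full = cv_accuracy(fit, data, y, features)
    drop = {f: full - cv_accuracy(fit, data, y, [g for g in features if g != f])
            for f in features}
    return sorted(features, key=lambda f: drop[f], reverse=True)[:top_n]

def pooled_feature_set(models, data, y, subsets, top_n=1):
    """Pool the top features of the best-scoring model on each subset."""
    pooled = []
    for features in subsets:
        best = max(models, key=lambda m: cv_accuracy(models[m], data, y, features))
        for f in top_features(models[best], data, y, features, top_n):
            if f not in pooled:
                pooled.append(f)
    return pooled

# Synthetic data: 12 patients, label 1 = responder. g0 and g3 track the label;
# the other columns are label-independent noise.
N = 12
Y = [i % 2 for i in range(N)]
DATA = {
    "g0": [Y[i] + 0.1 * (i % 3 - 1) for i in range(N)],
    "g1": [0.2 * (i // 2 % 2) for i in range(N)],
    "g2": [0.2 * (i // 4 % 2) for i in range(N)],
    "g3": [Y[i] + 0.05 * (i % 4 - 1.5) for i in range(N)],
    "g4": [0.2 * (i // 4 % 2) for i in range(N)],
    "g5": [0.2 * (i // 6 % 2) for i in range(N)],
}
SUBSETS = [["g0", "g1", "g2"], ["g3", "g4", "g5"]]  # the partitioning step

pooled = pooled_feature_set(MODELS, DATA, Y, SUBSETS)
final_scores = {name: cv_accuracy(fit, DATA, Y, pooled) for name, fit in MODELS.items()}
print(pooled, final_scores)
```

Leave-one-feature-out ablation is used here only because it is deterministic and needs no external library; permutation importance or model-specific attribution scores would be equally plausible readings of "impact on performance".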
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263415500P | 2022-10-12 | 2022-10-12 | |
US63/415,500 | 2022-10-12 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024081737A1 (fr) | 2024-04-18 |
Family
ID=90670324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/076606 | Biomarker selection for machine learning enabled prediction of treatment response (WO2024081737A1, fr) | 2022-10-12 | 2023-10-11 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024081737A1 (fr) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180358132A1 (en) * | 2017-06-13 | 2018-12-13 | Alexander Bagaev | Systems and methods for identifying cancer treatments from normalized biomarker scores |
US20210280271A1 (en) * | 2019-06-27 | 2021-09-09 | Scipher Medicine Corporation | Developing classifiers for stratifying patients |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23878196; Country of ref document: EP; Kind code of ref document: A1 |