WO2024081737A1 - Biomarker selection for machine learning enabled prediction of treatment response - Google Patents

Biomarker selection for machine learning enabled prediction of treatment response

Info

Publication number
WO2024081737A1
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
sequence data
treatment
feature set
learning models
Application number
PCT/US2023/076606
Other languages
English (en)
Inventor
Giorgio CASABURI
Original Assignee
Exagen Inc.
Application filed by Exagen Inc.

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Definitions

  • the subject matter described herein relates generally to machine learning and more specifically to biomarker selection for machine learning based techniques for predicting patient response to various treatments.
  • a biological marker is a measurable indicator of a biological state or condition.
  • a biomarker may be a medical sign that is capable of being measured accurately and reproducibly to provide an objective indication of a medical state of a patient. Accordingly, biomarkers are distinct from medical symptoms, which are patient perceived and reported indications of health and illness. Biomarkers can be measured and evaluated in a variety of ways including, for example, using blood, urine, soft tissues, and/or the like. Moreover, biomarkers may be used to examine normal biological processes, pathogenic processes, as well as pharmacologic responses to therapeutic intervention.
  • a system that includes at least one processor and at least one memory.
  • the at least one memory may include program code that provides operations when executed by the at least one processor.
  • the operations may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
  • a method for biomarker selection for machine learning enabled prediction of treatment response may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
  • a computer program product including a non-transitory computer readable medium storing instructions.
  • the instructions may cause operations when executed by at least one data processor.
  • the operations may include: partitioning a set of sequence data into a plurality of subsets of sequence data, each subset of sequence data including a different set of features comprising biomarker genes; applying, to each subset of sequence data, a plurality of machine learning models, each machine learning model of the plurality of machine learning models being trained to predict, based at least on the subset of sequence data, a patient response to a first treatment for a disease; generating an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models applied to each subset of sequence data; applying the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the first treatment for the disease; and generating a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models applied to the ensembled feature set.
  • the one or more top performing machine learning models applied to each subset of sequence data may be identified.
  • a plurality of features included in each subset of sequence data may be ranked based on a respective impact of each feature on the first performance of the one or more top performing machine learning models applied to each subset of sequence data.
  • the one or more features included in the ensembled feature set may be ranked based on a respective impact of each feature on the second performance of the one or more top performing machine learning models applied to the ensembled feature set.
  • the plurality of subsets of sequence data may be generated by a random partitioning of the set of sequence data.
  • the first performance and the second performance may include a recall, a precision, a specificity, an accuracy, and/or an area under a receiver operating characteristics (AUC-ROC) curve.
  • a multifold generalization test may be performed on the ensembled feature set to generate the final feature set.
  • the plurality of machine learning models may include a plurality of k-fold cross-validation models.
  • the plurality of machine learning models may be trained to predict the patient response to the first treatment for rheumatoid arthritis.
  • the plurality of machine learning models may be trained to predict the patient response to a conventional disease-modifying antirheumatic drug (cDMARD).
  • the conventional disease-modifying antirheumatic drug may include methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and/or mycophenolate.
  • the plurality of machine learning models may be trained to predict the patient response to a biologic rheumatoid arthritis agent.
  • the biologic rheumatoid arthritis agent may be a tumor necrosis factor-α (TNF-α) inhibitor, a B-cell inhibitor, an interleukin inhibitor, a selective co-stimulation modulator, and/or a Janus kinase (JAK) inhibitor.
  • the biologic rheumatoid arthritis agent may be tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and/or upadacitinib.
  • the set of sequence data may include ribonucleic acid (RNA) sequence data.
  • the set of sequence data may include a gene expression level of each of a plurality of biomarker genes.
  • the set of sequence data may be received in an annotated table.
  • a machine learning model trained to determine, based at least on the final feature set, the patient response to the first treatment for the disease may be applied.
  • the final feature set may include a gene expression level of a plurality of biomarker genes present in a biological sample obtained from a patient.
  • an effective quantity of the first treatment or a second treatment may be administered based at least on the patient response to the first treatment for the disease.
  • the first treatment may include a conventional disease-modifying antirheumatic drug (cDMARD) including methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and/or mycophenolate.
  • the second treatment may include a biologic rheumatoid arthritis agent including a tumor necrosis factor-α (TNF-α) inhibitor, a B-cell inhibitor, an interleukin inhibitor, a selective co-stimulation modulator, and/or a Janus kinase (JAK) inhibitor.
  • the second treatment may include a biologic rheumatoid arthritis agent including tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and/or upadacitinib.
  • the first treatment may be a conventional disease-modifying antirheumatic drug (cDMARD).
  • the final feature set may include the biomarker genes ADRM1, ARL2, ATG10, CCDC73, CDH15, FABP2, ST20, WNT9A, ZCCHC9, and IGLC7.
  • the first treatment may be tocilizumab.
  • the final feature set may include the biomarker genes XCR1, COLGALT2, TCN1, NPIPA3, LCK, HIST2H2AA3, and SHC3.
  • the first treatment may be rituximab.
  • the final feature set may include the biomarker genes AL355102 2, CARMIL1, CT62, DEFA1B, DSCC1, PAGE2, PPIAL4C, RFC5, RRNAD1, SLC6A3, and VCY.
  • the patient response to the first treatment may include a refractory response to the first treatment.
  • the first treatment may be tocilizumab.
  • the final feature set may include the biomarker genes CDC20, COL11A2, CRYBA2, PXYLP1, REC114, SCH3, TCN1, and TUBB1.
  • Implementations of the current subject matter can include, but are not limited to, methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features.
  • computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors.
  • a memory which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein.
  • Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), a direct connection between one or more of the multiple computing systems, etc.
  • FIG. 1 depicts a system diagram illustrating an example of a machine learning enabled treatment analysis system, in accordance with some example embodiments.
  • FIG. 2A depicts a flowchart illustrating an example of a biomarker selection process, in accordance with some example embodiments.
  • FIG. 2B depicts a flowchart illustrating another example of a biomarker selection process, in accordance with some example embodiments.
  • FIG. 3A depicts graphs illustrating a performance comparison of an example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
  • FIG. 3B depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
  • FIG. 3C depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
  • FIG. 3D depicts graphs illustrating a performance comparison of another example of a machine learning predictive model operating on different feature sets, in accordance with some example embodiments.
  • FIG. 4 depicts a flowchart illustrating an example of a process for selecting biomarkers for machine learning enabled prediction of treatment response, in accordance with some example embodiments.
  • FIG. 5 depicts a block diagram illustrating an example of a computing system, in accordance with some example embodiments.
  • similar reference numbers denote similar structures, features, or elements.
  • Machine learning models may be trained to perform a variety of cognitive tasks including, for example, object identification, natural language processing, information retrieval, speech recognition, and/or the like.
  • a deep learning model such as, for example, a neural network, may be trained to perform a classification task by at least assigning input samples to one or more categories.
  • the deep learning model may be trained to perform the classification task based on training data that has been labeled in accordance with the known category membership of each sample included in the training data.
  • the deep learning model may be trained to perform a regression task.
  • the regression task may require the deep learning model to predict, based at least on variations in one or more independent variables, corresponding changes in one or more dependent variables.
  • a machine learning model may be trained to predict patient responses, including the likelihood of a patient exhibiting a response and/or a refractory response, to a treatment for a disease.
  • a machine learning model may be trained to predict patient response to various treatments for rheumatoid arthritis such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents.
  • the performance of the machine learning model, including the predictive accuracy of the machine learning model, may be contingent upon the features used by the machine learning model to derive its predictions.
  • the machine learning model may require a smaller number of highly relevant features in order to produce high accuracy predictions.
  • the features used by the machine learning model to predict patient response to a treatment for a disease may be identified by ensembling features, such as biomarker genes, having a largest impact on the performance of one or more top performing machine learning models applied to a variety of different feature sets.
  • the features used by the machine learning model to predict patient response to a treatment for a disease may be identified by applying multiple machine learning models, such as an x-quantity of k-fold cross validation models, to a variety of different feature sets.
  • Each feature set in this case may be a subset of features generated by a random partitioning of a set of sequence data, such as ribonucleic acid (RNA) sequence data and/or the like. Doing so may minimize the batch effect that may be present in the sequence data as individual features are distributed randomly across the resulting feature sets.
  • An ensembled feature set, which may be used by the machine learning model to predict patient response to the treatment for the disease, may be generated to include features having the largest impact on the performance of the top performing machine learning models (e.g., a y-quantity of the top performing machine learning models) applied to the different feature sets.
  • the ensembled feature set may undergo further selection to generate a final feature set for use by the machine learning model to predict patient response to the treatment for the disease.
  • multiple machine learning models such as the x-quantity of k-fold cross validation (CV) models, may be applied to predict, based on the ensembled feature set, patient response to the treatment for the disease.
  • the final feature set may be generated to include features, such as biomarker genes, having the largest impact on the performance of the top performing machine learning models (e.g., a y-quantity of the top performing machine learning models) applied to the ensembled feature set.
  • a multifold generalizability test may be applied to the features identified as having the largest impact on the performance of the top performing machine learning models applied to the ensembled feature set.
  • the multifold generalizability test may be performed to ensure that the features included in the final feature set enable the machine learning model to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the model.
  • one or more machine learning models may be applied to the ensembled feature set. The best performing of the resulting machine learning models may then be selected.
  • FIG. 1 depicts a system diagram illustrating an example of a machine learning enabled treatment analysis system 100, in accordance with some example embodiments.
  • the machine learning enabled treatment analysis system 100 may include a machine learning controller 110, a treatment analysis engine 120, and a client device 130.
  • the machine learning controller 110, the treatment analysis engine 120, and the client device 130 may be communicatively coupled via a network 140.
  • the client device 130 may be a processor-based device including, for example, a workstation, a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable apparatus, and/or the like.
  • the network 140 may be a wired network and/or a wireless network including, for example, a local area network (LAN), a virtual local area network (VLAN), a wide area network (WAN), a public land mobile network (PLMN), the Internet, and/or the like.
  • the treatment analysis engine 120 may include a machine learning model 125 trained to predict patient responses, including the likelihood of a patient exhibiting a response and/or a refractory response, to a treatment for a disease.
  • the machine learning model 125 may be trained to predict patient responses based on a feature set 115, which may include a set of biomarker genes. Individual patient response to a particular treatment may be determined based on the gene expression level of each biomarker gene included in the feature set 115. Accordingly, the machine learning model 125 may be trained to predict, based at least on the gene expression level of each biomarker gene associated with a patient, the response that the patient is likely to have to a treatment for a disease.
  • the output of the machine learning model 125, including the predicted responses of one or more patients, may be displayed as a part of a user interface 135 at the client device 130.
  • the machine learning model 125 may be trained to predict patient response to various treatments for rheumatoid arthritis such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents.
  • Examples of conventional disease-modifying antirheumatic drugs (cDMARDs) include methotrexate, leflunomide, sulfasalazine, hydroxychloroquine, apremilast, azathioprine, ciclosporin, cyclophosphamide, and mycophenolate.
  • Biologic rheumatoid arthritis agents may include, for example, tumor necrosis factor-α (TNF-α) inhibitors, B-cell inhibitors, interleukin inhibitors, selective co-stimulation modulators, and Janus kinase (JAK) inhibitors.
  • biologic rheumatoid arthritis agents include tocilizumab, rituximab, abatacept, adalimumab, anakinra, belimumab, canakinumab, certolizumab, etanercept, golimumab, infliximab, ixekizumab, sarilumab, secukinumab, ustekinumab, baricitinib, tofacitinib, and upadacitinib.
  • the performance of the machine learning model 125 may be contingent upon the features used by the machine learning model 125 to derive its predictions.
  • the machine learning model 125 may require a smaller number of highly relevant features, such as biomarker genes, in order to produce high accuracy predictions.
  • the machine learning controller 110 may be configured to generate the feature set 115 for use by the machine learning model 125 to predict patient responses to a treatment for a disease, such as the likelihood of a rheumatoid arthritis patient exhibiting a response and/or a refractory response to conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
  • the machine learning controller 110 may generate the feature set 115 by ensembling features, such as biomarker genes, having a largest impact on the performance of one or more top performing machine learning models applied to a variety of different feature sets, each of which includes a different combination of features.
  • FIG. 2A depicts a flowchart illustrating an example of a biomarker selection process 200, in accordance with some example embodiments.
  • the machine learning controller 110 may generate multiple feature sets by randomly partitioning a sequence dataset 215, such as a set of ribonucleic acid (RNA) sequence data, into multiple subsets 225.
  • the sequence dataset 215 may originate from a sequencing platform 210 performing next generation sequencing (NGS).
  • the machine learning controller 110 may partition the sequence dataset 215 randomly in order to distribute individual features randomly across the resulting subsets 225, thereby minimizing the batch effect that may be present in the sequence dataset 215.
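The random partitioning step described above can be sketched as follows. This is a minimal illustration that assumes the subsets are disjoint and of near-equal size; the function name `partition_features`, the seed, and the `GENE_i` identifiers are hypothetical and do not come from the source.

```python
import random

def partition_features(feature_names, n_subsets, seed=0):
    """Randomly split feature names into n_subsets disjoint, near-equal subsets."""
    shuffled = list(feature_names)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled features round-robin so subset sizes differ by at most one.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

# Hypothetical gene identifiers standing in for the columns of the sequence dataset.
genes = [f"GENE_{i}" for i in range(10)]
subsets = partition_features(genes, n_subsets=3)
for index, subset in enumerate(subsets):
    print(index, subset)
```

Shuffling before dealing is what spreads individual features at random across the subsets; a fixed seed is used here only so the sketch is reproducible.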
  • the machine learning controller 110 may apply multiple machine learning models, such as an x-quantity of k-fold cross validation (CV) models, to each of the subsets 225. That is, the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2A) may be applied to predict, based on each of the subsets 225, patient responses to the treatment for the disease. For each of the subsets 225, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2A) may be identified.
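One way to picture scoring several candidate models with k-fold cross validation and keeping the top y is sketched below. The candidate models (a per-feature threshold rule and a majority-class baseline), the toy data, and all function names are assumptions made for illustration; the source does not specify which model families are used, and accuracy stands in for whichever performance metric is chosen.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cv_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy of one candidate model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return mean(scores)

def threshold_on(j):
    """Hypothetical candidate: cut feature j at the midpoint of the class means."""
    def fit(X_train, y_train):
        mean0 = mean([x[j] for x, label in zip(X_train, y_train) if label == 0])
        mean1 = mean([x[j] for x, label in zip(X_train, y_train) if label == 1])
        cut, high = (mean0 + mean1) / 2, int(mean1 >= mean0)
        return lambda x: high if x[j] >= cut else 1 - high
    return fit

def majority_class(X_train, y_train):
    """Hypothetical baseline: always predict the most common training label."""
    label = int(sum(y_train) * 2 >= len(y_train))
    return lambda x: label

def top_models(candidates, X, y, top_y, k=5):
    """Score every candidate by k-fold CV and return the top_y names plus all scores."""
    scores = {name: cv_accuracy(fit, X, y, k) for name, fit in candidates.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_y], scores

# Toy data: feature 0 separates responders (1) from non-responders (0); feature 1 does not.
y = [i % 2 for i in range(20)]
X = [[2.0 * (i % 2) + 0.1 * (i % 5), float((i * 7) % 11)] for i in range(20)]
candidates = {
    "threshold_gene0": threshold_on(0),
    "threshold_gene1": threshold_on(1),
    "majority": majority_class,
}
best, scores = top_models(candidates, X, y, top_y=2)
print(best, scores)
```

In the example from the source, x would be 32, k would be 10, and y would be 5; the sketch uses smaller values so it runs on twenty toy samples.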
  • Model performance in this case may be evaluated based on a variety of performance metrics including, for example, a log-loss metric, a sensitivity, a recall, a precision, a specificity, an accuracy, an area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
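For reference, several of the listed performance metrics can be computed from binary labels as in the sketch below. The function names and the example labels and scores are illustrative assumptions; the sketch also assumes both classes are present and at least one positive prediction is made, so no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity (recall), specificity, precision, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = responder), hard predictions, and predicted scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, y_score)
print(metrics, auc)
```

The pairwise formulation of AUC-ROC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large cohorts.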
  • the features having the most impact on the performance of the top performing models for each subset 225 may be identified and aggregated to form an ensembled feature set 235.
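The aggregation into an ensembled feature set might look like the following sketch. It assumes that each top performing model reports an importance score per feature and that the top `top_n` features of every such model are pooled by union; the cutoff `top_n`, the function name, and the `GENE_*` identifiers are hypothetical.

```python
def ensemble_features(importances_by_subset, top_n):
    """Pool the top_n most important features of every top model across all subsets.

    importances_by_subset holds one list per subset; each list holds one
    {feature: importance} mapping per top performing model for that subset.
    """
    ensembled = []
    for model_importances in importances_by_subset:
        for importances in model_importances:
            ranked = sorted(importances, key=importances.get, reverse=True)
            for feature in ranked[:top_n]:
                if feature not in ensembled:  # keep first occurrence, drop repeats
                    ensembled.append(feature)
    return ensembled

# Hypothetical importance scores for two subsets.
importances_by_subset = [
    [
        {"GENE_A": 0.30, "GENE_B": 0.05, "GENE_C": 0.20},
        {"GENE_A": 0.25, "GENE_B": 0.10, "GENE_C": 0.02},
    ],
    [
        {"GENE_D": 0.01, "GENE_E": 0.40, "GENE_F": 0.15},
    ],
]
ensembled = ensemble_features(importances_by_subset, top_n=2)
print(ensembled)  # ['GENE_A', 'GENE_C', 'GENE_B', 'GENE_E', 'GENE_F']
```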
  • the impact of a feature on the performance of a machine learning model may be evaluated based on a variety of importance metrics including, for example, permutation feature importance (e.g., based on the decrease in model performance), SHapley Additive exPlanations (SHAP) feature importance (e.g., based on magnitude of feature contribution), and/or the like.
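Permutation feature importance, one of the importance metrics named above, can be sketched as the average drop in a score when a single feature column is shuffled. The fixed prediction rule, the toy data, the use of accuracy as the score, and the function names are assumptions for illustration only; SHAP is not shown.

```python
import random
from statistics import mean

def accuracy(predict, X, y):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean decrease in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(mean(drops))
    return importances

# Toy setup: the hypothetical model only looks at feature 0.
y = [i % 2 for i in range(20)]
X = [[2.0 * (i % 2) + 0.1 * (i % 5), float((i * 7) % 11)] for i in range(20)]
predict = lambda row: int(row[0] > 1.0)
importances = permutation_importance(predict, X, y)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column leaves the score unchanged and its importance is exactly zero, while shuffling feature 0 degrades the score.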
  • the ensembled feature set 235 may undergo further selection in order to generate the feature set 115 for use by the machine learning model 125 to predict patient response to a treatment for a disease.
  • the machine learning controller 110 may apply the multiple machine learning models, such as the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2B), to predict, based at least on the ensembled feature set 235, patient responses to the treatment for the disease.
  • the machine learning controller 110 may again identify, based on various performance metrics such as log-loss, sensitivity, recall, precision, specificity, accuracy, area under a receiver operating characteristics (AUC-ROC) curve, and/or the like, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2B) applied to the ensembled feature set 235. Moreover, the machine learning controller 110 may identify, based on various importance metrics such as permutation feature importance, SHapley Additive exPlanations (SHAP) feature importance, and/or the like, one or more features from the ensembled feature set 235 having the most impact on the performance of the machine learning model 125.
  • the features from the ensembled feature set 235 identified as having the most impact on the performance of the machine learning model 125 may form a candidate feature set 245 that undergoes further selection before being added to the feature set 115.
  • the machine learning controller 110 may apply, to the candidate feature set 245, a multifold generalizability test to identify features that enable the machine learning model 125 to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the machine learning model 125.
  • the feature set 115 may include fewer but more relevant features for predicting patient response to a treatment for a particular disease such as conventional disease-modifying antirheumatic drugs (cDMARDs) and biologic rheumatoid arthritis agents for the treatment of rheumatoid arthritis.
  • FIGS. 3A-D depict graphs illustrating performance comparisons of the machine learning model 125 operating on different feature sets, in accordance with some example embodiments.
  • the performance of the machine learning model 125 operating on two different feature sets, including a first set of biomarker genes selected by applying conventional techniques (Set A) and a second set of biomarker genes selected in accordance with various implementations of the ensembling techniques described herein (Set B), is evaluated based on performance metrics such as an area under a receiver operating characteristics (AUC-ROC) curve, specificity, and sensitivity.
  • the second set of biomarker genes (Set B) includes fewer but more relevant biomarker genes than the first set of biomarker genes (Set A).
  • in the example of FIG. 3A, the second set of biomarker genes (Set B) includes 10 biomarker genes while the first set of biomarker genes (Set A) includes 11 biomarker genes.
  • in the example of FIG. 3B, the second set of biomarker genes (Set B) includes 7 biomarker genes whereas the first set of biomarker genes (Set A) includes 39 biomarker genes.
  • in the example of FIG. 3C, the first set of biomarker genes (Set A) includes 40 biomarker genes while the second set of biomarker genes (Set B) includes 11 biomarker genes.
  • in the example of FIG. 3D, the first set of biomarker genes (Set A) includes 53 biomarker genes while the second set of biomarker genes (Set B) includes 8 biomarker genes.
  • the machine learning model 125 is able to achieve better performance when predicting patient response using the second set of biomarker genes (Set B) than when using the first set of biomarker genes (Set A).
  • FIG. 3A shows that the machine learning model 125 achieved a 28.3% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to conventional disease-modifying antirheumatic drugs (cDMARDs) using the second set of biomarker genes (Set B) containing the biomarker genes ADRM1, ARL2, ATG10, CCDC73, CDH15, FABP2, ST20, WNT9A, ZCCHC9, and IGLC7.
  • FIG. 3B shows that the machine learning model 125 achieved a 26.4% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to tocilizumab using the second set of biomarker genes (Set B) containing the biomarker genes XCR1, COLGALT2, TCN1, NPIPA3, LCK, HIST2H2AA3, and SHC3.
  • FIG. 3C shows that the machine learning model 125 achieved a 29.7% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient response to rituximab using the second set of biomarker genes (Set B) containing the biomarker genes AL355102 2, CARMIL1, CT62, DEFA1B, DSCC1, PAGE2, PPIAL4C, RFC5, RRNAD1, SLC6A3, and VCY.
  • FIG. 3D shows that the machine learning model 125 achieved a 33.3% higher area under a receiver operating characteristics (AUC-ROC) curve when predicting patient refractory response to tocilizumab using the second set of biomarker genes (Set B) containing the biomarker genes CDC20, COL11A2, CRYBA2, PXYLP1, REC114, SCH3, TCN1, and TUBB1.
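To make the AUC-ROC comparisons above concrete, the following sketch computes AUC-ROC as the fraction of (positive, negative) pairs that the model ranks correctly, then expresses the difference between two feature sets as a percentage. The source does not say whether the reported percentages are relative or absolute differences; `relative_improvement` assumes a relative reading, and both function names are hypothetical.

```python
def auc_roc(y_true, y_score):
    """AUC-ROC via pairwise ranking: ties between a positive and a negative count 0.5."""
    pos = [s for label, s in zip(y_true, y_score) if label == 1]
    neg = [s for label, s in zip(y_true, y_score) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_improvement(baseline, candidate):
    """Percentage change of candidate over baseline (assumed reading of 'X% higher')."""
    return 100.0 * (candidate - baseline) / baseline
```

The pairwise form is quadratic in the number of samples, which is acceptable for small cohorts; a sort-based rank formulation gives the same value for larger datasets.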
  • FIG. 4 depicts a flowchart illustrating an example of a process 400 for selecting biomarkers for machine learning enabled prediction of treatment response, in accordance with some example embodiments.
  • the process 400 may be performed by the machine learning controller 110, for example, to generate the feature set 115 for use by the machine learning model 125 to predict patient responses to a treatment for a disease, such as the likelihood of a rheumatoid arthritis patient exhibiting a response and/or a refractory response to conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
  • the machine learning controller 110 may partition a set of sequence data into a plurality of subsets of sequence data.
  • the machine learning controller 110 may receive the sequence dataset 215, for example, from the sequencing platform 210 performing next generation sequencing (NGS).
  • the machine learning controller 110 may partition the sequence dataset 215 into multiple subsets 225, which in the example shown in FIG. 2A includes the first subset 225a, the second subset 225b, the third subset 225c, the fourth subset 225d, and the fifth subset 225e.
  • the machine learning controller 110 may partition the sequence dataset 215 randomly, which distributes individual features randomly across the resulting subsets 225.
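A random partition of features into subsets, as described above, might look like the following sketch. The function name `partition_features`, the fixed seed, and the round-robin dealing that keeps subset sizes balanced are illustrative assumptions; the source states only that the partition is random.

```python
import random


def partition_features(features, n_subsets=5, seed=0):
    """Shuffle feature names and deal them into n_subsets disjoint subsets."""
    rng = random.Random(seed)  # seeded so the partition is reproducible
    shuffled = list(features)
    rng.shuffle(shuffled)
    # Round-robin slicing keeps subset sizes within one of each other.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

With five subsets this mirrors the five subsets 225a through 225e of FIG. 2A, each holding a different set of features.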
  • the machine learning controller 110 may apply, to each subset of sequence data, a plurality of machine learning models trained to predict a patient response to a treatment for a disease.
  • the machine learning controller 110 may apply an x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2A) to predict, based on each of the subsets 225, patient responses to the treatment for the disease.
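The k-fold cross validation mentioned above can be sketched as an index splitter. Reading "CV-10" as 10-fold cross validation is an assumption, as is the helper name `k_fold_indices`; the sketch uses contiguous folds without shuffling for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

Each model would be trained on `train_idx` and scored on `test_idx`, and the k scores averaged to give one performance estimate per model.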
  • the machine learning controller 110 may generate an ensembled feature set including one or more features having a largest impact on a first performance of one or more top performing machine learning models predicting patient response based on each subset of sequence data.
  • the machine learning controller 110 may identify, for each subset 225, a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2A).
  • model performance may be evaluated based on a variety of performance metrics including, for example, a recall, a precision, a specificity, an accuracy, an area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
  • the features having the most impact on the performance of the top performing models for each subset 225 may be identified and aggregated to form the ensembled feature set 235.
  • the impact of a feature on the performance of a machine learning model may be evaluated based on a variety of importance metrics including, for example, permutation feature importance (e.g., based on the decrease in model performance), SHapley Additive exPlanations (SHAP) feature importance (e.g., based on magnitude of feature contribution), and/or the like.
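Permutation feature importance and the aggregation into an ensembled feature set can be sketched as follows. This is a minimal illustration, not the source's implementation: the function names, the list-of-dicts data layout, the number of repeats, and the `top_n` cutoff are all assumptions.

```python
import random


def permutation_importance(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in score when one feature column is shuffled.

    predict: callable mapping a list of rows to a list of predictions.
    X: list of dict rows keyed by feature name (string keys).
    score: callable(y_true, y_pred) where higher is better.
    """
    rng = random.Random(seed)
    baseline = score(y, predict(X))
    importances = {}
    for feature in X[0]:
        drops = []
        for _ in range(n_repeats):
            column = [row[feature] for row in X]
            rng.shuffle(column)
            # Rebuild rows with only this feature's values permuted.
            shuffled = [dict(row, **{feature: v}) for row, v in zip(X, column)]
            drops.append(baseline - score(y, predict(shuffled)))
        importances[feature] = sum(drops) / n_repeats
    return importances


def ensemble_features(importances_per_model, top_n=3):
    """Union of the top_n most important features of each top performing model."""
    ensembled = set()
    for importances in importances_per_model:
        ranked = sorted(importances, key=importances.get, reverse=True)
        ensembled.update(ranked[:top_n])
    return sorted(ensembled)
```

A feature the model ignores receives an importance of zero, because shuffling it leaves every prediction unchanged.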
  • the machine learning controller 110 may apply the plurality of machine learning models to predict, based at least on the ensembled feature set, the patient response to the treatment for the disease.
  • the machine learning controller 110 may subject the ensembled feature set 235 to further selection in order to generate the feature set 115 for use by the machine learning model 125 to predict patient response to a treatment for a disease.
  • the machine learning controller 110 may apply the multiple machine learning models, such as the x-quantity of k-fold cross validation (CV) models (e.g., 32 CV-10 models in the example shown in FIG. 2B), to predict, based at least on the ensembled feature set 235, patient responses to the treatment for the disease.
  • a y-quantity of the top performing models (e.g., five top performing models in the example shown in FIG. 2B) applied to the ensembled feature set 235 may be identified based on various performance metrics such as recall, precision, specificity, accuracy, area under a receiver operating characteristics (AUC-ROC) curve, and/or the like.
  • one or more features from the ensembled feature set 235 having the most impact on the performance of the machine learning model 125 may be identified based on various importance metrics such as permutation feature importance, SHapley Additive exPlanations (SHAP) feature importance, and/or the like.
  • the machine learning controller 110 may generate a final feature set including a subset of features from the ensembled feature set having a largest impact on a second performance of one or more top performing machine learning models predicting patient response based on the ensembled feature set.
  • the features from the ensembled feature set 235 identified as having the most impact on the performance of the machine learning model 125 may form a candidate feature set 245 that undergoes further selection before being added to the feature set 115. For instance, in the example shown in FIG.
  • the machine learning controller 110 may apply, to the candidate feature set 245, a multifold generalizability test to identify features that enable the machine learning model 125 to maintain its performance when encountering new, previously unseen data drawn from the same distribution as the one used to create the machine learning model 125.
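The source does not describe how the multifold generalizability test is carried out. One possible reading, shown here purely as an assumption, is a stability filter: a candidate feature is kept only if its importance measured on held-out data is positive in a minimum fraction of folds. The name `stable_features` and the 0.8 threshold are hypothetical.

```python
def stable_features(fold_importances, min_fraction=0.8):
    """Keep features with positive held-out importance in enough folds.

    fold_importances: one dict per fold mapping feature name -> importance
    measured on that fold's held-out data (assumed input format).
    """
    n_folds = len(fold_importances)
    if n_folds == 0:
        return []
    features = set().union(*fold_importances)  # dicts iterate over their keys
    kept = []
    for feature in sorted(features):
        hits = sum(1 for imp in fold_importances if imp.get(feature, 0.0) > 0.0)
        if hits / n_folds >= min_fraction:
            kept.append(feature)
    return kept
```

Under this reading, a feature that helps on only a few folds is treated as unlikely to maintain model performance on unseen data and is dropped from the candidate feature set.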
  • the resulting feature set 115 may include fewer but more relevant features for predicting patient response to a treatment for a particular disease.
  • the feature set 115 may include a panel of biomarker genes that the machine learning model 125 may use to predict whether a rheumatoid arthritis patient will exhibit a response and/or a refractory response to rheumatoid arthritis treatments such as conventional disease-modifying antirheumatic drugs (cDMARDs), biologic rheumatoid arthritis agents, and/or the like.
  • one or more machine learning models may be used.
  • the one or more machine learning models 125 may include one or more linear regression models, logistic regression models, gradient boosting models, random forest models, Keras neural networks, neural networks, deep learning models, generalized linear models, light gradient boosting classifiers, extreme gradient boosting classifiers, elastic net classifiers, or the like.
  • the plurality of machine learning models may include machine learning models with similar architectures but different parameters.
  • the process illustrated in FIG. 4 can be applied to applications such as rheumatoid arthritis, systemic lupus, systemic lupus erythematosus (SLE), osteoarthritis, and/or connective tissue diseases.
  • sequence data may include proteomics data and/or RNA data.
  • the process illustrated in FIG. 4 can be used for prediction, diagnosis, and/or monitoring applications.
  • FIG. 5 depicts a block diagram illustrating an example of computing system 500, in accordance with some example embodiments.
  • the computing system 500 may be used to implement the machine learning controller 110, the treatment analysis engine 120, the client device 130, and/or any components therein.
  • the computing system 500 can include a processor 510, a memory 520, a storage device 530, and input/output devices 540.
  • the processor 510, the memory 520, the storage device 530, and the input/output devices 540 can be interconnected via a system bus 550.
  • the processor 510 is capable of processing instructions for execution within the computing system 500. Such executed instructions can implement one or more components of, for example, the machine learning controller 110, the treatment analysis engine 120, the client device 130, and/or the like.
  • the processor 510 can be a single-threaded processor. Alternately, the processor 510 can be a multi-threaded processor.
  • the processor 510 is capable of processing instructions stored in the memory 520 and/or on the storage device 530 to display graphical information for a user interface provided via the input/output device 540.
  • the memory 520 is a computer-readable medium, such as volatile or non-volatile memory, that stores information within the computing system 500.
  • the memory 520 can store data structures representing configuration object databases, for example.
  • the storage device 530 is capable of providing persistent storage for the computing system 500.
  • the storage device 530 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means.
  • the input/output device 540 provides input/output operations for the computing system 500.
  • the input/output device 540 includes a keyboard and/or pointing device.
  • the input/output device 540 includes a display unit for displaying graphical user interfaces.
  • the input/output device 540 can provide input/output operations for a network device.
  • the input/output device 540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
  • the computing system 500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats.
  • the computing system 500 can be used to execute any type of software applications.
  • These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc.
  • the applications can include various add-in functionalities or can be standalone computing products and/or functionalities.
  • the functionalities can be used to generate the user interface provided via the input/output device 540.
  • the user interface can be generated and presented to a user by the computing system 500 (e.g., on a computer screen monitor, etc.).
  • One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof.
  • These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • the programmable system or computing system may include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • the term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium.
  • the machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
  • one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer.
  • feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
  • phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features.
  • the term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features.
  • the phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.”
  • a similar interpretation is also intended for lists including three or more items.
  • the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.”
  • Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
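The reading of "at least one of A, B, and C" given above corresponds to every non-empty combination of the listed items. A short sketch, with the hypothetical helper name `at_least_one_of`, enumerates them:

```python
from itertools import combinations


def at_least_one_of(*items):
    """All non-empty combinations of the listed items, smallest first."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]
```

For three items this yields seven combinations, matching the seven alternatives spelled out in the text for A, B, and C.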

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Pathology (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method may include partitioning a set of sequence data into a plurality of subsets of sequence data, each including a different set of features including biomarker genes. A plurality of machine learning models may be applied to each subset of sequence data. Each machine learning model may be trained to predict, based on the subset of sequence data, a patient response to a treatment for a disease. An ensembled feature set may be generated to include features having a largest impact on a first performance of the top performing machine learning models applied to each subset of sequence data. The plurality of machine learning models may be applied to predict, based on the ensembled feature set, the patient response to the first treatment for the disease.
PCT/US2023/076606 2022-10-12 2023-10-11 Biomarker selection for machine learning enabled prediction of treatment response WO2024081737A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263415500P 2022-10-12 2022-10-12
US63/415,500 2022-10-12

Publications (1)

Publication Number Publication Date
WO2024081737A1 true WO2024081737A1 (fr) 2024-04-18

Family

ID=90670324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/076606 WO2024081737A1 (fr) 2022-10-12 2023-10-11 Biomarker selection for machine learning enabled prediction of treatment response

Country Status (1)

Country Link
WO (1) WO2024081737A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180358132A1 (en) * 2017-06-13 2018-12-13 Alexander Bagaev Systems and methods for identifying cancer treatments from normalized biomarker scores
US20210280271A1 (en) * 2019-06-27 2021-09-09 Scipher Medicine Corporation Developing classifiers for stratifying patients

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23878196

Country of ref document: EP

Kind code of ref document: A1