WO2024102327A1 - Use of sparse electronic health records to predict a health outcome

Use of sparse electronic health records to predict a health outcome

Info

Publication number
WO2024102327A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, features, time, subjects, machine learning
Application number
PCT/US2023/036851
Other languages
English (en)
Inventor
Istvan Bartha
Michael Cyrus Riley MAHER
Amalio Telenti
Original Assignee
Humabs Biomed Sa
Vir Biotechnology, Inc.
Application filed by Humabs Biomed Sa and Vir Biotechnology, Inc.
Publication of WO2024102327A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G16H40/00: ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20: ICT specially adapted for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/30: ICT specially adapted for calculating health indices; for individual health risk assessment
    • G16H50/70: ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients
    • G16H50/80: ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu

Definitions

  • EHR: electronic health record data
  • processing EHR data of subjects with varying data sparsity can involve generating features across a plurality of time bins, such that every time bin, or almost every time bin, includes at least one feature value.
  • trained machine learning models can analyze features that are present across the time bins, thereby avoiding the issues that often arise due to data sparsity.
  • Machine learning models can accurately predict health outcomes, examples of which include health outcomes of concern, such as hospitalization or severe disease outcomes.
  • methods disclosed herein are useful for predicting health outcomes related to infectious diseases.
  • EHR data of varying data sparsity are processed and analyzed using machine learning models to accurately predict likely hospitalization or severe disease outcomes due to infection by an infectious disease.
  • a method for assessing risk of hospitalization of a subject due to an infectious disease comprising: obtaining or having obtained electronic health record (EHR) data with varying data sparsity for a plurality of subjects; for a subject, generating time-binned features across a plurality of time bins, the time-binned features lacking the data sparsity from the obtained EHR data; for each of one or more of the plurality of subjects: analyzing, using a machine learning model, the time-binned features lacking the data sparsity to generate a prediction for the subject, wherein the machine learning model treats data missingness patterns in EHR data as a feature; and determining whether one or more of the plurality of subjects are at risk of hospitalization according to the predictions for the plurality of subjects.
  • methods disclosed herein further comprise: prior to analyzing using the machine learning model, collapsing one or more time-binned features within a time bin.
  • collapsing one or more time-binned features within a time bin comprises combining features with similar attributes.
  • the similar attributes comprise a common disease hierarchy or a common drug type.
  • the method does not involve data imputation.
  • the features comprise features representing patterns of use of health resources comprising one or more of a predicted volume of EHR records after a hospitalization or outpatient visit that is predicted from available EHR data, or time between a date of a first hospital or outpatient visit and date of a subsequent hospitalization or outpatient visit.
  • the features comprise a ratio of neutrophils over lymphocytes.
  • methods disclosed herein further comprise: prior to generating time-binned features, harmonizing the EHR data obtained from a plurality of sources.
  • harmonizing the EHR data comprises harmonizing disparate data using any of diagnostic codes or laboratory measurements with an aligned unit system.
  • harmonizing the EHR data comprises performing count-based integration of related measurements.
  • the machine learning model further analyzes an additional feature representing outputted predictions for a different task.
  • the different task comprises a prediction for subjects who experience a severe disease outcome.
  • the machine learning model achieves at least 80% accuracy for hospitalized subjects. In various embodiments, the machine learning model achieves at least 85% accuracy for hospitalized subjects.
  • the plurality of subjects comprise pediatric subjects. In various embodiments, the plurality of subjects comprise subjects previously diagnosed with the infectious disease. In various embodiments, the infectious disease is COVID-19 or influenza.
  • analyzing, using the machine learning model, the time-binned features comprises analyzing, using the machine learning model, values for time-binned features comprising geographical location or administrative settings. In various embodiments, the time-binned features comprising geographical location or administrative settings more heavily contribute towards the prediction for the subject in comparison to other features analyzed by the machine learning model. In various embodiments, the machine learning model analyzes values for at least 100, at least 500, at least 1000, at least 1500, or at least 2000 features.
  • the one or more time bins comprise three or more time bins.
  • a first time bin represents between 0 and 4 intervals prior to a hospitalization date.
  • a first time bin represents between 4 and 8 intervals prior to a hospitalization date.
  • a first time bin represents greater than 8 intervals prior to a hospitalization date.
  • methods disclosed herein further comprise: for one or more of the plurality of subjects who are determined to be at risk of hospitalization, providing care to the one or more of the plurality of subjects.
  • a method for assessing risk of a severe disease outcome of a subject due to an infectious disease comprising: obtaining or having obtained electronic health record (EHR) data for a plurality of subjects; for a subject, generating time-binned features across a plurality of time bins, the time-binned features lacking the data sparsity from the obtained EHR data; for each of one or more of the plurality of subjects: analyzing, using a machine learning model, the time-binned features lacking the data sparsity to generate a prediction for the subject, wherein the machine learning model treats data missingness patterns in EHR data as a feature; and determining whether one or more of the plurality of subjects are at risk of a severe disease outcome according to the predictions for the plurality of subjects.
  • methods disclosed herein further comprise: prior to analyzing using the machine learning model, collapsing one or more time-binned features within a time bin.
  • collapsing one or more time-binned features within a time bin comprises combining features with similar attributes.
  • the similar attributes comprise a common disease hierarchy or a common drug type.
  • the method does not involve data imputation.
  • the features comprise features representing patterns of use of health resources comprising one or more of a predicted volume of EHR records after a hospitalization or outpatient visit that is predicted from available EHR data, or time between a date of a first hospital or outpatient visit and date of a subsequent hospitalization or outpatient visit.
  • the features comprise a ratio of neutrophils over lymphocytes.
  • methods disclosed herein further comprise: prior to generating time-binned features, harmonizing the EHR data across the plurality of subjects.
  • harmonizing the EHR data comprises harmonizing disparate data using any of diagnostic codes or laboratory measurements with an aligned unit system.
  • harmonizing the EHR data comprises performing count-based integration of related measurements.
  • the machine learning model further analyzes an additional feature representing outputted predictions for a different task.
  • the different task comprises a prediction for subjects who are hospitalized.
  • the machine learning model achieves at least 80% accuracy for subjects who experience severe outcomes.
  • the machine learning model achieves at least 87% accuracy for subjects who experience severe outcomes.
  • the plurality of subjects comprise pediatric subjects.
  • the plurality of subjects comprise subjects previously diagnosed with the infectious disease.
  • the infectious disease is COVID-19 or influenza.
  • analyzing, using the machine learning model, the updated EHR data comprises analyzing, using the machine learning model, values for features comprising one or more of albumin levels, oxygen saturation, gestation, BMI, and pre-existing chronic disease of the heart and the kidney.
  • the features comprising one or more of albumin levels, oxygen saturation, gestation, BMI, and pre-existing chronic disease of the heart and the kidney more heavily contribute towards the prediction for the subject in comparison to other features analyzed by the machine learning model.
  • the machine learning model analyzes values for at least 100, at least 500, at least 1000, at least 1500, or at least 2000 features.
  • the one or more time bins comprise three or more time bins.
  • a first time bin represents between 0 and 8 days before an outpatient visit.
  • a second time bin represents between 8 and 32 days before an outpatient visit.
  • a third time bin represents greater than 32 days before an outpatient visit.
  • methods disclosed herein further comprise: for one or more of the plurality of subjects who are determined to be at risk of a severe disease outcome, providing care to the one or more of the plurality of subjects.
  • FIG. 1A depicts an overall system environment for predicting health outcomes using electronic health record data of subjects, in accordance with an embodiment.
  • FIG. 1B is an example block diagram of a data sparsity prediction system, in accordance with an embodiment.
  • FIG. 2A depicts example features from electronic health record data with varying data sparsity, in accordance with an embodiment.
  • FIG. 2B depicts an example time binning of features, in accordance with an embodiment.
  • FIG. 2C depicts example collapsed features across time bins, in accordance with an embodiment.
  • FIG. 3A is an example block diagram for implementing a machine learning model, in accordance with an embodiment.
  • FIG. 3B is an example flow diagram for predicting health outcomes using electronic health record data, in accordance with an embodiment.
  • FIG. 4 illustrates an example computing device for implementing methods and systems described in FIGs. 1A, 1B, 2A, 2B, 2C, and 3.
  • FIGs. 5A and 5B depict prediction of cardiovascular interventions (FIG. 5A) and respiratory interventions (FIG. 5B) up to 4 days before the outcome.
  • FIG. 6 depicts the performance (e.g., as measured by receiver operating characteristic (ROC) curve area) as well as enrichment over baseline of the various classifiers.
  • FIGs. 7A and 7B show the most influential procedure and diagnostic codes, respectively.
  • “subject,” “individual,” or “patient” are used interchangeably and encompass a cell, tissue, organism, human or non-human, mammal or non-mammal, male or female, whether in vivo, ex vivo, or in vitro.
  • different subjects can be human or non-human, and as such, universal signatures, as described herein, can be generated and/or deployed for both human and non-human subjects.
  • the term “disease” generally refers to an infectious disease.
  • An infectious disease can be caused by infection of a subject with a virus, such as for example, severe acute respiratory syndrome coronavirus 1 (SARS-CoV-1), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), an influenza virus, Ebola virus, human immunodeficiency virus (HIV), Hepatitis B virus (HBV), Hepatitis C virus (HCV), Human papillomavirus (HPV), or herpes simplex virus (HSV).
  • the disease is severe acute respiratory syndrome (SARS) caused by a SARS-CoV-1 virus.
  • the disease is coronavirus disease 2019 (COVID-19) caused by a SARS-CoV-2 virus.
  • the disease is influenza caused by an influenza virus (e.g., Influenza A virus, influenza B virus, influenza C virus, or influenza D virus).
  • An infectious disease can be caused by infection of a subject with a bacterium, such as, for example, Mycobacterium tuberculosis.
  • “sample” or “test sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
  • obtaining electronic health record data or “obtaining EHR data” encompasses obtaining EHR data from one or more subjects.
  • Obtaining EHR data encompasses obtaining EHR data directly from one or more subjects (e.g., communicating with and/or questioning the subject).
  • the phrase also encompasses receiving EHR data, e.g., from a third party that has obtained the EHR data from one or more subjects.
  • time-binned features that lack data sparsity refers to features for a subject that are processed from EHR data with varying data sparsity, wherein the features are informative for predicting health outcomes.
  • time-binned features that lack data sparsity refer to features across a plurality of time bins, such that at least a threshold number of time bins have at least one feature value.
  • time-binned features that lack data sparsity refer to features across a plurality of time bins, such that every time bin has at least one feature value.
  • FIG. 1A depicts an overall system environment for predicting health outcomes using electronic health record (EHR) data of subjects, in accordance with an embodiment.
  • FIG. 1A introduces subjects 110.
  • the subjects 110 include subjects with an age of at least 18 years old.
  • the subjects 110 include subjects with an age of at least 21 years old, at least 25 years old, at least 30 years old, at least 35 years old, at least 40 years old, at least 45 years old, at least 50 years old, at least 55 years old, or at least 60 years old.
  • the subjects 110 include pediatric subjects.
  • pediatric subjects are subjects that are less than 18 years old.
  • pediatric subjects are subjects that are less than 17 years old, less than 16 years old, less than 15 years old, less than 14 years old, less than 13 years old, less than 12 years old, less than 11 years old, less than 10 years old, less than 9 years old, less than 8 years old, less than 7 years old, less than 6 years old, less than 5 years old, less than 4 years old, less than 3 years old, less than 2 years old, or less than 1 year old.
  • pediatric subjects are between 0 and 18 years old, between 0 and 17 years old, between 0 and 16 years old, between 0 and 15 years old, between 0 and 14 years old, between 0 and 13 years old, or between 0 and 12 years old.
  • FIG. 1A shows three subjects 110, in various embodiments there may be additional subjects 110 (e.g., at least 10, at least 50, at least 100, at least 500, at least 1000, at least 10,000, at least 100,000, or at least 1 million subjects).
  • one or more of the subjects 110 were previously diagnosed with a disease, such as an infectious disease.
  • the overall system environment shown in FIG. 1A is useful for predicting health outcomes for patients due to the diagnosed disease.
  • one or more subjects 110 have not been diagnosed with a disease, such as an infectious disease.
  • the overall system environment shown in FIG. 1A can be useful for predicting health outcomes related to a disease for not-yet diagnosed subjects.
  • the subjects may have been recently exposed to an infectious disease, but have not yet exhibited symptoms.
  • the predicted health outcome can represent a future outcome for the subject exposed to the infectious disease.
  • EHR data 120 are collected from the subjects 110.
  • EHR data 120 refers to medical records of subjects 110.
  • Examples of EHR data 120 can include, but are not limited to: patient demographics information, prior patient diagnoses, medical history, prior laboratory tests and results, immunization records, written clinical notes, standardized medical condition codes (e.g., ICD/SNOMED), and socioeconomic variables (e.g., zip code, health insurance plan).
  • the EHR data 120 may be collected by a medical professional.
  • the EHR data 120 is stored and maintained in a database.
  • Example databases for storing EHR data 120 can include any of EPIC Healthcare Software, EBS PathoSof, HxRx Healthcare Management System, Healcon Practice, Drug Inventory Management System (DIMS), oeHealth, Patientpop, Webptis, GeBBS HIM Solutions, Cerner, WebPT, eClinicalWorks, and NextGen Healthcare EHR.
  • the EHR data 120 may be obtained from subjects 110 at various timepoints.
  • EHR data 120 may be collected from a subject 110 during a visit to a medical professional (e.g., a visit to a hospital, to a medical laboratory, and/or to a doctor’s office).
  • the timepoints at which EHR data is collected from a first subject may differ from the timepoints at which EHR data is collected from a second subject.
  • EHR data 120 may be obtained from different subjects 110 at varying frequencies.
  • EHR data 120 may be collected from a first subject at a higher frequency (e.g., because the first subject frequently visits a medical professional) in comparison to a second subject (e.g., because the second subject visits a medical professional less frequently).
  • EHR data 120 is collected from subjects 110 at various timepoints and at varying frequencies, this leads to the presence of data sparsity across the EHR data 120.
  • Data sparsity refers to the presence of gaps in the EHR data 120 that can render analysis of the EHR data 120 difficult.
  • the data sparsity prediction system 130 analyzes EHR data 120 on a per-subject basis to determine a predicted health outcome 140 for each subject 110.
  • the data sparsity prediction system 130 processes the EHR data 120 to reduce or remove the data sparsity present in the EHR data 120.
  • the data sparsity prediction system 130 can extract features from data that are timestamped with various timepoints and can collapse features across one or more time bins to remove or reduce the data sparsity.
  • the data sparsity prediction system 130 implements trained machine learning models to analyze the updated features in which the data sparsity is reduced or removed and to generate a predicted health outcome 140. The methods performed by the data sparsity prediction system 130 are described in further detail below.
  • the predicted health outcome 140 generally refers to an outcome for a subject 110 due to an infectious disease.
  • An infectious disease can be caused by infection of a subject with a pathogen, such as any one of a virus, bacteria, fungus, or a parasite.
  • the pathogen is any one of: severe acute respiratory syndrome-related coronavirus (SARS), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), influenza, Ebola virus, human immunodeficiency virus (HIV), Hepatitis B virus (HBV), Hepatitis C virus (HCV), Human papillomavirus (HPV), Mycobacterium tuberculosis, herpes simplex virus infection (HSV), or influenza virus (e.g., Influenza A virus, influenza B virus, influenza C virus, or influenza D virus).
  • the predicted health outcome 140 can refer to a likelihood of hospitalization of the subject due to the infectious disease.
  • the predicted health outcome 140 can refer to a likelihood of a severe disease outcome for the subject due to the infectious disease.
  • a severe disease outcome can include any of hospitalization, admission to an intensive care unit (ICU), a need for ventilation, and death.
  • a severe disease outcome can be defined according to any of the following: dyspnea, a respiratory rate of 30 or more breaths per minute, a blood oxygen
  • a medical professional can provide appropriate care to the corresponding subject 110 to address the predicted health outcome 140. For example, if the predicted health outcome 140 for a subject 110 indicates that the subject is likely to experience severe disease or hospitalization in the future, the subject 110 can be provided a therapy to avoid the likely severe disease or hospitalization. For example, in the context of SARS-CoV-2 infections, treatment efficacy may be dependent on how early the treatment is started (e.g., at the onset of symptoms or even prior to symptom development). Thus, by determining that a subject 110 is likely to experience severe COVID-19 disease and/or experience hospitalization, the subject 110 can be provided a therapy at an early enough time to avoid the outcome.
  • methods performed by a data sparsity prediction system can involve analyzing EHR data on a per-subject basis to determine a predicted health outcome for each subject.
  • methods performed by the data sparsity prediction system can involve reducing data sparsity in EHR data and implementing trained machine learning models to analyze features of reduced data sparsity from the EHR data.
  • the trained machine learning models can predict likely health outcomes for subjects (e.g., pediatric subjects diagnosed with an infectious disease).
  • FIG. 1B shows an example block diagram of a data sparsity prediction system 130, in accordance with an embodiment.
  • FIG. 1B introduces individual modules of the data sparsity prediction system 130, which in various embodiments includes a data harmonization module 150, a feature processing module 160, a model training module 180, and a model deployment module 190.
  • the modules of the data sparsity prediction system 130 may be differently configured.
  • the model training module 180, which is responsible for training machine learning models, may not be embodied as part of the data sparsity prediction system 130 and instead can be employed by a third party.
  • the model training module 180 need not be present in the data sparsity prediction system 130 as shown in FIG. 1B.
  • the data harmonization module 150 accesses EHR data from one or more sources and performs data harmonization to ensure that the EHR data from the one or more sources are unified.
  • the data harmonization module 150 can access EHR data from multiple databases. Unharmonized EHR data contain records which use a diverse set of notations for the same concepts. Such a notational diversity across different EHR data may render subsequent analysis (e.g., feature processing and/or automatic pattern recognition processes by machine learning models) difficult.
  • the data harmonization module 150 aligns the EHR data from various sources to generate one or more harmonized tables of EHR data.
  • the harmonized tables of EHR data can be subsequently analyzed e.g., for feature extraction.
  • the data harmonization module 150 differently harmonizes EHR data from various sources according to the types of values in the EHR data. For example, for EHR data of discrete values, the data harmonization module 150 may employ a lookup table that identifies discrete values of different sources. Thus, using such a lookup table, discrete values from one source can be matched up to discrete values from another source.
  • the data harmonization module 150 may perform a transformation of the continuous values to harmonize the data.
  • the transformation may be a function that transforms the data values of a first source to align with the data values of a second source.
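  • As an illustration of the harmonization step described above, the following is a minimal sketch (not the claimed implementation) assuming pandas DataFrames with hypothetical column names: a lookup table matches source-specific discrete test codes to a shared concept, and a transformation function converts continuous laboratory values into an aligned unit system.

```python
import pandas as pd

# Hypothetical per-source lab records; table layout and column names are illustrative only.
source_a = pd.DataFrame({"subject": [1, 2], "test": ["GLUC", "GLUC"],
                         "value": [5.4, 6.1], "unit": ["mmol/L", "mmol/L"]})
source_b = pd.DataFrame({"subject": [3], "test": ["glucose_serum"],
                         "value": [110.0], "unit": ["mg/dL"]})

# Discrete harmonization: a lookup table matching source-specific codes to a shared concept.
concept_lookup = {"GLUC": "glucose", "glucose_serum": "glucose"}

# Continuous harmonization: transform values from each unit into one aligned unit system.
unit_transform = {("glucose", "mg/dL"): lambda v: v / 18.0,   # convert to mmol/L
                  ("glucose", "mmol/L"): lambda v: v}

def harmonize(records: pd.DataFrame) -> pd.DataFrame:
    out = records.copy()
    out["concept"] = out["test"].map(concept_lookup)
    out["value"] = [unit_transform[(c, u)](v)
                    for c, u, v in zip(out["concept"], out["unit"], out["value"])]
    out["unit"] = "mmol/L"
    return out[["subject", "concept", "value", "unit"]]

# One harmonized measurement table built from both sources.
measurement_table = pd.concat([harmonize(source_a), harmonize(source_b)], ignore_index=True)
```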
  • the one or more harmonized tables of EHR data include: an insurance table, an observation table, a measurement table, a conditions table, a drug table, a visit occurrences table, and a feature concept table.
  • each of the tables includes EHR data records that have been harmonized across multiple sources or databases.
  • the insurance table includes EHR data related to insurers, payers, and payees.
  • each row of the insurance table is a contiguous period of insurance coverage, where attributes describe the contiguous period, e.g., subject, time frame, insurance type, and feature concept.
  • the observation table includes EHR data identifying observations recorded for subjects, e.g., on one or more prior visits (to the hospital or to an outpatient center). For example, the observation table can include information of written notes from a medical professional who engaged with a subject during a visit.
  • each row of the observation table is an observation, where attributes include timestamp, subject reference, observation concept, observation value, and observation unit concept (the unit in which the value is given).
  • the measurements table refers to laboratory measurements that were captured for a subject.
  • Example laboratory measurements may be, for example, measurements following a blood test.
  • the measurement table may be schematically analogous to the observation table, with the difference that the measurement table identifies defined medical measurements as opposed to observations.
  • the conditions table refers to EHR data related to prior medical history of subjects. For example, if a subject was previously diagnosed with a particular condition or disease, the conditions table would reflect that prior diagnosis.
  • each row of the conditions table is a medical diagnosis. Attributes of the conditions table can include subject reference, time stamp, and concept reference.
  • the drug table refers to EHR data related to prior prescriptions and/or drugs that were provided to subjects.
  • each row of the drug table is a drug prescription.
  • Attributes of the drug table include subject, timestamp, drug concept reference, or dosage.
  • the visit occurrences table may identify when prior visits were made by subjects, e.g., hospital visits or visits to an outpatient center.
  • attributes of the visit occurrence table include subject, timestamp, and details of a visit (e.g., hospital/ambulatory/telemedicine/pharmacy).
  • the feature concept table includes identification of feature concepts, which represent overarching feature attributes for which individual features are categorized.
  • the feature concept table includes a lookup table mapping concept identifiers (integers) to short textual descriptions.
  • occurrences of textual terms are replaced with a reference into this feature concept table. Insertions of new concepts are reviewed to avoid multiple records in the feature concept table referring to the same actual semantic concept.
  • The point of such a controlled vocabulary is to prevent synonymous and near-synonymous usage of terms, i.e., to describe the same or similar concepts with the same terms so that, later on in data analysis, all occurrences of the same semantic concept can be found and appropriately aggregated.
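  • A minimal sketch of such a controlled vocabulary is shown below; the concept identifiers, descriptions, and the duplicate check are hypothetical and only illustrate mapping integer concept identifiers to short textual descriptions while avoiding synonymous entries.

```python
# Hypothetical feature concept table: integer identifiers mapped to short textual descriptions.
feature_concepts = {1001: "body mass index", 1002: "c-reactive protein", 1003: "hospital visit"}

def add_concept(concepts, description):
    """Insert a new concept only if an equivalent description is not already present."""
    normalized = description.strip().lower()
    for concept_id, existing in concepts.items():
        if existing == normalized:
            return concept_id              # reuse the existing concept identifier
    new_id = max(concepts) + 1
    concepts[new_id] = normalized
    return new_id

concept_id = add_concept(feature_concepts, "C-Reactive Protein")   # returns 1002, no duplicate added
```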
  • the feature processing module 160 further processes the harmonized EHR data to generate features that are informative for predicting health outcomes.
  • the feature processing module 160 processes the harmonized EHR data to generate time-binned features that lack data sparsity.
  • although machine learning models would typically have difficulty analyzing information with data sparsity, by generating time-binned features that lack data sparsity, machine learning models can appropriately analyze the time-binned features to generate a predicted health outcome for a subject.
  • the feature processing module 160 can further generate an additional feature, herein referred to as a data missingness feature, which captures information of the data sparsity of the EHR data.
  • the feature processing module 160 reduces or removes the data sparsity information from the time-binned features.
  • further generating a data missingness feature can, in various embodiments, re-supply the data sparsity information such that the machine learning model can consider the data sparsity information in its analysis.
  • the data missingness feature for an individual represents absence of EHR data.
  • absence of EHR data can indicate that the individual did not undergo a corresponding event e.g., an observation, such as a laboratory measurement or clinical diagnosis.
  • absence of EHR data can be indicative or informative of the individual’s healthy status during the period of data missingness in the individual’s EHR data.
  • the machine learning model can account for the individual’s healthy status during the period of data missingness.
  • the EHR data may be thought of as a sequence of time-stamped events, where each event corresponds to an observation, e.g., a laboratory measurement or a clinical diagnosis, at which the data was gathered and/or recorded.
  • the totality of a subject’s observations may be very different in composition between patients, i.e., certain data will be present for a patient while missing for another.
  • the feature processing module 160 analyzes EHR data for a subject and extracts relevant features from the EHR data.
  • the EHR includes a sequence of time stamped events
  • the corresponding extracted features from the EHR data may similarly be time stamped according to the timing of the EHR data. For example, assuming an EHR datapoint of a blood draw at a particular timepoint, the corresponding feature may be a laboratory measurement from the blood at that particular timepoint.
  • Example features include, but are not limited to, patient demographics information (e.g., age, race, ethnicity), socioeconomic features (e.g., geographic location such as zip codes, salary), insurance information (e.g., payer plan, payer identifier, insurance identifier), one or more prior clinical diagnosis (e.g., diagnosis codes), one or more prior drugs (e.g., drugs prescribed to the subject), and one or more prior laboratory measurements (e.g., neutrophil levels, lymphocyte levels).
  • features extracted from the EHR data are quantified record features, which represent quantified numbers of records in one or more tables holding the harmonized EHR data.
  • quantified record features can include a total number of records in an insurance table, a total number of records in an observation table, a total number of records in a measurement table, a total number of records in a conditions table, a total number of records in a drug table, or a total number of records in a visit occurrences table.
  • a quantified record feature can also be a total number of records in one or more feature concepts, where each feature concept includes at least 3 related features with corresponding feature values.
  • the quantified record features can be extracted from the EHR data by implementing a machine learning model.
  • a machine learning model, referred to as a “records machine learning model,” can be trained to predict records after a particular cutoff date (e.g., hospitalization/outpatient visit).
  • the records machine learning model is different from the machine learning model described herein for predicting a health outcome.
  • the records machine learning model is trained using segmented training data, where the training data is segmented based on cutoff dates (e.g., hospitalization/outpatient visits). Therefore, by analyzing records that occur before a cutoff date, the records machine learning model can predict one or more records that occur after the cutoff date.
  • the quantified record features can represent a total number of records in any of the aforementioned tables after the cutoff date (hospitalization/outpatient visit) that was predicted from the data available before the cutoff date.
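  • One way such a records machine learning model could be sketched, with synthetic data and an arbitrary choice of regressor standing in for whatever model is actually used, is shown below: per-subject record counts before the cutoff date are used to predict the volume of records after the cutoff, and that prediction then serves as a quantified record feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training matrix: per-subject record counts before the cutoff date,
# one column per harmonized table (observations, measurements, conditions, drugs, visits).
X_before = rng.poisson(lam=5, size=(200, 5)).astype(float)
# Training target: the record volume actually observed after the cutoff for known subjects.
y_after = X_before.sum(axis=1) * 0.3 + rng.normal(0.0, 1.0, size=200)

records_model = RandomForestRegressor(n_estimators=100, random_state=0)
records_model.fit(X_before, y_after)

# For a new subject, the prediction itself becomes a quantified record feature.
new_subject_counts = np.array([[4.0, 7.0, 2.0, 1.0, 3.0]])
predicted_record_volume = records_model.predict(new_subject_counts)[0]
```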
  • features extracted from the EHR data are time interval features which represent time intervals between two timepoints in the harmonized EHR data.
  • a time interval feature can include a time between a covid index (e.g., date of first COVID related visit) and a date of outpatient visit or hospitalization, a number of days between a cutoff date (e.g., a date of outpatient visit or hospitalization) and a first record available in the EHR data, or a number of days between a cutoff date and a last record before the cutoff date.
  • the feature processing module 160 can further generate updated features, which represent combinations of any of the aforementioned features.
  • One example of an updated feature is a ratio of values in prior laboratory measurements.
  • an updated feature can be a ratio of neutrophils over lymphocytes from prior laboratory measurements.
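  • The sketch below derives two such features, a time interval feature (days between the COVID index date and the cutoff visit) and an updated ratio feature (neutrophils over lymphocytes), from a hypothetical per-subject summary table; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-subject summary pulled from the harmonized EHR tables.
ehr = pd.DataFrame({
    "subject": [1, 2],
    "covid_index_date": pd.to_datetime(["2021-01-02", "2021-02-10"]),
    "cutoff_date": pd.to_datetime(["2021-01-09", "2021-02-20"]),  # outpatient visit or hospitalization
    "neutrophils": [4.2, 7.8],   # 10^9 cells/L
    "lymphocytes": [1.8, 0.9],
})

# Time interval feature: days between the COVID index and the cutoff visit.
ehr["days_index_to_cutoff"] = (ehr["cutoff_date"] - ehr["covid_index_date"]).dt.days

# Updated (derived) feature: ratio of neutrophils over lymphocytes.
ehr["neutrophil_lymphocyte_ratio"] = ehr["neutrophils"] / ehr["lymphocytes"]
```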
  • the feature processing module 160 represents the features, such as the exemplary features described above, on a timeline.
  • the feature processing module 160 reduces the timeline into a plurality of non-overlapping time bins, such that the features are categorized into corresponding time bins according to their timestamps.
  • the time bins represent fixed window time bins, where each time bin may have a different time window.
  • FIG. 2A depicts example features from electronic health record data with varying data sparsity, in accordance with an embodiment.
  • FIG. 2A depicts features for two patients (e.g., patient A and patient B).
  • the feature processing module 160 may generate example features from EHR data for more patients.
  • the feature processing module 160 may extract features 1A, 2A, 3A, 4A, 5A, and 6A for patient A as well as features 1B, 2B, 3B, 4B, and 5B for patient B.
  • the features for each patient are located across the timeline at various timepoints, thereby resulting in the presence of data sparsity.
  • data sparsity 210A is shown for patient A as arising due to a gap between feature 3A and feature 4A.
  • Data sparsity 210B is shown as arising from a gap between feature 4A and feature 5A.
  • Data sparsity 210C is shown as arising from a gap present prior to feature 1B.
  • Data sparsity 210D is shown as arising from a gap present between features 1B and 2B.
  • Data sparsity 210E is shown as arising from a gap present between features 3B and 4B.
  • the feature processing module 160 generates time bins, thereby categorizing features into time bins to generate time-binned features that lack data sparsity. In various embodiments, the feature processing module 160 generates time bins according to a cutoff event in the EHR data.
  • An example cutoff event can be an outpatient visit or a hospitalization. In various embodiments, if the predicted task of the machine learning model is to predict severe disease outcome, the cutoff event is an outpatient visit. If the predicted task of the machine learning model is to predict likely hospitalization, the cutoff event can be a hospitalization.
  • a first time bin represents a time interval from 0 days up to X days before the cutoff event.
  • a second time bin represents a time interval from X days before the cutoff event to Y days before the cutoff event.
  • a third time bin represents a time interval from Y days before the cutoff event to Z days before the cutoff event.
  • X is any one of 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 7 days, 8 days, 9 days, 10 days, 11 days, 12 days, 13 days, or 14 days.
  • X is 8 days.
  • Y is any one of 5 days, 6 days, 7 days, 8 days, 9 days, 10 days, 11 days, 12 days, 13 days, 14 days, 15 days, 16 days, 17 days, 18 days, 19 days, 20 days,
  • Y is 32 days.
  • Z is any one of 25 days, 26 days, 27 days, 28 days, 29 days, 30 days, 31 days, 32 days, 33 days, 34 days, 35 days, 36 days, 37 days, 38 days, 39 days, 40 days, 41 days, 42 days, 43 days, 44 days, 45 days, 46 days, 47 days,
  • Z is large enough to reach to the beginning of the EHR data.
  • a first time bin represents a time interval from 0 days up to 8 days before the cutoff event
  • the second time bin represents a time interval from 8 days before the cutoff event to 32 days before the cutoff event
  • a third time bin represents a time interval from 32 days before the cutoff event up until the beginning of the EHR data.
  • a first time bin represents multiple time intervals in the past before a cutoff event.
  • the first time bin represents between 0 and M intervals prior to the cutoff event.
  • the second time bin represents between M intervals prior to the cutoff event and N intervals prior to the cutoff event.
  • the third time bin represents between N intervals prior to the cutoff event and P intervals prior to the cutoff event.
  • M is any of 1 interval, 2 intervals, 3 intervals, 4 intervals, 5 intervals, or 6 intervals. In particular embodiments, M is 4 intervals.
  • N is any of 4 intervals, 5 intervals, 6 intervals, 7 intervals, 8 intervals, 9 intervals, or 10 intervals.
  • In particular embodiments, N is 8 intervals.
  • P is 8 intervals, 9 intervals, 10 intervals, 11 intervals, or 12 intervals. In various embodiments, P is large enough to reach to the beginning of the EHR data.
  • a first time bin represents 0 to 4 intervals before the cutoff event
  • the second time bin represents 4 to 8 intervals before the cutoff event
  • a third time bin represents 8+ intervals before the cutoff event until the beginning of the EHR data.
  • the feature processing module 160 generates time bins to ensure that at least a threshold number of time bins include at least one feature value. By ensuring that at least a threshold number of time bins include at least one feature value, the feature processing module 160 generates time-binned features that lack data sparsity.
  • the feature processing module 160 generates time bins to ensure that at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% of time bins include at least one feature value.
  • the feature processing module 160 generates time bins to ensure that 100% of time bins include at least one feature value.
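  • A minimal sketch of this binning step is shown below, using the example boundaries of 0-8 days, 8-32 days, and greater than 32 days before the cutoff event; the event table, column names, and bin labels are illustrative and the boundaries are configurable.

```python
import pandas as pd

def assign_time_bins(events, cutoff, edges_days=(8, 32)):
    """Label each pre-cutoff event with a time bin counted backwards from the cutoff event."""
    out = events.copy()
    out["days_before_cutoff"] = (cutoff - out["timestamp"]).dt.days
    out = out[out["days_before_cutoff"] >= 0].copy()          # drop events after the cutoff
    out["time_bin"] = pd.cut(out["days_before_cutoff"],
                             bins=[-0.5, edges_days[0], edges_days[1], float("inf")],
                             labels=["bin_0_8d", "bin_8_32d", "bin_32d_plus"])
    return out

# Hypothetical timestamped events for one subject.
events = pd.DataFrame({
    "subject": [1, 1, 1],
    "concept": ["bmi", "crp", "bmi"],
    "value": [22.0, 4.0, 21.5],
    "timestamp": pd.to_datetime(["2021-03-01", "2021-02-15", "2020-12-01"]),
})
binned = assign_time_bins(events, cutoff=pd.Timestamp("2021-03-05"))
```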
  • FIG. 2B depicts an example time binning of features, in accordance with an embodiment.
  • FIG. 2B shows generation of three example time bins.
  • additional or fewer time bins can be generated to accommodate the features.
  • every time bin includes at least one feature value.
  • the time-binned features can be expressed according to their respective time bins instead of a particular time stamp. As such, the data sparsity that was present in the features in FIG. 2A is reduced or removed.
  • patient A may have three time-binned features in time bin 1, one time-binned feature in time bin 2, and two time-binned features in time bin 3.
  • Patient B may have one time-binned feature in time bin 1, two time-binned features in time bin 2, and two time-binned features in time bin 3.
  • the feature processing module 160 can additionally process the time-binned features. For example, the feature processing module 160 can further collapse the features in each time bin to further condense the information.
  • the result of this method is a single table of time-binned features, where each row of the table corresponds to a subject.
  • the feature processing module 160 collapses features in a time bin into a single feature by computing a statistical value of the features in the time bin.
  • the feature processing module 160 collapses features in a time bin into a single feature by computing an average value of the features in the time bin.
  • the feature processing module 160 collapses features in a time bin into a single feature by computing any of a summated value, minimum value, maximum value, mode value, or total count value of the features in the time bin.
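  • Continuing the binning sketch, the following illustrates one way to collapse binned features into a single row per subject using a mean per (concept, time bin) pair; the mean is just one of the aggregation options listed above, and the small event table is hypothetical.

```python
import pandas as pd

# Hypothetical time-binned events (continuing the binning sketch above).
binned = pd.DataFrame({
    "subject": [1, 1, 1, 2],
    "concept": ["bmi", "bmi", "crp", "bmi"],
    "time_bin": ["bin_0_8d", "bin_32d_plus", "bin_8_32d", "bin_0_8d"],
    "value": [22.0, 21.5, 4.0, 30.1],
})

# Collapse: one aggregated value per (subject, concept, time bin), then pivot to one row per subject.
collapsed = (binned
             .groupby(["subject", "concept", "time_bin"])["value"]
             .mean()                       # could equally be min, max, sum, mode, or count
             .unstack(["concept", "time_bin"]))
collapsed.columns = [f"{concept}__{time_bin}" for concept, time_bin in collapsed.columns]
# Combinations with no events remain NaN: no data imputation is performed.
```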
  • FIG. 2C depicts example collapsed features across time bins, in accordance with an embodiment.
  • the time-binned features in a common time bin have been collapsed into a single feature.
  • features 1A, 2A, and 3A shown in time bin 1 in FIG. 2B have been aggregated into collapsed feature 1A shown in FIG. 2C.
  • Feature 4A shown in time bin 2 in FIG. 2B is the only feature in time bin 2.
  • feature 4A can represent collapsed feature 2A shown in FIG. 2C.
  • Features 5A and 6A in time bin 3 in FIG. 2B have been aggregated into collapsed feature 3A shown in FIG. 2C.
  • Feature 1B is represented as collapsed feature 1B
  • features 2B and 3B are represented as collapsed feature 2B
  • features 4B and 5B are represented as collapsed feature 3B.
  • Table 1A: EHR events for Patient 1
  • the EHR events can be time-binned into separate bins.
  • a first bin (referred to as Bin 0) can include all events between 0-7 days before the COVID diagnosis.
  • a second bin (referred to as Bin 1) can include all events between 7-30 days before the COVID diagnosis.
  • a third bin (referred to as Bin 2) can include all events beyond 30 days before the COVID diagnosis.
  • Table 1C: Collapsed Feature Table for Patients 1 and 2, where “N/A” refers to not applicable and is an indication of data missingness. Thus, the entries of “N/A” can be taken as data missingness features, as described herein, when deploying the machine learning model.
  • Patient 1 and Patient 2 share similar BMI; however, Patient 2 has a unique pattern of CRP availability (i.e., CRP was measured only in this patient).
  • the unique pattern of CRP availability may be informative even though the CRP value is normal (e.g., as there may have been indications that suggested the pattern of CRP measurements).
  • Patient 1 has no CRP measurement which may be meaningful in the form of data missingness. For example, Patient 1 may not have had any health indications that would have suggested the need to obtain CRP measurements.
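  • The sketch below mirrors the Table 1C situation with assumed values: two patients with similar BMI, where only Patient 2 has a CRP measurement; the missing CRP entry for Patient 1 is kept as NaN rather than imputed, and an explicit indicator column is one way the availability pattern can be exposed to the model.

```python
import numpy as np
import pandas as pd

# Hypothetical collapsed feature rows mirroring Table 1C: similar BMI, different CRP availability.
feature_table = pd.DataFrame(
    {"bmi__bin_0_8d": [22.0, 22.5],
     "crp__bin_0_8d": [np.nan, 3.0]},      # Patient 1 has no CRP measurement at all
    index=pd.Index([1, 2], name="patient"))

# An explicit indicator column is one way to expose the CRP availability pattern to the model.
feature_table["crp_missing__bin_0_8d"] = feature_table["crp__bin_0_8d"].isna().astype(int)
```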
  • the model deployment module 190 implements one or more machine learning models to analyze features, such as time-binned features in which data sparsity is reduced or removed.
  • a machine learning model outputs a predicted health outcome for a subject based on the analysis of the features.
  • the time-binned features in which data sparsity is reduced or removed include features across a plurality of time bins.
  • the time-binned features can be features described above in reference to FIG. 2B in which features have been assigned to particular time bins to ensure that every time bin, or almost every time bin, has a feature value.
  • the time-binned features can be collapsed features described above in reference to FIG. 2C.
  • time-binned features exhibit reduced data sparsity such that the machine learning model can appropriately analyze the features without needing to deal with gaps in the features.
  • machine learning models are configured to further analyze a feature describing the data missingness present in the EHR data for the subject.
  • the additional data missingness feature can be an alternative representation of this data sparsity information.
  • the data missingness feature can include one or more time intervals between data points in the EHR data for a subject.
  • the data missingness feature may identify the time between a first timepoint during which EHR data was captured and recorded for a subject and a second timepoint during which subsequent EHR data was captured and recorded for the subject.
  • the data missingness feature can be a statistical measure of time intervals between data points in the EHR data for a subject.
  • the data missingness feature can be an average, median, or mode of time intervals between data points in the EHR data for a subject.
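  • A minimal sketch of such a statistical missingness feature is shown below, computing the median number of days between consecutive EHR records per subject; the choice of the median and the record table are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamped EHR records for two subjects.
records = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-20", "2021-03-01", "2021-02-01", "2021-02-03"]),
})

# Data missingness feature: median number of days between consecutive records, per subject.
gap_feature = (records.sort_values("timestamp")
               .groupby("subject")["timestamp"]
               .apply(lambda ts: ts.diff().dt.days.median())
               .rename("median_days_between_records"))
```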
  • the data missingness feature is generated from one or more tables (e.g., harmonized tables) of the EHR data.
  • example tables of the EHR data can include one or more of an insurance table, an observation table, a measurement table, a conditions table, a drug table, a visit occurrences table, and a feature concept table.
  • tables of the EHR data can include values indicative of missing data (e.g., a value indicative of missing data can be identified as “NA” or “NaN”).
  • the data missingness feature can represent an accumulation of values indicative of missing data across the one or more tables of the EHR data.
  • the data missingness feature can be a feature vector that includes the values indicative of missing data.
  • the data missingness feature is extracted from the organized time-binned features.
  • Referring again to FIG. 2B, it depicts example features (e.g., time-binned features) across the multiple time bins.
  • for time bins that do not include a feature value, a value indicative of missing data (e.g., “NA” or “NaN”) can be included amongst the time-binned features.
  • the algorithm can recognize these values indicative of missing data as a special value. These values indicative of missing data are not compared or computed along with other feature values, but instead, the values indicative of missing data, when taken as a data missingness feature, can include patterns that are informative for generating the predicted health outcome. Altogether, by analyzing the time-binned features and the data missingness feature derived from EHR data for a subject, the machine learning model generates a predicted health outcome for the subject.
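  • One way an algorithm can treat values indicative of missing data as a special value, sketched below with synthetic placeholder data, is to use an estimator with native missing-value handling such as scikit-learn's HistGradientBoostingClassifier, which routes NaN entries along a dedicated branch at each split rather than comparing them with ordinary feature values; this is an illustrative choice, not necessarily the model used in the described embodiments.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[rng.random(X.shape) < 0.4] = np.nan     # sparse feature table with many missing entries
y = rng.integers(0, 2, size=500)          # placeholder hospitalization labels

# Gradient-boosted trees with native NaN handling: the missingness pattern participates
# in the splits instead of being imputed away.
model = HistGradientBoostingClassifier(random_state=0).fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]
```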
  • FIG. 3A shows an example block diagram for implementing a machine learning model, in accordance with an embodiment.
  • FIG. 3A depicts a data missingness feature 302 that captures the data sparsity information of the EHR data.
  • FIG. 3A depicts time-binned features 305 in which data sparsity of the EHR data has been reduced or removed.
  • Although FIG. 3A shows three time-binned features 305, in various embodiments, additional time-binned features can be generated and included.
  • the data missingness feature 302 and the time-binned features 305 are provided as input to the machine learning model 310.
  • the machine learning model 310 analyzes the data missingness feature 302 and time-binned features 305 to generate the predicted health outcome 140.
  • the machine learning model 310 further analyzes an additional feature representing an outputted prediction for a different task. For example, in a scenario where the machine learning model 310 is trained to predict a health outcome of hospitalization, the machine learning model 310 further analyzes an additional feature representing a prediction for a different health outcome (e.g., a severe disease outcome).
  • the additional feature representing a prediction for a different health outcome can be informative for predicting a health outcome of hospitalization.
  • the prediction outputted by a first machine learning model can serve as an input feature to the second machine learning model.
  • the first machine learning model is trained to predict a severe disease outcome.
  • the prediction of the severe disease outcome from the first machine learning model is used as an input feature to a second machine learning model trained to predict likely hospitalization.
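  • A sketch of this two-model arrangement, with placeholder data and an arbitrary choice of classifier, is shown below; in practice the first model's score would typically be produced out-of-fold to avoid leaking training labels into the second model.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))            # placeholder time-binned feature matrix
y_severe = rng.integers(0, 2, size=500)   # placeholder severe disease outcome labels
y_hosp = rng.integers(0, 2, size=500)     # placeholder hospitalization labels

# First model: predicts risk of a severe disease outcome.
severe_model = HistGradientBoostingClassifier(random_state=0).fit(X, y_severe)
severe_score = severe_model.predict_proba(X)[:, 1]

# Second model: hospitalization prediction, with the severe disease score as an extra feature.
X_stacked = np.column_stack([X, severe_score])
hospitalization_model = HistGradientBoostingClassifier(random_state=0).fit(X_stacked, y_hosp)
```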
  • Embodiments disclosed herein describe the training of machine learning models for predicting health outcomes for subjects.
  • the steps described in this section may, in various embodiments, be performed by the model training module 180 shown in FIG. 1B.
  • machine learning models disclosed herein can be any one of a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, support vector machine, Naive Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, generative adversarial networks, or recurrent networks (e.g., long short-term memory networks (LSTM), bi-directional recurrent networks, or deep bi-directional recurrent networks)).
  • the machine learning model is a logistic regression model.
  • the machine learning model is a random forest model.
  • the machine learning model can be trained using a machine learning implemented method, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, and gradient boosting algorithm.
  • the machine learning model has one or more parameters, such as hyperparameters or model parameters.
  • Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function.
  • Model parameters are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of neural network, support vectors in a support vector machine, and coefficients in a regression model.
  • the model parameters of the machine learning model are trained (e.g., adjusted) using the training data to improve the predictive power of the machine learning model.
  • the model training module 180 trains machine learning models using training examples from training data. For example, during training, the parameters of the machine learning model are tuned using training examples. In particular embodiments, at each iteration, the machine learning model generates a prediction by analyzing the training example, and the parameters of the machine learning model are tuned to improve the predictive capacity of the machine learning model.
  • a training example includes features, such as time-binned features and a data missingness feature that are determined from training data of a known subject (e.g., EHR data from a known subject).
  • the known subject may be known to experience a particular health outcome, such as a hospitalization or a severe disease outcome.
  • the particular health outcome experienced by the known subject can serve as a reference ground truth for training the machine learning model.
  • the reference ground truth can be represented as a value, such as a binary value or a continuous value.
  • the reference ground truth of the known subject can be represented as a first value.
  • the reference ground truth of the known subject can be represented as a second value, which differs from the first value.
  • the reference ground truth is a binary value (e.g., 0 or 1).
  • a value of 0 can indicate that the subject does not subsequently experience a hospitalization or a severe disease outcome, whereas a value of 1 can indicate that the subject subsequently experiences a hospitalization or a severe disease outcome.
  • the reference ground truth is a continuous value (e.g., between 0 and 1).
  • a value closer to 0 can reflect that the subject does not subsequently experience a hospitalization or experiences a less severe disease outcome.
  • a value closer to 1 can indicate that the subject experiences a near hospitalization event or a more severe disease outcome.
  • model parameters of the machine learning model are tuned to improve the prediction outputted by the machine learning model.
  • model parameters enable the machine learning model to differently weigh or consider corresponding input features.
  • the machine learning model can more heavily weigh or consider a first feature in comparison to a second feature.
  • a machine learning model tasked for predicting likely hospitalization of subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of geographical location (e.g., zip codes) or a feature derived from geographical location in comparison to other features.
  • the feature of geographical location or a feature derived from geographical location more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • a machine learning model tasked for predicting likely hospitalization of subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of administrative settings or a feature derived from administrative settings in comparison to other features.
  • the feature of administrative settings or a feature derived from administrative settings more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model. Examples of features of administrative settings include features of insurance payer plan information, including insurance presence, insurance duration, and type of insurance plan.
  • a machine learning model tasked for predicting likely severe disease (e.g. severe COVID-19) in subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of a laboratory result or a feature derived from a laboratory result in comparison to other features.
  • the feature of a laboratory result or a feature derived from a laboratory result more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • a machine learning model tasked for predicting likely severe disease (e.g. severe COVID-19) in subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of a gestation period or a feature derived from a gestation period in comparison to other features.
  • the feature of a gestation period or a feature derived from a gestation period more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • a machine learning model tasked for predicting likely severe disease (e.g. severe COVID-19) in subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of a body mass index (BMI) or a feature derived from BMI in comparison to other features.
  • the feature of BMI or a feature derived from BMI more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • a machine learning model tasked for predicting likely severe disease (e.g. severe COVID-19) in subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of a pre-existing chronic disease of the heart or a feature derived from a pre-existing chronic disease of the heart in comparison to other features.
  • the feature of a pre-existing chronic disease of the heart or a feature derived from a pre-existing chronic disease of the heart more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • a machine learning model tasked for predicting likely severe disease (e.g. severe COVID-19) in subjects includes one or more parameters such that the machine learning model more heavily weighs or considers a feature of a pre-existing chronic disease of the kidney or a feature derived from a pre-existing chronic disease of the kidney in comparison to other features.
  • the feature of a pre-existing chronic disease of the kidney or a feature derived from a pre-existing chronic disease of the kidney more heavily contributes towards the prediction of the machine learning model in comparison to other features analyzed by the machine learning model.
  • FIG. 3B is an example flow diagram for predicting health outcomes using electronic health record data, in accordance with an embodiment.
  • Step 312 involves obtaining electronic health record (EHR) data for a plurality of subjects.
  • the subjects include pediatric subjects.
  • step 315 involves analyzing the EHR data for a subject. As shown in FIG. 3B, step 315 may, in various embodiments, further involve steps 320, 325, and 330.
  • step 320 involves generating features from the EHR data of the subject.
  • features can include patterns of use of health resources, clinical characteristics, or laboratory characteristics.
  • Such patterns of use of health resources, clinical characteristics, or laboratory characteristics may include derived features, which represent combinations of different EHR data.
  • An example of a clinical characteristic may be a combination of a first infectious disease related visit and a subsequent outpatient visit.
  • the clinical characteristic may be a quantifiable time between the first infectious disease related visit and the subsequent outpatient visit, which may be informative for predicting a health outcome.
  • An example of a laboratory characteristic can include a combination of neutrophil levels and lymphocyte levels.
  • the laboratory characteristic can be the ratio of neutrophil levels to lymphocyte levels, which may be informative for predicting a health outcome.
  • Further example features, including derived features representing combinations of EHR data, are described herein.
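  • a minimal sketch of computing such derived features with pandas is shown below; the table layout and column names (e.g., first_id_visit, neutrophils) are hypothetical and chosen only for illustration:
      import pandas as pd

      ehr = pd.DataFrame({
          "subject_id":       [1, 2],
          "first_id_visit":   pd.to_datetime(["2021-01-02", "2021-02-10"]),
          "outpatient_visit": pd.to_datetime(["2021-01-09", "2021-02-12"]),
          "neutrophils":      [4.2, 7.8],   # illustrative counts, 10^3 cells/uL
          "lymphocytes":      [1.8, 0.9],
      })

      # Derived clinical characteristic: days between the infectious disease related visit
      # and the subsequent outpatient visit.
      ehr["days_to_outpatient"] = (ehr["outpatient_visit"] - ehr["first_id_visit"]).dt.days

      # Derived laboratory characteristic: neutrophil-to-lymphocyte ratio.
      ehr["nlr"] = ehr["neutrophils"] / ehr["lymphocytes"]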
  • Step 325 involves collapsing the features across one or more time bins to generate updated features.
  • EHR data of subjects can include data sparsities (e.g., lack of data) at various timepoints.
  • collapsing features across the one or more time bins reduces the data sparsity of the updated features.
  • the resulting updated features are represented in a feature matrix, such as an input feature matrix that can be analyzed by a machine learning model.
  • the feature matrix is a matrix of N rows and M columns, where N is the number of subjects and M is the number of updated features.
  • M is the number of time bins, such that one or more of the M time bins each include a corresponding updated feature.
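  • a minimal sketch of collapsing long-format records into such a feature matrix with pandas is shown below; the bin edges, concept names, and column names are illustrative assumptions:
      import pandas as pd

      events = pd.DataFrame({
          "subject_id":      [1, 1, 1, 2],
          "feature":         ["systolic_bp", "systolic_bp", "spo2", "systolic_bp"],
          "days_before_cut": [2, 20, 5, 40],
          "value":           [118.0, 124.0, 97.0, 130.0],
      })

      # Collapse each feature into coarse time bins relative to a cutoff date.
      events["time_bin"] = pd.cut(events["days_before_cut"],
                                  bins=[0, 8, 32, float("inf")],
                                  labels=["0-8d", "8-32d", "32+d"])

      # One row per subject, one column per (feature, time bin) pair; bins with no
      # records remain NaN, so the remaining sparsity is represented explicitly.
      feature_matrix = events.pivot_table(index="subject_id",
                                          columns=["feature", "time_bin"],
                                          values="value",
                                          aggfunc="mean",
                                          observed=False)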
  • Step 330 involves analyzing, using a machine learning model, the updated features with reduced sparsity to generate a prediction for the subject.
  • the generated prediction may be a score that is informative of a health condition.
  • step 315 may be performed for a different subject (e.g., a different subject for whom EHR data was obtained at step 310).
  • a prediction is generated for each subject.
  • one or more subjects at risk for a health outcome are identified using the predictions.
  • risk of a health outcome is risk for hospitalization e.g., due to an infectious disease.
  • risk of a health outcome is risk for a severe disease outcome e.g., due to an infectious disease.
  • subjects identified to be at risk for a health outcome can be provided the appropriate care or treatment to prevent or avoid the health outcome.
  • Embodiments disclosed herein involve predicting health outcomes for subjects using EHR data, such as subjects who have been previously diagnosed with an infectious disease (e.g., COVID-19).
  • a subject predicted to experience a severe disease outcome or hospitalization can be provided a therapy to treat the subject to avoid the severe disease outcome or hospitalization.
  • a therapy to be provided to a subject can include an antiviral treatment.
  • the antiviral treatment includes any of Nirmatrelvir with Ritonavir (PAXLOVID), Remdesivir (VEKLURY), or Molnupiravir (LAGEVRIO).
  • a therapy to be provided to a subject can include an antibody treatment.
  • the antibody treatment is a monoclonal antibody such as Bebtelovimab. Further details of the example therapies are described below in Table 1.
  • Table 1 Example therapies for treating subjects
  • a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
  • the different methods described above in relation to FIGs. 1A, 1B, 2A, 2B, 2C, and 3, such as the methods for predicting a health outcome using EHR data, may be implemented using one or more computing devices.
  • the data sparsity prediction system 130 may be embodied as one or more computing devices.
  • the methods for predicting a health outcome using EHR data can be implemented in hardware or software, or a combination of both.
  • a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine-readable data which, when used with a machine programmed with instructions for using said data, is capable of displaying any of the datasets and execution results (e.g., a prediction of a health outcome).
  • Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device.
  • a display is coupled to the graphics adapter.
  • Program code is applied to input data to perform the functions described above and generate output information.
  • the output information is applied to one or more output devices, in known fashion.
  • the computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
  • Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system.
  • the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language.
  • Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein.
  • the system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • a computer readable medium comprising computer executable instructions configured to implement any of the methods described herein.
  • the computer readable medium is a non-transitory computer readable medium.
  • the computer readable medium is a part of a computer system (e.g., a memory of a computer system).
  • the computer readable medium can comprise computer executable instructions for performing the methods described herein e.g., methods for predicting a health outcome using EHR data.
  • the signature patterns and databases thereof can be provided in a variety of media to facilitate their use.
  • Media refers to a manufacture that contains the signature pattern information of the present invention.
  • the databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer.
  • Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
  • Recorded refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.
  • FIG. 4 illustrates an example computing device for implementing methods and systems described in FIGs. 1A, 1B, 2A, 2B, 2C, and 3.
  • the computing device 400 includes at least one processor 402 coupled to a chipset 404.
  • the chipset 404 includes a memory controller hub 420 and an input/output (I/O) controller hub 422.
  • a memory 406 and a graphics adapter 412 are coupled to the memory controller hub 420, and a display 418 is coupled to the graphics adapter 412.
  • a storage device 408, an input interface 414, and a network adapter 416 are coupled to the I/O controller hub 422.
  • Other embodiments of the computing device 400 have different architectures.
  • the storage device 408 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
  • the memory 406 holds instructions and data used by the processor 402.
  • the input interface 414 is a touch-screen interface, a mouse, track ball, or other type of input interface, a keyboard 410, or some combination thereof, and is used to input data into the computing device 400.
  • the computing device 400 may be configured to receive input (e.g., commands) from the input interface 414 via gestures from the user.
  • the graphics adapter 412 displays images and other information on the display 418.
  • the network adapter 416 couples the computing device 400 to one or more computer networks.
  • the computing device 400 is adapted to execute computer program modules for providing functionality described herein.
  • module refers to computer program logic used to provide the specified functionality.
  • a module can be implemented in hardware, firmware, and/or software.
  • program modules are stored on the storage device 408, loaded into the memory 406, and executed by the processor 402.
  • a computing device 400 can include a processor 402 for executing instructions stored on a memory 406.
  • Example 1 Example Data and Methods
  • the training data included accumulated data from August 2020 to July 30, 2021, consisting of 203,508 pediatric COVID-19-positive patients from 55 contributing sites (Table 2).
  • the testing holdout set used prospectively collected data that was accumulated during initial model development, from July 30, 2021 to December 9, 2022, consisting of 201,083 pediatric COVID-19 patients from 64 sites (Tables 2A and 2B). Patients that appeared in the training data were removed from the prospectively collected testing data.
  • Table 2A Demographic breakdown of patient cohort for training machine learning models
  • Table 2B Demographic breakdown of patient cohort for testing machine learning models
  • Task 2 was focused on predicting the risk of a patient needing additional medical intervention once they were hospitalized. Task 2 asked the following question: Of pediatric patients who tested positive for COVID-19 and were hospitalized, who are at risk for needing mechanical ventilation or cardiovascular interventions? COVID-19-positive hospitalized patients were defined as any patients who tested positive for COVID-19 within seven days of being admitted to an inpatient visit. The true positives included all patients who, during their hospitalization, needed mechanical ventilation, extracorporeal membrane oxygenation, or cardiovascular support, or expired during their hospital stay.
  • EHR data presents three challenges for conventional tabular data analysis: (a) it is a time series; (b) it is extremely sparse; and (c) it is extremely diverse, in the sense that similar concepts have different encodings.
  • the goal involved building a feature matrix of N rows and M columns where N is the number of samples and M is the number of columnar features.
  • the timestamped EHR records were time binned into a small number of time bins relative to a cutoff date (e.g., average blood pressure of a subject 4 to 30 days before the cutoff).
  • a missingness aware gradient boosted tree classifier was employed which learns both data and missingness patterns simultaneously, rendering data imputation unnecessary.
  • the methods involve using both manual curation and count-based integration of data belonging to related concepts (e.g., instead of taking the average blood pressure for a subject, the number of blood pressure records for a subject is quantified in a time period).
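  • a minimal sketch of such count-based integration with pandas is shown below; the concept names, time window, and column names are illustrative assumptions:
      import pandas as pd

      records = pd.DataFrame({
          "subject_id":      [1, 1, 1, 2, 2],
          "concept":         ["blood_pressure", "blood_pressure", "heart_rate",
                              "blood_pressure", "heart_rate"],
          "days_before_cut": [3, 6, 10, 15, 2],
      })

      # Instead of averaging measurement values, count how many records of each
      # related concept a subject has within a time period (here, 0-30 days before cutoff).
      window = records[records["days_before_cut"].between(0, 30)]
      counts = (window.groupby(["subject_id", "concept"])
                      .size()
                      .unstack(fill_value=0)
                      .add_suffix("_count_0_30d"))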
  • EHR records are used from the following tables: location, visit_occurrence, drug_era, payer_plan_period, procedure_occurrence, observation, condition_era, measurement. The following features were generated and derived from the EHR records:
  • Time between the COVID index (e.g., date of first COVID-related visit) and the date of outpatient visit or hospitalization.
  • Based on the concept set members table, 83 conditions, observations, and measurements were manually curated and harmonized.
  • For Task 2 (e.g., predicting hospitalization), 0-4, 4-8, and 8+ intervals in the past from the hospitalization date were used.
  • the final feature matrix for Task 1 (e.g., predicting severe disease outcome) has 1744 columns.
  • the final feature matrix for Task 2 (e.g., predicting hospitalization) has 2509 columns.
  • the trained Task 1 model was evaluated on the Task 2 instances, and that output was used as an additional feature for Task 2.
  • HistGradientBoostingClassifier was used. This classifier treated the missingness patterns as a feature; therefore, no imputation was used.
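  • a minimal sketch of this approach, using scikit-learn's HistGradientBoostingClassifier on synthetic data, is shown below; the classifier routes missing (NaN) values to a dedicated branch at each split, so no imputation step is needed, and the hyperparameter values shown are illustrative assumptions:
      import numpy as np
      from sklearn.ensemble import HistGradientBoostingClassifier

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 20))
      X[rng.random(X.shape) < 0.6] = np.nan     # highly sparse matrix, as with time-binned EHR features
      y = rng.integers(0, 2, size=1000)

      # NaN entries are handled natively at each split, so the missingness pattern
      # itself can carry predictive signal and no imputation is performed.
      clf = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1)
      clf.fit(X, y)
      scores = clf.predict_proba(X)[:, 1]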
  • Example 3 Trained Classifier Successfully Predicts Pediatric Subjects at Risk for Hospitalization and Severe Outcomes
  • National COVID Cohort Collaborative (N3C)
  • features underwent time binning and other time-related operations, as well as sparsity filters, to generate respective input features (e.g., collapsed features).
  • the following time binnings were used:
  • Time bin 1: the event happened between 0 and 8 days before the outpatient visit
  • Time bin 2: the event happened between 8 and 32 days before the outpatient visit
  • Time bin 3: the event happened 32+ days before the outpatient visit
  • this machine learning classifier was able to extract patterns from a complex set of EHR data by recognizing what was missing from the data, and to assess the risk of hospitalization and severe outcomes among pediatric subjects. The machine learning model was 85% accurate for hospitalized pediatric subjects and 87% accurate for pediatric subjects with severe outcomes.
  • the features most informative for predicting hospitalization following COVID-19 infection included geographical location (zip codes) and administrative settings. This observation underlines that a seemingly clinical observation, e.g., hospitalization from disease, may have both biological and sociological components.
  • the features most informative for predicting severe COVID-19 disease outcomes in this pediatric population were albumin levels, oxygen saturation, gestation, BMI, and pre-existing chronic disease of the heart and the kidney.
  • the disclosed model that considered data missingness was able to predict the need for respiratory and cardiovascular intervention at the following performance metrics: an F2 of 0.594, an AUPRC of 0.591, and a cross-site AUROC of 0.853 (Table 3).
  • the disclosed model successfully predicted the need for cardiovascular interventions (FIG. 5A) and respiratory interventions (FIG. 5B), most accurately up to 4 days before the outcome.
  • the disclosed model performed well across different races, gender, age, and BMI percentile with minimal variability within groups.
  • Cross-site AUROC is the AUROC obtained by calculating the model's performance on each individual data site and taking the mean of those per-site AUROCs.
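  • a minimal sketch of computing a cross-site AUROC in this way is shown below; the synthetic labels, scores, and site identifiers are illustrative assumptions:
      import numpy as np
      from sklearn.metrics import roc_auc_score

      rng = np.random.default_rng(0)
      sites   = rng.integers(0, 5, size=2000)            # illustrative data-site identifiers
      y_true  = rng.integers(0, 2, size=2000)
      y_score = 0.3 * y_true + 0.7 * rng.random(2000)    # scores loosely correlated with labels

      # Compute the AUROC separately within each contributing site, then take the mean.
      per_site_auroc = [roc_auc_score(y_true[sites == s], y_score[sites == s])
                        for s in np.unique(sites)]
      cross_site_auroc = float(np.mean(per_site_auroc))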
  • Machine learning is a process of extracting patterns from existing datasets and interpolating or extrapolating these extracted patterns to novel, unseen cases. As such, building a successful, reliable ML system is dependent on a set of existing examples of sufficient diversity.
  • the prediction task was the severity of hospitalized COVID-19 cases. Access was available to a small number of hospitalized cases and many non-hospitalized cases. To boost the performance of the system on the specific hospitalization task, all the samples (hospitalized and not) were first processed and a system was built which accurately predicts an auxiliary label. Then, this trained system was used as an additional component in the prediction system working on the smaller set of hospitalized cases to predict disease severity.
  • Many of the features that lead to hospitalization and/or severity are by their nature general or generalizable; i.e., socioeconomic features that associate with all-cause hospitalization, or clinical and laboratory features of severity across diverse diseases.
  • Example 5 Prediction of influenza in a prophylactic setting
  • EHR data in a given year, up to the month of October, were used to predict the likelihood that an individual person will have an influenza diagnosis before the following May.
  • EHR data were used to build the features and the target variable. Specifically, patients who had no health system contact during the influenza season were excluded. Influenza-specific ICD codes appearing as the primary diagnosis in any health record, or a positive or abnormal influenza test, were used as the target variable.
  • EHR data were further used to develop classifiers that predict severe influenza disease in an early treatment setting. Specifically, EHR data were used to estimate the likelihood of severe influenza disease given the medical history of a patient on and before the influenza diagnosis. A case was defined as severe if it involved hospitalization, intensive care admission, or procedures up to 28 days after the diagnosis. Featurization and classification algorithms similar to those described above in Examples 1 or 2 were implemented. Machine learning models were trained on EHR data up to August. Models were further evaluated using EHR data from October onwards. The precision at 1000 was 16-18% depending on the cross-validation fold, which represented a 5- to 6-fold improvement over the baseline prevalence.
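  • a minimal sketch of a precision-at-1000 calculation of this kind is shown below; the cohort size, prevalence, and scores are synthetic and illustrative:
      import numpy as np

      def precision_at_k(y_true, y_score, k=1000):
          """Fraction of true positives among the k highest-scoring subjects."""
          top_k = np.argsort(y_score)[::-1][:k]
          return float(np.mean(np.asarray(y_true, dtype=float)[top_k]))

      rng = np.random.default_rng(0)
      y_true  = (rng.random(50000) < 0.03).astype(int)   # ~3% baseline prevalence (illustrative)
      y_score = 0.2 * y_true + 0.8 * rng.random(50000)

      # Compare precision among the top 1000 ranked subjects against the baseline prevalence.
      print(precision_at_k(y_true, y_score), y_true.mean())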
  • the predictive models were evaluated to determine the influence of individual features.
  • the single most influential feature in the predictive models was the patient’s age. Additionally, the most influential procedure and diagnostic codes are shown below in FIGs. 7A and 7B, respectively.


Abstract

Electronic health record (EHR) data records often include variable data sparsity, which makes analysis of the EHR data difficult. The methods disclosed herein involve analyzing EHR data records to generate features, such as time-based features, in which the data sparsity is reduced or eliminated. These time-based features can thus be provided for analysis, for example by machine learning models, to predict likely infectious disease-related health outcomes for subjects.
PCT/US2023/036851 2022-11-07 2023-11-06 Utilisation de dossiers électroniques de santé rares pour prédire un résultat de santé WO2024102327A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263423227P 2022-11-07 2022-11-07
US63/423,227 2022-11-07

Publications (1)

Publication Number Publication Date
WO2024102327A1 true WO2024102327A1 (fr) 2024-05-16

Family

ID=89121495

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/036851 WO2024102327A1 (fr) 2022-11-07 2023-11-06 Utilisation de dossiers électroniques de santé rares pour prédire un résultat de santé

Country Status (1)

Country Link
WO (1) WO2024102327A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200003407A (ko) * 2017-07-28 2020-01-09 Google LLC System and method for predicting and summarizing medical events from electronic health records
WO2022133258A1 (fr) * 2020-12-18 2022-06-23 The Johns Hopkins University Real-time prediction of adverse outcomes using machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JUNYI GAO ET AL: "MedML: Fusing Medical Knowledge and Machine Learning Models for Early Pediatric COVID-19 Hospitalization and Severity Prediction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 July 2022 (2022-07-25), XP091279516 *
