IL280496A - Machine learning models for predicting laboratory test results - Google Patents

Machine learning models for predicting laboratory test results

Info

Publication number
IL280496A
Authority
IL
Israel
Prior art keywords
value
lab
emr
sample
medical field
Prior art date
Application number
IL280496A
Other languages
Hebrew (he)
Inventor
Tanay Amos
Mendelson Cohen Netta
JASCHEK Ram
SCHWARTZMAN Omer
LIFSHITZ Aviezer
Original Assignee
Yeda Res & Dev
Tanay Amos
Mendelson Cohen Netta
JASCHEK Ram
SCHWARTZMAN Omer
LIFSHITZ Aviezer
Priority date
Filing date
Publication date
Application filed by Yeda Res & Dev, Tanay Amos, Mendelson Cohen Netta, JASCHEK Ram, SCHWARTZMAN Omer, LIFSHITZ Aviezer
Priority to IL280496A
Priority to PCT/IL2022/050115 (WO2022162660A1)
Publication of IL280496A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Control Of Electric Motors In General (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Description

MACHINE LEARNING MODELS FOR PREDICTING LABORATORY TEST RESULTS

BACKGROUND

The present invention, in some embodiments thereof, relates to laboratory test results and, more specifically, but not exclusively, to systems and methods for analyzing laboratory test results.

Laboratory tests are central to patient evaluation and differential diagnosis in primary care. Clinicians analyze the test results by relying on reference normal ranges and qualitative examination of patients’ sparse and sometimes sporadic lab test histories.
SUMMARY

According to a first aspect, a computerized method for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprises: accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab test results, selecting at least one value of a medical field stored in the EMR indicating a known pathology, screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded, and training an ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the medical field.

According to a second aspect, a device for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprising: at least one hardware processor executing a code for: accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab test results, selecting at least one value of a medical field stored in the EMR indicating a known pathology, screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching
the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded, and training an ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the medical field.

According to a third aspect, a computer program product for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprising a non-transitory medium storing a computer program which, when executed by at least one hardware processor, causes the at least one hardware processor to perform: accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab test results, selecting at least one value of a medical field stored in the EMR indicating a known pathology, screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded, and training an ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least
one value of the medical field.

According to a fourth aspect, a computerized method for training a machine learning (ML) model for predicting a target value range of a lab test for a target individual, comprises: accessing a plurality of sample EMRs of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab test results for at least one lab test, selecting at least one value of a medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a timestamp of the at least one value of the medical field relative to a second time after the timestamp of the at least one value of the medical field, screening the plurality of sample EMRs matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value, and training an ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test that excludes the at least one value of the medical field.

According to a fifth aspect, a method for training a machine learning (ML) model for predicting a target value range of a lab test for a target individual, comprises: accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab test results for at least one lab test, selecting at least one of: (i) at least one value of a first medical field stored in the EMR correlated with decreased survival of a subset of the plurality of sample individuals over a time
interval, and (ii) at least one value of a second medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a timestamp of the at least one value of the medical field relative to a second time after the timestamp of the at least one value of the medical field, screening the plurality of sample EMR matching the at least one value of the first medical field and/or second medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value of the first medical field and/or second medical field and includes a sub-set of the plurality of sample EMR non-matching the at least one value of the first medical field and/or second medical field, and training an ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the first medical field and/or the second medical field.

In a further implementation of the first, second, third, fourth, and fifth aspects, selecting at least one value of the medical field stored in the EMR indicating a known pathology comprises selecting at least one value of the medical field stored in the EMR correlated with a decreased survival of a subset of the plurality of sample individuals over a time interval relative to survival of the plurality of sample individuals.

In a further implementation of the first, second, third, fourth, and fifth aspects, screening comprises excluding EMRs matching values of the medical field that are occurring at timestamps later than that of another field reporting a code of an International Classification of Disease (ICD) indicating a severe chronic medical condition maintained over the time interval.

In a further
implementation of the first, second, third, fourth, and fifth aspects, further comprising excluding from the filtered dataset sample EMR that match at least one issued medication in the EMR associated with a relative risk above a threshold, of a first timestamp of the severe chronic condition appearing within a time range after a timestamp indicating initiation of the at least one issued medication, wherein the at least one issued medication is located at least in a threshold percentage of EMRs of subjects matching the severe chronic condition.

In a further implementation of the first, second, third, fourth, and fifth aspects, further comprising: selecting at least one value of a second medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a timestamp of the at least one value of the second medical field relative to a second time after the timestamp of the at least one value of the second medical field, wherein screening comprises screening the plurality of sample EMRs using the at least one value of the medical field and a combination including the at least one value of the second medical field and the at least one lab test, to obtain the filtered dataset.

In a further implementation of the first, second, third, fourth, and fifth aspects, the at least one value of the second medical field includes an indication of at least one medication issued to the respective sample individual.

In a further implementation of the first, second, third, fourth, and fifth aspects, further comprising: generating a plurality of pairs covering a plurality of combinations, each pair including a combination of a respective historical lab test result and a respective medication, for each respective pair denoting a respective combination, assigning in association with the respective historical lab test result of the respective combination, an indication of whether the respective historical lab test result was
obtained before or after a timestamp indicating initiation of administration of the medication of the respective combination, analyzing the plurality of pairs to identify a statistically significant difference in historical lab test results before and after the timestamp indicating initiation of issued medication, identifying a subset of pairs for which a statistically significant difference is identified, wherein screening comprises screening the plurality of sample EMRs using the at least one value of the medical field and the subset of pairs to obtain the filtered dataset.

In a further implementation of the first, second, third, fourth, and fifth aspects, the target range is predicted for a current time interval, and further comprising: receiving a current value of the at least one lab test for the target individual obtained during the current time interval, and generating an alert when the current value of the at least one lab test is external to the target range of the at least one lab test of the target individual.

In a further implementation of the first, second, third, fourth, and fifth aspects, the prediction indicative of the target range for the at least one lab test of the target individual is within a clinically defined normal range.

In a further implementation of the first, second, third, fourth, and fifth aspects, further comprising: generating instructions for presenting a graphical user interface (GUI) on a display, the GUI including a graph plot of the plurality of historical values of the at least one lab test on a time scale, the target range predicted for the at least one lab test presented as a range on the time scale, the current value of the at least one lab test presented at a same time on the time scale as the target range.

In a further implementation of the first, second, third, fourth, and fifth aspects, the target range is within the normal range, and the current value of the at least one lab test is within the normal range.

In a further
implementation of the first, second, third, fourth, and fifth aspects, the historical lab test results of the filtered dataset are within a clinically defined normal range.

In a further implementation of the first, second, third, fourth, and fifth aspects, the current value of the at least one lab test that is external to the target range is selected from a group consisting of: (i) fasting blood glucose level and/or cholesterol level exceeding the target range, further comprising treating the target individual by at least one member selected from a group consisting of: dietary consultation, physical exercise, medication treatment, (ii) hemoglobin level lower than the target range, with mean corpuscular volume (MCV) above or below a second target range, further comprising prioritizing colonoscopy, and/or bone marrow biopsy, and/or iron or B12 supplement, (iii) calcium and/or vitamin D below the target range, further comprising treating the target individual with calcium and/or vitamin D supplements, and (iv) creatinine levels out of the target range, further comprising initiating chronic kidney disease consultation.

In a further implementation of the first, second, third, fourth, and fifth aspects, the at least one lab test comprises a combination of a plurality of lab tests, wherein the ML model generates a prediction for each of the plurality of lab tests of the combination, in response to an input of the plurality of historical values for the combination of the plurality of lab tests.

In a further implementation of the first, second, third, fourth, and fifth aspects, the target range is predicted for a future time interval, and further comprising: generating an alert when at least a portion of the target range is predicted to be external to a clinically normal range defined for at least one lab test at the future time interval.

In a further implementation of the first, second, third, fourth, and fifth aspects, the at least one value of the medical field includes an
indication of at least one chronic medical condition diagnosis for the respective sample individual.

In a further implementation of the first, second, third, fourth, and fifth aspects, selecting at least one value of the medical field, further comprises identifying that the at least one value of the medical field is continually present in the EMR for at least a time interval greater than a threshold.

In a further implementation of the first, second, third, fourth, and fifth aspects, screening further comprises screening the plurality of sample EMR for an indication of pregnancy.

In a further implementation of the first, second, third, fourth, and fifth aspects, further comprising identifying sample EMRs storing the at least one value of the medical field during the time interval, and non-storing the at least one value of the medical field during a second time interval, wherein screening comprises screening the plurality of sample EMRs matching the at least one value of the medical field by excluding historical lab test results obtained during the time interval from the filtered dataset, and retaining in the filtered dataset historical lab test results obtained during the second time interval.

In a further implementation of the first, second, third, fourth, and fifth aspects, each respective sample EMR is labelled with an indication of a birth date of the respective sample individual, and each of the plurality of historical lab test results is labelled with an indication of a date on which the at least one lab test was conducted, and further comprising: computing a respective age-normalized value of the respective sample individual for each of the plurality of historical lab test results according to the birth date of the respective sample individual and date of the respective historical lab test, wherein training the ML model comprises training the ML model on the filtered dataset for generating the target range for the at least one lab test of the target individual,
in response to an input of a plurality of historical age-normalized values of the at least one lab test, and for each of the plurality of historical values.

In a further implementation of the first, second, third, fourth, and fifth aspects, the plurality of sample EMRs are selected from a set of sample EMRs matching at least one demographic parameter, wherein the input of the target EMR into the ML model matches the indication of at least one demographic parameter.
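
As a rough illustration of the age normalization described above, the sketch below derives the age at the time of a test from the birth date and the test date, then expresses the result as an empirical quantile within a reference sample for the matching age decade. The quantile-based normalization, the decade binning, and the names `age_in_years`, `age_normalized_value`, and `reference_by_decade` are assumptions made for illustration only; the text does not specify how the age-normalized value is computed.

```python
from bisect import bisect_right
from datetime import date


def age_in_years(birth: date, test_date: date) -> int:
    """Whole years elapsed between the birth date and the lab test date."""
    years = test_date.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the test year.
    if (test_date.month, test_date.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_normalized_value(value, birth, test_date, reference_by_decade):
    """Empirical quantile of `value` among reference results for the same
    age decade (hypothetical normalization, not taken from the source)."""
    decade = age_in_years(birth, test_date) // 10 * 10
    reference = sorted(reference_by_decade[decade])
    # Fraction of reference results that are less than or equal to `value`.
    return bisect_right(reference, value) / len(reference)


# Illustrative reference values only; not real lab data.
reference = {40: [80.0, 85.0, 90.0, 95.0, 100.0]}
q = age_normalized_value(90.0, date(1980, 6, 15), date(2020, 6, 15), reference)
```

In practice the reference sample would presumably also be stratified by other demographic parameters, such as sex, in line with the demographic matching mentioned above.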
In a further implementation of the first, second, third, fourth, and fifth aspects, each of the plurality of historical lab test results is labelled with a timestamp during which the at least one lab test was performed, and wherein screening further comprises excluding from the filtered dataset sample EMRs having the timestamp within a timestamp range indicating that the at least one lab test was performed outside of normal working hours.

In a further implementation of the first, second, third, fourth, and fifth aspects, the at least one value of the medical field includes an indication of at least one medication administered to the respective sample individual.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

FIG. 1 is a block diagram of a system for computing a filtered dataset from EMRs, for computing a trained ML model from the filtered dataset, and/or for obtaining a predicted target range for a blood test from the trained ML model, in accordance with some embodiments of the present invention;

FIG. 2A is a flowchart of a method for training an ML model for predicting a target value range of a blood test for a target individual, in accordance with some embodiments of the present invention;

FIG. 2B is a flowchart of a method for obtaining a target range for a blood test from the ML model trained on the filtered dataset, in accordance with some embodiments of the present invention;

FIG. 3 is a schematic depicting an exemplary enhanced patient lab report that includes the target range for a blood test, in accordance with some embodiments of the present invention;

FIG. 4 includes schematics depicting an inference of age-sex-dependent lab distributions from EMR used in experiments, in accordance with some embodiments of the present invention;

FIG. 5 depicts exemplary empirical lab test distributions of data used in the experiments, in accordance with some embodiments of the present invention;

FIG. 6 depicts an exemplary personalization index conservation and heritability of data used in the experiments, in accordance with some embodiments of the present invention;

FIG. 7 depicts exemplary lab regression prediction models used in the experiments, in accordance with some embodiments of the present invention; and

FIG. 8 depicts exemplary abnormal lab classification models used in the experiments, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION

The present invention, in some embodiments thereof, relates to laboratory test results and, more specifically, but not exclusively, to systems and methods for analyzing laboratory (lab) test results.

An aspect of some embodiments of the present invention relates to systems, methods, a device, and/or code instructions (stored on a memory and executable by a processor(s)) for creating a filtered dataset of historical lab test results of one or more lab tests for "healthy" sample individuals, and/or that include lab test results obtained during "healthy" periods when the sample individuals were "healthy". The filtered dataset is created from electronic medical records (EMR) of sample individuals, where each sample EMR of each sample individual includes historical lab test results for one or more lab tests. Optionally, value(s) of medical field(s) of the EMR indicating a known pathology are selected. The filtered dataset is created by screening EMRs matching the value(s) of the medical field(s) for exclusion. EMRs matching the value(s) of the medical field(s) are excluded from the filtered dataset. EMRs non-matching the value(s) of the medical field(s) are included in the filtered dataset. The filtered dataset may represent historical lab results obtained from individuals over time intervals when no known indication of pathology is recorded. The historical lab test results of EMRs of the filtered dataset may be entirely or mostly within normal value ranges defined by standard practice. A machine learning (ML) model, for example, a regressor, is trained on the filtered dataset of "healthy" individuals for generating a prediction indicative of a target range for the lab test(s) of a target individual, in response to an input of historical values of the lab test(s) of the target individual. A target EMR of the target individual may exclude the value(s) of the medical field(s) which were screened for creating the filtered dataset.
EMRs of target individuals that match the value(s) of the medical field(s) which were excluded from the filtered dataset may be ineligible for processing by the ML model. Although the target range predicted for the target individual may fall within the normal range defined by standard practice, the target range may be used to identify a possibly clinically significant trajectory. When a current lab test result falls outside the predicted target range, yet is within the normal range for the lab test defined by standard practice, it may signal a red flag which may warrant further investigation (e.g., increased monitoring) and/or treatment such as preventive treatment. For example, a current fasting blood glucose (FBG) level that is normal based on standard practice, but higher than the predicted target range computed by the ML model from the person's historical FBG values, may warrant early screening investigations and/or closer monitoring for type 2 diabetes, and/or initiation of preemptive treatment for type 2 diabetes. The individual may be started on preventive treatment for type 2 diabetes, which would otherwise not be recommended based on the standard practice normal result. For example, the person may be placed on a weight loss diet, instructed to change to a healthier diet, and/or instructed to perform an exercise routine.

Optionally, value(s) of medical field(s) of the EMR are dynamically selected based on the current records in the EMR, and used to screen records matching the value(s) of the medical field(s) for exclusion from the filtered dataset. The selected value(s) of the medical field(s) may dynamically change for different EMR sets of different sample individuals.
Value(s) of a medical field(s) of the EMR that are correlated with decreased survival of a subset of sample individuals over time are selected, for example, chronic medical conditions correlated with at least a 10% decrease in survival over at least 5 years. Alternatively or additionally, value(s) of a medical field(s) of the EMR that are correlated with a statistically significant change of lab test(s) performed before a timestamp of value(s) of the medical field relative to after the timestamp, are selected, for example, administered medications that statistically significantly affect values of the lab test.
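
The survival-based selection and the subsequent screening can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the record layout (`values`, `survived_5y`), the function names, and the reading of "10% decrease" as an absolute difference in the surviving fraction are all hypothetical, and the sketch ignores censoring, which a real survival analysis would need to handle.

```python
from collections import defaultdict


def select_values_by_survival(records, min_decrease=0.10):
    """Return medical-field values whose carriers show a surviving fraction
    at least `min_decrease` below that of all sample individuals."""
    overall = sum(r["survived_5y"] for r in records) / len(records)
    counts = defaultdict(lambda: [0, 0])  # value -> [n_survived, n_total]
    for record in records:
        for value in record["values"]:
            counts[value][0] += int(record["survived_5y"])
            counts[value][1] += 1
    return {
        value
        for value, (n_survived, n_total) in counts.items()
        if overall - n_survived / n_total >= min_decrease
    }


def screen(records, selected_values):
    """Keep only records that match none of the selected values."""
    return [r for r in records if r["values"].isdisjoint(selected_values)]


# Hypothetical sample EMRs; "cond_a" and "cond_b" are placeholder values.
records = [
    {"id": "p1", "values": {"cond_a"}, "survived_5y": False},
    {"id": "p2", "values": {"cond_a"}, "survived_5y": False},
    {"id": "p3", "values": {"cond_a", "cond_b"}, "survived_5y": True},
    {"id": "p4", "values": {"cond_b"}, "survived_5y": True},
    {"id": "p5", "values": set(), "survived_5y": True},
    {"id": "p6", "values": set(), "survived_5y": True},
]
selected = select_values_by_survival(records)
filtered = screen(records, selected)
```

Because the selection is computed from the records themselves, a different set of sample EMRs can yield a different set of excluded values, which is the dynamic behavior the text describes.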
At least some implementations of the systems, methods, apparatus, and/or code instructions (stored on a memory and executable by a processor) described herein address the technical problem of training an ML model for predicting laboratory test results of an individual. The predicted results may be used, for example, for indicating risk of onset of a disease (e.g., chronic disease such as type 2 diabetes) when the predicted laboratory test results are abnormal. The individual may undergo early screening (e.g., colonoscopy) and/or preventive treatment (e.g., in an effort to prevent or reduce risk of the onset of the disease by reducing the risk of reaching the predicted abnormal test result).

At least some implementations of the systems, methods, apparatus, and/or code instructions (stored on a memory and executable by a processor) described herein address the technical field of generating ML models for predicting laboratory test results of an individual. In prior art approaches, EMR are tagged with a label indicating a ground truth such as healthy (or normal) and non-healthy (or abnormal), and a standard classifier is trained on the labelled dataset. In such datasets there is a mixture of healthy and non-healthy records, where records of non-healthy individuals are usually over-represented because unhealthy people undergo many more lab tests than healthy people. In at least some implementations described herein, the improvement is the automatic generation of the filtered dataset which includes (mostly or entirely) EMRs of individuals defined as healthy. Optionally, EMRs of individuals matching value(s) of a medical field(s) indicating a known pathology are excluded from the filtered dataset and EMRs of individuals non-matching the value(s) of the medical field(s) are included in the filtered dataset.
The filtered dataset represents lab tests taken from individuals over time intervals when no known indication of pathology is recorded.

At least some implementations address the technical problem of identifying what data to exclude from the EMR of the individuals in a systematic fashion to obtain a "healthy" dataset. Over-representation of non-healthy individuals skews the distribution of lab results. For example, diabetic patients are commonly and repeatedly sampled for glucose, and so measuring the expected value of glucose for normal patients requires excluding the high glucose measurements of diabetic patients. In another example, patients treated with medication that affects the measured lab will also tend to be sampled more often, for example patients that take statin medication for high cholesterol may shift the distribution of measured cholesterol outside the "healthy" normal range.

There is a very large number of events in each individual EMR, and it is not known a priori which of these events should trigger exclusion, and for what duration. It is technically challenging to identify which individuals should be considered to be non-healthy for exclusion and/or at what time periods the individuals should be considered to be non-healthy for exclusion. For example, removing all individuals with a chronic medical condition and/or all individuals taking medications may leave no records or few records since almost all people taking a lab test have some medical condition or are taking some medication, otherwise there is little reason for these people to undergo lab testing. In one example, ICD9 disease codes differ in severity: some are serious and others are not, and an ICD9 code is not necessarily linked one-to-one with a disease name.
In another example, different medications can affect a lab result directly, or can be linked with changes in a lab result only in the presence of a disease.
Inventors developed an automated approach for defining and screening non-healthy individuals, based on a dynamic screening approach that selects value(s) of medical field(s) of the EMR indicating a known pathology. Optionally, the approach dynamically selects value(s) of medical field(s) of the EMRs of sample individuals based on the current EMRs of the sample individuals. Records matching the value(s) of the medical field(s) are excluded from the filtered dataset. Alternatively or additionally, the exemplary approach may include selecting value(s) of the medical field(s) based on a correlation with decreased survival of a defined subset of sample individuals, for example, value(s) of medical field(s) correlated with at least a 10% decrease in at least 5 year survival, which is computed dynamically based on the current EMRs of the sample individuals. Inventors discovered that the value(s) of the medical field(s) may be certain chronic medical conditions, which depend on the actual EMRs being evaluated. The value(s) of the medical field(s) may be selected based on a statistically significant difference between lab test results before and after the value(s) of the medical field(s) are recorded in the EMR. Inventors discovered that the value(s) of the medical field(s) may be certain medications that impact certain lab test results, i.e., the lab test result of a certain lab test before the administration of a respective medication is statistically significantly different from the lab test result after the administration of the medication. 
Inventors discovered surprising correlations, for example, Pemetrexed (a chemotherapy medication) was found to be associated with changes in Hemoglobin (HGB), in addition to known correlations, such as Metformin and blood glucose levels, and Simvastatin and cholesterol levels.
In at least some implementations, the improvement is in using the filtered dataset, representing "healthy" individuals, for training the ML model for predicting future lab test values for a "healthy" target individual. The predicted values for the target individual may be within clinically normal values (e.g., standard reference intervals) but may be abnormal for the target individual, representing a potentially clinically significant predictor for the target individual. A current lab test value for the target individual may be compared to the predicted values to determine whether the current lab test value is abnormal for the target individual, yet normal within standard reference intervals, signaling a possible clinically significant risk for the target individual.
In the standard clinical practice of modern medicine (i.e., existing approaches), clinical laboratory tests are used as one of the main tools for diagnosis and management of patients in various settings. The current practice interprets these tests in two complementary ways. First, values are classified as either normal or abnormal based on predefined reference ranges [1–3]. Usually, the reference intervals are reported together with test results. Second, the physician evaluates the results in the context of the individual patient's medical history, present illness, and complementary diagnostic tests [4]. This integration of clinical data is necessary for medical diagnosis [5], facilitates the identification of unexpected changes, and prompts follow-up of ongoing trends. 
While the goal of setting reference ranges is to provide a context to the test result using population-based distributions [1], the analysis of intra-individual trends and variation within these ranges is challenging and typically approached qualitatively [6,7]. Physician decisions on management of such within-norm changes are typically not supported by quantitative and precise decision algorithms.
In the era of big-data medicine [9–12], there is potential for a universal quantitative basis for lab test interpretation that not only supports applying a precision-medicine strategy to standard practice, but may also create a point of reference for applications of machine-learning techniques and/or the various predictive measures constructed by their application. Without an unbiased and comprehensive model for the variation of common lab readouts in healthy individuals [6,7], it is becoming increasingly difficult to critically interpret sophisticated tools for risk prediction and early disease detection [13–15]. Complex interactions between variables are difficult to detect and validate when cohorts and controls are repeatedly sampled for highly specific prediction tasks, in particular since several key chronic processes (e.g., diabetes, hypertension, blood ageing) affect multiple outcomes. A population-centric model for lab-test co-variation, as described herein in at least some implementations, may address such intricate dependencies and/or improve the robustness and interpretability of subsequent endpoint prediction tasks.
At least some implementations of the systems, methods, apparatus, and/or code instructions (stored on a memory and executable by a processor) described herein improve over other standard approaches for analyzing laboratory test results. Multiple previous studies attempted to improve specific aspects of the quantitative lab test evaluation problem. 
Examples include suggesting age- and race-adjusted hemoglobin cutoffs for anemia definition [16], delineating progressive decline of common lab markers years before overt diagnosis of malignancies [17,18], and exploring the trajectories of glomerular filtration rate (GFR) change in chronic kidney disease (CKD) patients [19]. Other notable examples include glycemic indices in different populations [20,21] and their associations with prognostic outcomes. However, many previous works were limited with respect to the breadth, depth, and multi-layer nature of the data needed to support a holistic and multivariate model for lab-test distributions. An analysis of laboratory test result trends in individuals that are not currently diagnosed with a bona fide disease requires longitudinal sampling. At least some implementations of the systems, methods, apparatus, and/or code instructions described herein provide such an analysis by computing the filtered dataset, which includes records of "healthy" individuals and excludes records of "unhealthy" individuals, as described herein. In contrast, some other approaches used prospective cohorts that are deliberately designed to include sampled populations enriched with patients at higher known risk for the chronic conditions and/or already diagnosed with chronic conditions [19,20].
At least some implementations described herein provide one or more models that together set up a precision-medicine framework for the interpretation of standard lab tests. These models may replace and/or augment current standard practice qualitative decision procedures that combine red flags triggered by violations of absolute normal ranges with non-quantitative and non-compulsory analysis of personal patient histories. Personalized patient trajectories may be modelled and/or predicted given partial and/or sparse histories. Such modelling may provide interpretation of current lab tests that goes far beyond the application of rigid normal ranges. 
Variation in lab test values between individuals may mix heritable and/or environmental effects that intensify with age to define highly individualized patient trajectories. Inventors discovered that even when lab values are well within the normal ranges, their quantitative interpretation is instructive and predictive, for example, for overall survival, future emergence of pathological lab results, and/or progression toward chronic diseases.
At least some implementations described herein may augment and/or be integrated with current electronic health record management systems for supporting rich and/or personalized analysis of lab readouts using software updates alone, creating major opportunities for enhancing the quality of lab results interpretation.
Modelling patient trajectories from retrospective community data is a technical challenge that is addressed and/or solved by at least some implementations described herein. The technical challenge arises at least from the highly biased and/or irregular data collection processes found in any healthcare environment, even in a healthcare system with universal coverage such as the Israeli system that is analyzed in the "Examples" section below. These processes are designed to maximize immediate patient benefits, minimize the detrimental effects of false diagnoses, and optimally allocate limited financial resources. The process described herein (e.g., sometimes referred to as the VISOR-MD pipeline in the "Examples" section) is designed to dramatically reduce sampling biases that result from routine clinical follow-up of chronic disease versus sparse and irregular interactions with healthy patients. The practicality of a holistic approach to data cleanup, involving massive screening and filtering of all diagnoses and medications that contribute to major sampling bias, is described herein. 
In longitudinal resources (e.g., the Clalit dataset used in the experiment described in the "Examples" section below), filtering may be directed to disease time intervals in patient histories, retaining intervals prior to disease onset and/or routinely controlling for age, biological sex and/or additional factors. At least some implementations described herein may be deployed flexibly in systems with different testing policies, population characteristics, and degrees of longitudinal coverage.
Through the analysis of common lab tests and their predictive value for patients, Inventors discovered the space of chronic clinical conditions for which "early warning" is feasible using today’s standard lab tests. The precision ML models described herein can be used to evaluate the expected utility of screening patients further and to determine the optimal time intervals for such screening given personalized risk estimators. In parallel, and with the development of novel disease predictors and specific calculators, a unified quantitative framework for understanding patient trajectories in lab space may be used. This may be synergistic, for example, with the integration of new molecular markers and/or continuous sensors into medical data acquisition and decision algorithms. At least some implementations described herein can contribute to such efforts, which should involve activity toward cross-system and/or international compatibility, endorsable standards, and eventually the deployment of new tools in the field.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. 
The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). 
In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. 
These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. 
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.Reference is now made to FIG. 1, which is a block diagram of a system 100 for computing a filtered dataset 116B from EMRs 120A, for computing a trained ML model 116A from filtered dataset 116B, and/or for obtaining a predicted target range for a lab test from trained ML model 116A, in accordance with some embodiments of the present invention. Reference is now made to FIG. 2A, which is a flowchart of a method for computing a filtered dataset for training a ML model for predicting a target value range of a lab test for a target individual, in accordance with some embodiments of the present invention. Reference is also made to FIG. 2B, which is a flowchart of a method for obtaining a target range for a lab test from the ML model trained on the filtered dataset, in accordance with some embodiments of the present invention. System 100 may implement the acts of the method described with reference to FIG. 2A-2B, by processor(s) 102 of a computing device 104 executing code instructions 106A stored in a storage device 106 (also referred to as a memory and/or program store).Multiple architectures of system 100 based on computing device 104 may be implemented. In an exemplary implementation, computing device 104 storing code 106A may be implemented as one or more servers (e.g., network server, web server, a computing cloud, a virtual server) that provides centralized services for computing predicted target ranges for multiple target individuals, for example, accessing computing device 104 using respective client terminals 112 over a network 114. 
For example, computing device 104 provides software as a service (SaaS) to the client terminal(s) 112, provides software services accessible using a software interface (e.g., application programming interface (API), software development kit (SDK)), providing an application for local download to the client terminal(s) 112, and/or providing functions using a remote access session to the client terminals 112, such as through a web browser. In another example, computing device 104 may include locally stored software (e.g., code 106A) that performs one or more of the acts described with reference to FIG. 2A-2B, for example, as a self-contained client terminal that is designed to be used by users of the client terminal, for example, as a kiosk. In such implementation, the ML model may be locally trained, and/or the predicted target ranges may be locally computed. In yet another implementation, code 106A may be an add-on to an existing electronic medical record (EMR) management application that manages EMR 120A, for example, run and/or stored by a server 120 (e.g., EMR server, healthcare provider server). Code 106A may be remotely accessed by the EMR management application via an interface such as an API and/or SDK. Client terminals 112 may access server 120 to access their EMR 120A via the EMR management application. 
The predicted target ranges may be computed by code 106A (e.g., via the API and/or add-on) and presented within the EMR management application on a display of the respective client terminal 112, for example, within a graphical user interface (GUI) presentation generated by the EMR.Historical lab test results of the target individual, which are inputted into trained ML model 116A to obtain the predicted target range(s) and/or a current lab test result of the target individual which is compared to the predicted target range(s) may be obtained, for example, from EMR 120A.Computing device 104 may be implemented as, for example, a client terminal, a server, a computing cloud, an EMR server, a virtual server, a virtual machine, a mobile device, a desktop computer, a thin client, a Smartphone, a Tablet computer, a laptop computer, a wearable computer, glasses computer, and a watch computer.Processor(s) 102 of computing device 104 may be implemented, for example, as a central processing unit(s) (CPU), a graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), and application specific integrated circuit(s) (ASIC). Processor(s) 102 may include multiple processors (homogenous or heterogeneous) arranged for parallel processing, as clusters and/or as one or more multi core processing devices. Processor(s) 102 may be arranged as a distributed processing architecture, for example, in a computing cloud, and/or using multiple computing devices. Processor(s) 102 may include a single processor, where optionally, the single processor may be virtualized into multiple virtual processors for parallel processing.Data storage device 106 stores code instructions executable by processor(s) 102, for example, a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM). 
Storage device 106 stores code 106A that implements one or more features and/or acts of the method described with reference to FIGs. 2A-2B when executed by processor(s) 102.
Computing device 104 may include a data repository 116 for storing data, for example, storing one or more of trained ML model 116A that generates the predicted target range, and/or filtered dataset 116B (created as described herein from EMRs 120A) used to train an ML model to generate trained ML model 116A, as described herein. Data repository 116 may be implemented as, for example, a memory, a local hard-drive, virtual storage, a removable storage unit, an optical disk, a storage device, and/or as a remote server and/or computing cloud (e.g., accessed using a network connection).
Computing device 104 may include a network interface 118 for connecting to network 114, for example, one or more of, a network interface card, a wireless interface to connect to a wireless network, a physical interface for connecting to a cable for network connectivity, a virtual interface implemented in software, network communication software providing higher layers of network connectivity, and/or other implementations.
Network 114 may be implemented as, for example, the internet, a local area network, a virtual private network, a wireless network, a cellular network, a local bus, a point to point link (e.g., wired), and/or combinations of the aforementioned.
Computing device 104 may connect using network 114 (or another communication channel, such as through a direct link (e.g., cable, wireless) and/or indirect link (e.g., via an intermediary computing unit such as a server, and/or via a storage device)) with client terminal(s) 112 and/or server(s) 120 and/or other computing devices, as described herein.
Computing device 104 may access EMR 120A for computing filtered dataset 116B, for example, via server 120 and/or via network 114.
Computing device 104 and/or client terminal(s) 112 include and/or are in communication with one 
or more physical user interfaces 108 that include a mechanism for a user to enter data (e.g., select which lab test results and corresponding predictions to view) and/or view the computed predicted target ranges, optionally within a GUI. Exemplary user interfaces 108 include, for example, one or more of, a touchscreen, a display, a keyboard, a mouse, and voice activated software using speakers and microphone.
The ML model 116A may be, for example, a regressor such as xgboost, one or more classifiers, neural networks of various architectures (e.g., fully connected, deep, encoder-decoder, recurrent), support vector machines (SVM), logistic regression, k-nearest neighbor, decision trees, boosting, random forest, and the like. Machine learning models may be trained using supervised approaches and/or unsupervised approaches.
Referring now back to FIG. 2A, at 200, sample electronic medical records (EMR) of sample individuals are accessed. Each respective sample EMR of each respective sample individual includes historical lab test results for one or more lab tests.
Lab tests are optionally common lab tests that are routinely performed, for example, for screening for conditions (e.g., cholesterol, diabetes, body mass index (BMI)) and/or monitoring of conditions (e.g., hemoglobin A1c (HbA1c) for monitoring of diabetes, and blood pressure for monitoring of hypertension).
Exemplary lab tests include blood tests (e.g., CBC, fasting glucose, ALT, ALP, sodium, potassium, creatinine clearance, bilirubin, cholesterol profile), urinalysis tests, body measurements (e.g., blood pressure, body mass index (BMI), weight), microscopic and/or pathological analysis (e.g., to determine types of cells), culture and sensitivity test, radiological studies (e.g., radiological review of chest x-rays, CT scans, bone studies), electrocardiogram (ECG), electroencephalogram (EEG), and stool (e.g., occult blood).
There may be multiple historical lab test results for each lab test obtained over a time interval, for 
example, over several years. For example, a yearly hemoglobin test (e.g., part of a complete blood count (CBC)) obtained for 10 years.
The EMR may include additional parameters used herein, for example, medical history fields (e.g., diagnosis, smoking status, medications) and/or demographic fields (e.g., age, biological sex, geographic location).
At 202, one or more values of one or more medical fields stored in the EMR that indicate one or more known pathologies are selected.
Records that do not match the selected value(s) represent lab tests obtained from individuals over time intervals when no known indication of pathology was recorded. The value(s) of medical field(s) indicating the known pathologies represent individuals that are "unhealthy" or represent time intervals during which the individuals are "unhealthy", for removal from the full set of EMRs to generate a filtered dataset of individuals that are "healthy", as described herein.
Optionally, the value(s) of the medical field(s) indicating the known pathology is/are dynamically defined by an analysis, for example, as a relative correlation between a subset of EMRs (or a subset of time intervals within the EMRs), which may represent "unhealthy" individuals, and the remaining portion of the total set of EMRs, which may represent "healthy" individuals, and/or the full set of EMRs of all individuals (e.g., to select how to differentiate between "healthy" and "unhealthy"). The dynamic approach may result in different selections of value(s) for different sets of EMRs. For example, for a set of patients at an old age home, the chronic conditions identified as correlated with reduced survival and/or the identified medications may be different from the chronic conditions and medications identified for another set of patients in a community clinic. 
For example, the dynamically defined value(s) may be correlated with decreased survival of some individuals relative to survival of the other individuals having EMRs (e.g., as described with reference to 204), and/or may be medications correlated with a significant change in values of lab test(s) (e.g., as described with reference to 206).
Alternatively or additionally, the value(s) of the medical field(s) indicating the known pathology is/are statically defined (e.g., as predefined configurations, user selections, and/or obtained from storage). The static approach may result in the same selections of value(s), even for different sets of EMRs. For example, the static selection may cover all EMRs (and/or portions of the EMRs corresponding to time intervals) of pregnant individuals, and/or all EMRs where lab tests were obtained outside of normal operating hours, for example, as described with reference to 208.
Exemplary processes for selecting the value(s) of the medical field(s) stored in the EMR indicating one or more known pathologies are described with reference to 204-208, and may include one or more of 204, 206, and 208. Features described with reference to 204-206 may represent dynamic selection of value(s) computed based on the actual values of the set of EMRs. Features described with reference to 208 may represent static selection of value(s).
At 204, one or more values of one or more medical fields stored in the EMR indicating the known pathology may be obtained by selecting value(s) of medical field(s) correlated with decreased survival of a subset of the sample individuals over a time interval relative to survival of other sample individuals.
Decreased survival may be defined, for example, as a percentage, for example, a reduction of at least 5%, or 10%, or 15%, or other values. The time interval may be defined, for example, as at least 3, 5, 7 years, or other values.
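The selection at 204 can be illustrated with a minimal sketch. The record layout, the function name `select_values_by_survival`, and the placeholder values `chronic_A` and `transient_B` are hypothetical and not taken from the description. The sketch also assumes that "decreased survival" means an absolute drop in the fraction of individuals alive at the end of the follow-up interval, and it ignores censoring; an actual implementation may define and estimate survival differently.

```python
def select_values_by_survival(records, min_decrease=0.10):
    """Return {value: survival decrease} for values linked to reduced survival.

    records: list of dicts (hypothetical layout) with
      "values": set of medical-field values found in the EMR, and
      "alive_at_horizon": True if the individual survived the follow-up
      interval (e.g., 5 years).
    """
    all_values = set()
    for record in records:
        all_values |= record["values"]

    selected = {}
    for value in sorted(all_values):
        carriers = [r["alive_at_horizon"] for r in records if value in r["values"]]
        others = [r["alive_at_horizon"] for r in records if value not in r["values"]]
        if not carriers or not others:
            continue  # no comparison group available for this value
        # Assumption: decrease = survival fraction of others minus carriers.
        decrease = sum(others) / len(others) - sum(carriers) / len(carriers)
        if decrease >= min_decrease:
            selected[value] = decrease
    return selected


sample = [
    {"values": {"chronic_A"}, "alive_at_horizon": False},
    {"values": {"chronic_A"}, "alive_at_horizon": False},
    {"values": {"chronic_A", "transient_B"}, "alive_at_horizon": True},
    {"values": {"transient_B"}, "alive_at_horizon": True},
    {"values": {"transient_B"}, "alive_at_horizon": True},
    {"values": set(), "alive_at_horizon": True},
]
print(select_values_by_survival(sample))
```

Because the selection is computed from the records themselves, running the same function on a different set of EMRs may return a different set of values, which is the dynamic behavior described above.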
The correlation may be computed for all values of all medical fields, and/or for selected medical fields.
In exemplary embodiments, the medical field is an indication (e.g., diagnosis) of a chronic medical condition, and the value is the specific chronic medical condition, for example, cancer (of various types, for example, colorectal, prostate, breast), type 2 diabetes, stroke, and arthritis. The specific chronic medical condition may be implemented as specific International Classification of Diseases (ICD) codes.
Optionally, the selected value is identified as being maintained (e.g., continually present) in the EMR for at least a time interval greater than a threshold, for example, at least 3, 5, 7 years, or other values. Continually present values help identify chronic conditions, and/or help exclude transient and/or temporary conditions.
The values of the historical lab test results may be normal according to standard reference intervals, and/or abnormal according to the standard reference intervals.
At 206, one or more values of another medical field stored in the EMR are selected, where the value(s) are correlated with a statistically significant change in lab test(s) between a time interval before a start of appearance (e.g., initial timestamp) of the value(s) of the other medical field in the EMR and another time interval after the start of appearance (e.g., initial timestamp) of the value(s) of the other medical field.
Optionally, the other medical field is a medication field, and the value is an indication of one or more medications being issued to the respective sample individual. 
Each medication that is correlated with a statistically significant change in lab test results before start of the respective medication being issued to the individual relative to when the respective medication is being administered (i.e., after the start of administration) is selected.

Inventors note that correlation does not imply causation, and discovered that some medications may be surprisingly correlated with statistically significant changes in certain lab tests.

Optionally, the medications are identified in combination with the identified chronic medical conditions (e.g., as described with reference to 204). Optionally, for each of the identified chronic medical conditions (and/or other value of the medical field associated with decreased survival identified as described with reference to 204), issued medications stored in the EMR are identified as described herein, and/or where there is a relative risk above a threshold (e.g., 10, 20, 30, 50, or other value) for having the first timestamp of the severe condition within a first time interval (e.g., 2, 3, 4 years or other value) after the issued medication and up to a second time interval (e.g., 3, 6, 9, 12 months, or other value) before the issued medication.
An exemplary process for identifying the medication that significantly impacts lab test(s) includes: generating pairs covering multiple combinations, where each pair includes a combination of a respective historical lab test result and a respective medication. For each respective pair denoting a respective combination, an indication is assigned of whether the respective historical lab test result was obtained before or after a timestamp indicating initiation of administration of the medication of the respective combination. The indication is assigned in association with the respective historical lab test result of the respective combination. The pairs are analyzed to identify a statistically significant difference in historical lab test results (e.g., significant increase and/or decrease) before and after a timestamp indicating initiation of the issued medication, for example, the lab test results significantly increased after the initiation of the issued medication, and/or the lab test results significantly decreased after the initiation of the issued medication. In other words, each pair may be identified according to a statistically significant difference in the statistical distribution of two paired sets of values of a certain lab test for subjects with a purchase event of the respective medication. A subset of pairs for which a statistically significant difference is found is identified.
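The before/after comparison described above can be sketched as follows. This is a minimal illustration only: the source does not name the statistical test, so a two-sided paired sign-flip permutation test is assumed here as a stand-in, and the function names, the `alpha` threshold, the medication labels, and the data values are all hypothetical.

```python
import random
from statistics import mean


def paired_sign_flip_pvalue(before, after, n_perm=5000, seed=0):
    """Two-sided p-value for a shift between paired before/after lab values.

    Assumed test (not specified in the source): sign-flip permutation test.
    """
    diffs = [a - b for b, a in zip(before, after)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def significant_pairs(pairs, alpha=0.01):
    """Return the (lab_test, medication) keys whose before/after values differ."""
    return sorted(
        key
        for key, (before, after) in pairs.items()
        if paired_sign_flip_pvalue(before, after) < alpha
    )


# Hypothetical per-subject lab values before and after the first purchase event.
baseline = [14.1, 13.8, 15.0, 14.4, 13.9, 14.7, 15.2, 14.0, 13.6, 14.9]
pairs = {
    # Every subject drops by roughly 1 unit after starting the medication.
    ("hemoglobin", "drug_a"): (
        baseline,
        [13.0, 12.9, 13.8, 13.1, 13.0, 13.5, 14.1, 12.8, 12.7, 13.6],
    ),
    # Small changes in both directions that cancel out.
    ("hemoglobin", "drug_b"): (
        baseline,
        [14.3, 13.6, 15.1, 14.2, 14.0, 14.6, 15.4, 13.9, 13.7, 14.8],
    ),
}
flagged = significant_pairs(pairs)
```

In this sketch only the pair with a consistent shift is flagged; lab results belonging to flagged pairs would then be candidates for exclusion from the filtered dataset.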
The pairs may be further included when the statistically significant change is obtained for lab results having timestamps at most a first threshold (e.g., 15, 30, 45, 60 days or other values) before a first purchase event of the respective medication, and timestamps at most a second threshold (e.g., 1, 3, 6 months or other values) after the first purchase event of the respective medication, for example, to confirm a direct effect of the medication on changes in test values.

The subset of pairs is excluded from the sample EMRs to obtain the filtered dataset, as described herein (e.g., with reference to 212).

Optionally, at 208, other parameters of the EMR, optionally static values, are selected for screening, for exclusion from the filtered dataset.

Optionally, an indication of pregnancy is selected for screening. Pregnancy may be selected due to its temporary and transient nature, and due to temporary changes to lab tests that may occur during pregnancy (e.g., increased cholesterol level). EMRs of pregnant individuals may be excluded from the filtered dataset.

Optionally, historical lab test results labelled with a timestamp indicating when the respective lab test was performed, that are external to a selected time range, are selected for screening. The time range may indicate normal working hours. Historical lab test results performed outside normal working hours may be selected for screening (i.e., excluded). The exclusion may be based on Inventors' observation that tests that are usually performed outside of normal hours are for individuals that are "unhealthy", for example, due to an emergency situation, and/or in association with specific tests requiring obtaining blood samples at specific hours.
EMRs of individuals having test results obtained external to normal working hours may be excluded from the filtered dataset.

Optionally, historical lab test results labelled with an indication that they were obtained during a hospital admission are selected for exclusion, based on the assumption that individuals admitted to hospital are not "healthy".

Optionally, a demographic parameter is selected, for example, biological sex and/or age. EMRs meeting the selected demographic parameter are retained in the filtered dataset (or excluded from the filtered dataset). Different ML models may be computed for different demographic parameters, for example, a respective ML model for each biological sex. Optionally, EMRs of individuals age 20-90, or other ranges, are selected for inclusion in the filtered dataset. EMRs of individuals less than 18 may be selected for exclusion from the filtered dataset. EMRs of pediatric individuals may be excluded, since medical results of children cannot necessarily be extrapolated to adults. EMRs of the very old may be excluded based on the assumption that the very old have at least some medical condition which warrants exclusion. The input into a certain ML model may match the demographic parameter of the EMRs used to create the ML model.

At 210, further processing of the sample EMRs is performed. The processing of the sample EMRs may be performed prior to filtration, and/or the processing may be performed on the filtered dataset, i.e., the features described with reference to 210 are performed after the features described with reference to 212.

The EMRs may be adjusted for body mass index (BMI) and/or socio-demographic classification, and/or other parameters.

Optionally, lab test results in time windows where a certain individual is lacking lab test results are imputed using the ML model trained as described with reference to 214. In such an implementation, the filtered dataset may be initially created for individuals with full sets of lab test results.
An initial ML model is trained on the initial filtered dataset. The initial ML model is used to predict the missing lab test values for the individuals missing the lab test values. The filtered dataset is updated with the predicted lab test values. The initial ML model is updated based on the updated filtered dataset.

Optionally, lab test results are normalized according to age. Each respective sample EMR may be labelled with an indication of a birth date of the respective sample individual. Each of the historical lab test results may be labelled with an indication of a date on which the respective lab test was conducted. A respective age-normalized value of the respective sample individual may be computed for each of the historical lab test results according to the birth date of the respective sample individual and the date of the respective historical lab test.

An exemplary process based on features 202-210 to identify sample EMRs is now described. ICD 9 codes stored in the EMR with at least 10% reduction in survival in adults (age > 18) and repeated diagnosis indicating a severe chronic condition are identified. For each of the identified severe conditions, issued medications stored in the EMR are identified, where there is a relative risk above 30 for having the first timestamp of the severe condition within 3 years after the issued medication and up to 6 months before the issued medication. The issued medication may be found in at least 50% of the identified severe condition patients. The time of onset of the severe condition may be adjusted to the first timestamp when the issued medication was purchased.
Following identification of the severe conditions, non-healthy status may be defined from months prior to first timestamp of any of the identified ICD 9 codes associated with severe condition, in addition to pregnancy periods concluded from pregnancy and child delivery ICD codes.

At 212, the sample EMRs are screened by matching to value(s) of the medical field(s) (e.g., specific chronic medical conditions, specific medications, pregnancy, lab test performed outside of normal hours, lab test performed during hospital admission) and/or parameters (e.g., selected demographics) selected as described with reference to 204-208. The matching sample EMRs are excluded from the set of sample EMRs. A filtered dataset is obtained by excluding the sample EMRs matching the value(s) of the medical field(s), and includes a sub-set of the sample EMRs that are non-matching to the value(s) of the medical field(s). The EMRs in the filtered dataset may represent lab tests taken from individuals over time interval(s) when no known indication of pathology was recorded. The filtered dataset represents EMRs of "healthy" individuals.

Optionally, the historical lab test results of the filtered dataset are within a clinically defined normal range, i.e., a standard reference interval. Alternatively, some lab test results of the "healthy" individuals in the filtered dataset are abnormal with respect to the standard reference interval. Such abnormal test results may represent natural variations in "healthy" individuals.

Optionally, some EMRs matching the values of the medical parameter correlated with decreased survival (e.g., specific chronic conditions) include the value of the medical parameter for a time interval, and do not include the value of the medical parameter during another time interval. For example, some individuals may have a 10 year history with no chronic medical conditions, followed by the appearance of hypertension and type 2 diabetes for another 5 years.
In such cases, historical lab test results obtained during the time interval when the respective EMR matches the value(s) of the medical parameter are excluded from the filtered dataset, while historical lab test results obtained during the time interval when the same respective EMR does not match the value(s) of the medical parameter are retained in the filtered dataset. Referring to the above example, lab test results during the 10 year period with no chronic medical conditions are retained, while lab tests during the 5 year period when the individual is diagnosed with hypertension and type 2 diabetes are excluded. In another example, over the course of 10 years of lab tests, tests performed when the individual was admitted to hospital are excluded while tests performed when the individual was not admitted to hospital are included. In yet another example, for an individual on-and-off a certain medication over a 10 year period, tests performed when the individual is off the medication are included while tests performed when the individual is on the medication are excluded.

Screening is performed, for example, by excluding EMRs matching the value(s) of the medical field that occur at timestamps later (e.g., by at least a threshold, for example, months, 1 year) than that of another field, for example, a field reporting the ICD code indicating a severe chronic medical condition maintained over the time interval. Alternatively or additionally, screening is performed by excluding sample EMRs that match the issued medication(s) in the EMR associated with a relative risk above a threshold, of a first timestamp of the severe chronic condition appearing within a time range after a timestamp indicating initiation of the issued medication(s).
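The per-interval exclusion described above (retaining results from "healthy" periods and dropping results from periods matching a selected value) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function name `filter_healthy_results`, and the dates are hypothetical and are not taken from the source.

```python
from datetime import date


def filter_healthy_results(lab_results, excluded_intervals):
    """Keep lab results whose date falls outside every excluded interval.

    excluded_intervals: (start, end) date pairs, inclusive, during which the
    EMR matches a selected value (e.g., a chronic condition or hospital stay).
    """
    def is_excluded(day):
        return any(start <= day <= end for start, end in excluded_intervals)

    return [r for r in lab_results if not is_excluded(r["date"])]


# Hypothetical individual: 10 years with no chronic condition, then
# hypertension and type 2 diabetes recorded for the following 5 years.
lab_results = [
    {"test": "HGB", "date": date(2005, 3, 1), "value": 14.2},
    {"test": "HGB", "date": date(2012, 6, 15), "value": 13.9},
    {"test": "HGB", "date": date(2016, 9, 30), "value": 12.8},
]
excluded_intervals = [(date(2015, 1, 1), date(2019, 12, 31))]

healthy = filter_healthy_results(lab_results, excluded_intervals)
```

Here the two earlier results are retained and the result obtained during the diagnosed period is dropped; the same function could take hospital admission or medication intervals as additional excluded intervals.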
The issued medication(s) may be located in at least a threshold percentage of EMRs of subjects matching the severe chronic condition.

Alternatively or additionally, screening is performed by excluding sample EMRs that both match the value(s) of the medical parameter correlated with decreased survival (e.g., ICD code) and a combination of identified issued medications and lab tests significantly affected by the issued medication, to obtain the filtered dataset.

At 214, an ML model is trained on the filtered dataset.

One or more ML models may be trained on different subsets of the filtered dataset, for example, a respective ML model per biological sex, and a respective ML model per lab test or a respective ML model per combination of lab tests, for example, one ML model based on hemoglobin data, and another ML model based on a combination of multiple lab tests including hemoglobin, platelet count, and white blood cell count. Alternatively, a single ML model is trained for a combination of all selected lab tests.
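As a rough illustration of training a respective model per biological sex on a filtered dataset, the sketch below fits a one-feature least-squares regressor per sex and reports a range of plus or minus one residual standard deviation around the prediction. The choice of model, the single feature (mean of historical values), the function names, and the data values are all assumptions made for illustration; the source lists many possible architectures and does not prescribe this one.

```python
from statistics import mean, pstdev


def fit_linear(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return {"intercept": intercept, "slope": slope, "resid_sd": pstdev(residuals)}


def train_per_sex(rows):
    """rows: (sex, mean_historical_value, later_value) tuples from the filtered dataset."""
    models = {}
    for sex in sorted({row[0] for row in rows}):
        xs = [row[1] for row in rows if row[0] == sex]
        ys = [row[2] for row in rows if row[0] == sex]
        models[sex] = fit_linear(xs, ys)
    return models


def predict_target_range(model, historical_values, width=1.0):
    """Predicted value +/- width * residual standard deviation."""
    centre = model["intercept"] + model["slope"] * mean(historical_values)
    half = width * model["resid_sd"]
    return (centre - half, centre + half)


# Hypothetical hemoglobin rows: (sex, mean of historical values, later value).
rows = [
    ("F", 12.0, 12.1), ("F", 13.0, 12.9), ("F", 14.0, 14.0),
    ("F", 12.5, 12.6), ("F", 13.5, 13.4),
    ("M", 14.0, 14.2), ("M", 15.0, 14.9), ("M", 16.0, 15.9),
    ("M", 14.5, 14.6), ("M", 15.5, 15.4),
]
models = train_per_sex(rows)
low, high = predict_target_range(models["F"], [13.0, 13.2, 12.8])
```

A production model would use richer inputs (multiple lab tests, age-normalized values, longer histories); the point of the sketch is only that each subset of the filtered dataset yields its own model and its own predicted range.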
The ML model is fed an input for a target individual that includes historical values of the lab test(s) included in the filtered dataset (or subset thereof) used for training the ML model, optionally obtained from a target EMR of the target individual. Optionally, the target EMR of the target individual is screened for validating that there are no matches with the value(s) of the medical field(s) and/or other parameters used in the screening process for creating the filtered dataset. The ML model generates a prediction indicative of a target range for the lab test(s) of the input. The prediction of the target range represents a personalized prediction for the target individual, which may be different than the standard defined reference intervals, as described herein.

Optionally, for implementations where the historical lab values are normalized according to age, the ML model is trained on the filtered dataset for generating the target range for the lab test of the target individual, in response to an input of historical age-normalized values of the lab test.

Exemplary ML model architectures include: a regressor, xgboost, logistic regression, neural networks of various architectures (e.g., fully connected, deep, encoder-decoder, recurrent), support vector machines (SVM), k-nearest neighbor, decision trees, boosting, random forest, and combinations of the aforementioned. Machine learning models may be trained using supervised approaches and/or unsupervised approaches. Additional exemplary architectures and/or exemplary parameters of ML models used to perform experiments are described in the "Examples" section below.

Referring now back to FIG. 2B, at 250, one or more ML models are provided and/or selected.
When multiple different ML models are trained on different subsets of the filtered dataset including different combinations of lab test results, the ML model(s) may be selected to correspond to the current lab test(s) performed for the target individual, and/or based on the historical lab test results of the target individual, and/or manually by a user (e.g., within a GUI).

Optionally, the target EMR of the target individual is screened for validating that there are no matches with the value(s) of the medical field(s) (e.g., chronic medical conditions, medications) and/or other parameters (e.g., age, pregnancy) used to exclude sample EMRs from the filtered dataset used to train the provided and/or selected ML model.

At 252, historical lab test results of the target individual are fed into the ML model. The historical lab tests may be obtained, for example, automatically from the target EMR of the target individual and/or manually entered by a user (e.g., via a user interface). The historical lab test results may be inputted as one or more sets and/or sequences, for example, a respective sequence of historical lab test results per lab test, and/or a set that includes all timestamped historical lab test results labelled with an indication of the respective lab test.

The historical lab test results may be processed, for example, normalized according to the age of the target individual.

At 254, the ML model generates a prediction of a target range for each of the lab tests, or for a target lab test such as when a combination of lab test results of a combination of lab tests is inputted. For example, the predicted range may be for hemoglobin in response to an input of a combination of lab test results for hemoglobin, creatinine, and blood glucose.

The predicted range may be separated from the most recent lab test result by at least a selected time interval threshold, for example, at least 2-3 years or other values.
The prediction may be for 2-3 years past the most recent lab test result. The prediction may be for a current time interval, and/or the prediction may be for a future time interval such as 2-3 years past the current date.

At 256, a current lab test result of the lab test for the target individual may be obtained. For example, a hemoglobin lab test result for a lab test performed last week is obtained. For the case of obtaining the current lab test, the predicted target range corresponds to the current time interval. For example, the lab test result is obtained in the last day, week, month, 3 months, year, or other value defining the current time interval.

The target range may be obtained from the ML model by inputting historical lab test results that are older than at least the selected time interval threshold, for example, lab test results that are at least 2-3 years old are fed into the ML model, to obtain the target range for a current and/or real time.

The predicted range may be entirely within the standard reference interval.

Alternatively or additionally, at 258, a standard reference interval for the lab test is obtained.

Different scenarios of the current lab test result (obtained in 256) and the predicted target range (obtained in 254) with respect to the standard reference interval may be obtained, indicating potential clinically significant abnormalities, for example:
• The current lab test result falls within the standard reference interval and the predicted target range may fall within the standard reference interval. When the current lab test result falls outside of the predicted target range, the lab test result may be flagged as abnormal for the target individual, even though the lab test result is normal in comparison to the standard reference interval.
• At least a portion of the predicted range may be external to the standard reference interval, indicating that even though current and historical lab test results have been normal (i.e., falling within the standard reference interval), future lab test results are predicted to be abnormal with respect to the standard reference interval.

At 260, instructions for presenting a presentation on a user interface, optionally a graphical user interface (GUI) on a display, are generated.

Optionally, the presentation (e.g., GUI) includes a graph that plots the historical values of each lab test inputted into the ML model versus a time scale axis. The predicted target range outcome of the ML model may be presented as a range of predicted values for the lab test at a location on the time scale corresponding to the prediction. The current lab test result (when available) may be plotted on the graph corresponding to the time when the current lab test result is obtained.

The standard reference interval may be presented on the graph plot, for example, as lines defining the outer values of the standard reference interval range. The lines may be presented across the entire range of historical values of the lab test(s) and/or presented with reference to the predicted target range.
The lines corresponding to the standard reference interval range enable a quick visualization to determine whether historical lab test results and/or the current lab test results fall within the standard reference interval range, and therefore would otherwise be considered as "normal" using standard approaches.

The plot of the predicted target range relative to the standard reference interval may provide a quick visualization to determine whether any portion of the predicted target range falls outside the standard reference interval range, and therefore the predicted target range is predicted to be abnormal even when some or all of the historical lab test results fall within the standard reference interval range and are therefore considered normal.

The plot of the current lab test result relative to the predicted range and/or relative to the standard reference interval enables a quick visualization to determine whether the current lab test result is abnormal for the target individual (i.e., the current lab test result falls outside of the predicted range) even when the current lab test result is considered as "normal" using standard approaches (i.e., the current lab test result falls within the standard reference interval).

Reference is now made to FIG. 3, which is a schematic 306 depicting an exemplary enhanced Patient lab report 350 that includes the target range for a lab test that includes a blood test, in accordance with some embodiments of the present invention. Patient lab report 350 may be presented, for example, within a GUI on a display of a client terminal.
Line graphs 352 for blood tests WBC, HGB, PLT, MCV, RDW, LYM%, NEUT%, CREATININE, GLUCOSE, CHOLESTEROL, ALT represent a presentation of current test results based on standard practice, with normal ranges defined by a standard reference interval marked by a bolded line (highlighted in turquoise, 354 is one example) and abnormal results relative to the standard reference interval depicted in red (356 is one example).

Graphs 390 for the blood test results include predicted target ranges indicated by grey bars (364 is one example), for example, +/- 1 standard deviation or other range, computed from an ML model trained on a filtered dataset, as described herein. Historical lab test measurements are shown by black dots (358 is one example) plotted along a time axis for each blood test result. The historical lab test measurements are inputted into the ML model trained on the filtered dataset for obtaining predicted target range 364. Historical blood test measurements older than a threshold (e.g., 6 months, 1 year) may be fed into the ML model. The current blood test result (which is also presented in portion 352) is shown in turquoise (360 is one example) when falling within the standard reference interval, or shown in red when falling outside the standard reference interval (362 is one example) corresponding to values 356 in portion 352. Predicted target ranges 364 are shown on the time axis corresponding to the predicted time interval. Optionally, yellow-highlighted measurements (366 is one example) were excluded from the filtered dataset, and not used for the prediction model, due to either hospitalization or a potentially medication-affected measurement.
Prediction of high risk (e.g., >5%) for abnormal lab values in a future time interval (e.g., 2-3 years) is indicated on the right (368 is an example), where the prediction of abnormality is relative to the predicted target range.

Patient report 350 may visually depict special cases, which may not be apparent using standard approaches. For example:

For the WBC test result depicted by arrow 380, the historical blood test results and the current blood test result fall within the standard reference intervals. Moreover, the current blood test result falls within the predicted target interval.

For the HGB test result depicted by arrow 382, the historical blood test results and the current blood test result fall within the standard reference intervals. However, the target range is predicted to at least partially fall outside the standard reference interval in 2 years, indicating a risk of 8.8% that in 2 years the HGB blood test will be low relative to the standard reference interval.

For the RDW test result depicted by arrow 386, the historical blood test results and the current blood test result fall within the standard reference intervals. However, the current blood test result falls outside the predicted target interval while still being "normal" based on the standard interval. An alert and/or preventive treatment of the patient and/or further investigation and/or monitoring may be triggered.

For the NEUT% test depicted by arrow 388, the historical blood test results and the current blood test result fall outside the standard reference intervals, indicating that these results are "abnormal" based on the standard reference intervals. However, the current blood test result falls within the predicted target interval, indicating that the blood test result may be "normal" for this individual, even though the test result would be "abnormal" based on the standard reference interval.

Referring now back to FIG. 2B, at 262, an alert may be generated, for example in the following cases:
• When the current result value of the lab test is external to the target range of the lab test of the target individual, regardless of whether the current lab test result falls within or external to the standard reference interval. I.e., the current lab test result may be abnormal even when falling within the standard reference interval (which would be considered as normal using a standard approach) when the current lab test result is external to the predicted target range.
• When at least a portion of the predicted target range is external to the standard reference interval.

The alert may be, for example, a pop-up message, an email, a phone call, a note in an inbox, and/or generated code which may be inputted into another executing process to trigger further processing.

At 264, in response to the alert, the individual may be treated, may undergo screening, and/or recommendations for treatment and/or screening may be generated. The screening may be for early detection of diseases that may be linked to the lab test. The treatment may be for early preventive treatment of diseases that may be linked to the lab test.
For example:
• For a case of a fasting blood glucose level and/or cholesterol level exceeding the predicted target range, and/or at least a portion of the predicted range being external to the standard reference interval, the target individual may be treated by dietary consultation, physical exercise, and/or medications.
• For a case of hemoglobin level lower than the predicted target range, with mean corpuscular volume (MCV) above or below another predicted target range (for MCV), and/or the normal range defined by a standard reference interval, the target individual may be treated with prioritized colonoscopy, and/or bone marrow biopsy, and/or iron supplements and/or B12 supplements.
• For a case of calcium and/or vitamin D below the predicted target range, and/or at least a portion of the predicted range being below the standard reference interval, the target individual is treated with calcium and/or vitamin D supplements and/or weight bearing exercises.
• For a case of creatinine levels above and/or below the predicted target range, chronic kidney disease consultation may be initiated for the individual.

Various embodiments and aspects of the present invention as delineated herein above and as claimed in the claims section below find experimental and/or calculated support in the following examples.
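The two alert conditions described with reference to 262 can be sketched as a small check. This is an illustrative sketch only: the function name `alert_reasons`, the message strings, and the numeric values are hypothetical and are not taken from the source.

```python
def alert_reasons(current, target_range, reference_interval):
    """Return the reasons, if any, for raising an alert for one lab test.

    current: current lab test result, or None when no current result exists.
    target_range: (low, high) personalized range predicted by the ML model.
    reference_interval: (low, high) standard reference interval.
    """
    t_low, t_high = target_range
    r_low, r_high = reference_interval
    reasons = []
    # Case 1: the current result is outside the personalized target range,
    # regardless of whether it is inside the standard reference interval.
    if current is not None and not (t_low <= current <= t_high):
        reasons.append("current result outside predicted target range")
    # Case 2: at least part of the predicted target range lies outside the
    # standard reference interval.
    if t_low < r_low or t_high > r_high:
        reasons.append("predicted target range extends outside reference interval")
    return reasons


reference = (11.5, 14.5)  # hypothetical standard reference interval

# "Normal" by the standard interval, but abnormal for this individual.
case_1 = alert_reasons(13.9, (12.0, 13.5), reference)
# Current result is fine, but the predicted range dips below the interval.
case_2 = alert_reasons(12.5, (11.0, 13.0), reference)
# Nothing to report.
case_3 = alert_reasons(12.8, (12.0, 13.5), reference)
```

The returned reasons could be passed to whatever delivery mechanism is used (pop-up, email, or code consumed by another process).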
EXAMPLES

Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the present invention in a non-limiting fashion.

Inventors conducted experiments based on at least some implementations described herein, performing multivariate precision modelling of 2.1 billion lab measurements of different lab tests from 2.8 million adults (age range 20-90) over a span of 18 years in the Clalit Healthcare services system. Unsupervised analysis identified 131 chronic conditions and 52 drug-test pairs that affected test distributions, which were used as values of medical fields for filtering the records, retaining 545 million tests to create a virtual survey of lab tests in healthy individuals, i.e., the records included in the filtered dataset. Inventors show that age and biological sex alone explain less than 10% of the within-norm test variance in 89 out of 92 tests. Inventors developed ML models predicting lab values, as described herein, with R² of over 60% for 17 tests and over 36% for half of the tests. Using the ML models, Inventors predicted whether currently healthy individuals will present with abnormal test levels within 2-3 years. Inventors exemplify how this can provide advanced risk stratification for overall survival or specific subsequent diseases. These data show that multivariate modelling of lab tests can be used to infer patients' deterioration potential while their lab values are within the currently assumed normal ranges.

Inventors performed a retrospective analysis of a large and integrative electronic health record resource from the Clalit healthcare system [22,23].
Since the data derived from electronic health records (EHR) is strongly biased toward patients going through active chronic diseases (e.g., people tend to undergo lab tests when they are sick), Inventors developed VISOR-MD - a research engine for performing a virtual survey on data obtained from healthy individuals during standard medical diagnoses and follow up procedures, based on at least some implementations described herein. VISOR-MD performs a systematic and unbiased classification of patients' lab test trajectories, filtering segments representing major pathological conditions or effects of specific drugs for generating the filtered dataset, for example, as described herein. In the Clalit system this generates a filtered dataset including ~0.5 billion lab measurements from 2.7M individuals. Based on this data Inventors developed tools for multi-variate longitudinal analysis and show they can predict patients' within-norm lab trajectories at surprisingly high accuracy, as described herein. ML models can then be applied to patients with still-healthy lab readouts in order to evaluate, for example, risks for future lab test abnormalities, deterioration toward multiple types of chronic diseases, and overall mortality. These data open the way for implementing precision tools for interpreting complex lab histories and characterizing patient pre-disease trajectories, supplementing rigid clinical normal ranges with flexible and rich quantitative models.

Reference is now made to FIGs. 4-8, which are schematics presenting methods and/or results of the experiments, in accordance with some embodiments of the present invention.

FIG. 4 includes schematics depicting an inference of age-sex-dependent lab distributions from EMR used in the experiments, in accordance with some embodiments of the present invention. Flowchart 402 depicts an exemplary VISOR-MD workflow, as described herein.
Community lab measurements are labeled systematically by segmenting patient trajectories into periods of health, medication and chronic disease (see methods section below). Lab tests acquired from patients within health segments are used to redefine empirical lab ranges across the healthy ageing spectrum and to derive personalized models for predicting patients' future lab abnormalities and subsequent disease. Schematic 404 depicts exemplary EHR lab ascertainment bias. Bars indicate the number of lab measurements (y-axis) by age (x-axis) according to patient inferred status (color) for females (left) and males (right). Status is determined by screening time intervals under the effect of diagnoses (dx) affecting overall survival, medications (med) that correlate with test value alteration, pregnancy, or hospitalization (hosp.). Schematic 406 depicts exemplary lab distribution age trends. Shown are lab percentiles (color) by age (x-axis) for females (left) and males (right). The median lab value is depicted by a black line. Standard lab ranges are marked for reference as dashed lines. Schematic 408 depicts a summary of age trends. Median age-controlled lab values were normalized given the matching distributions in healthy 20yo individuals. The heatmap depicts clusters of lab tests showing correlated age-linked trends. Data is from males.

FIG. 5 depicts exemplary empirical lab test distributions of data used in the experiments, in accordance with some embodiments of the present invention. Schematic 502 depicts an exemplary log hazard ratio derived by a Cox proportional hazards regression model using normalized lab values. Hazard ratios were computed for all healthy patients at age 60-65, grouped by mean normalized lab levels reported in the year prior to the index date (Jan 1st 2011). For reference, hazard ratios are also reported for healthy patients with above (below) normal range labs in the same time period in red (blue). Confidence intervals represent standard deviations.
Schematic 504 depicts an exemplary UMAP projection of multivariate normalized labs for healthy patients. Each dot represents a single patient's measurements. Color coding represents lab test age-normalized value (see Methods section below).

FIG. 6 depicts an exemplary personalization index conservation and heritability of data used in the experiments, in accordance with some embodiments of the present invention. Schematics 602 depict an exemplary process for computing personalization indices and EHR-based heritability (h2). Age-biological sex normalized labs were used to compute the personalization index (left) for 2 years (dark turquoise) and 5 years (light turquoise) by sampling a normalized lab value per age and comparing it to a sampled normalized value in the corresponding time window for all "healthy" time-intervals. The personalization index is the Pearson correlation computed per age range and biological sex. Narrow sense heritability h2 (right) was computed using tests in patient health intervals only (see FIG. 4). For each lab, Inventors computed correlations between offspring average test levels and parents' average test levels. For each individual Inventors considered all data points in his/her healthy time interval. Differences in parent/child age distributions were accounted for by transforming raw data to age/biological sex controlled percentiles. Schematic 604 shows lab test personalization indices stratified by duration (color) and age (x-axis). Schematic 606 compares personalization indices in males age 40-45 (x-axis) to the change in personalization between age 40-45 and age 70-75 (y-axis). Schematic 608 compares EHR-heritability based on data on young individuals (<50yo) and older individuals (>50yo), color-coded according to three ranges of test personalization index as in schematic 606.

FIG. 7 depicts exemplary lab regression prediction models (an implementation of the ML model) used in the experiments, in accordance with some embodiments of the present invention. Schematic 702 depicts an exemplary regression model. For learning lab regression models Inventors use data points (empty ovals) from patient trajectory segments (depicted as a line) classified as healthy and showing within-norm levels at least until the prediction date. Inventors use data in a time window 2-6 years prior to the prediction date, and ignore data acquired in the two years immediately prior to the prediction date. Sparse profiles are imputed (filled ovals) using a prediction model working at a 6 month time interval. Schematic 704 shows the r2 values for different regression models predicting lab test values two years forward in time for patients of different ages (x-axis, left male/female panels). Also shown are r2 values as a function of historical lab test data availability, ranging from a single year covered at time of prediction to four years covered consecutively at time of prediction.

FIG. 8 depicts exemplary abnormal lab classification models (i.e., implementations of the ML models) used in the experiments, in accordance with some embodiments of the present invention. Schematic 802 depicts abnormal lab prediction models that use training data from patient trajectories classified as healthy and within-norm at least up to 2 years prior to the prediction date. The models aim at prediction of normal/abnormal lab status based on patient histories 2-6 years prior to the prediction date. Schematic 804 shows the relative risk values for observing abnormal lab values two years from the prediction date using models with increasingly more features. Each plot depicts estimated relative risk based on cross validation per age and biological sex. Schematic 806 depicts the cumulative percentage of new colon cancer cases for controls (no anemia) and three microcytic anemia patient cohorts, all 60-70yo.
Confidence intervals are shown in lighter colors. Using normalized lab histories, the model clears low-risk microcytic anemia cases and prioritizes colonoscopy follow-up for high-risk patients. Schematic 808 is similar to 806, and presents outcomes for patients with macrocytic anemia, generated using a prediction model trained for MDS applied to patients aged 60 to 80 years old (yo). Schematic 810 depicts the cumulative percentage of new T2DM diagnoses in five cohorts. The analysis considered all healthy patients at index date 1.1.2011 between ages 50 and 60 with normal fasting glucose (FG < 100) at least until 1.1.2009. High risk cohorts include patients identified as at risk (2y predict abnormal) or at increased risk (2y abnormal severe) based on 2009 data (compare to the low risk cohort, 2y predict normal). Controls are patients observed in practice with glucose levels higher than 100 in 2011 (observed 100-110) and patients with persisting normal glucose levels (observed <100). Schematic 812 is similar to 810, and depicts outcomes for chronic renal failure occurrence in patients with normal levels of creatinine 2y prior to the index date who were predicted to develop abnormally high creatinine (creatinine > 1.3 for males, creatinine > 1.1 for females) at the index date. Controls are patients stratified by eGFR levels given test data on the index date.
METHODS

Approximation of health periods on a population scale

Patient's health status was established according to the following procedure:

1. Initialization of severe chronic patient cohorts
Identification of recurrent ICD 9 codes with reduced survival – each ICD 9 diagnosis 4-digit code (format xxx.y) that was assigned to at least 1000 different patients was considered to reflect reduction in survival if at least 10% of the adult population (age > 18) showed:
• 10% reduction in 5 year survival
• Average number of days between consecutive diagnosis code < 3 years (this was performed to remove transient temporary irregular health states)
• Expected number of diagnosis per year according to average number of days between consecutive diagnosis per person < 4 * average number of diagnosis per year

2. Refinement of cohorts
• For each of the above cohorts and each medication, age/biological sex stratified relative risk was computed for being in the cohort given that the medication was prescribed in the (-3, -0.5) year time period prior to onset. For each medication with a weighted relative risk > 30 over all age groups and number of patients with medication > 0.5 of total number of patients, the time of onset was adjusted to the first time the medication was prescribed.
• For each of the above cohorts and each ICD 9 diagnosis 4-digit code, age/sex stratified relative risk was computed for being in the cohort given the ICD 9 code in the (-2.5, -0.5) year time period prior to onset. Each diagnosis with a weighted relative risk > 10 in over 20% of the total number of patients in the cohort was excluded from the cohort.
• Additional refinement of cohorts was performed by manual review.

3. Extraction of non-healthy status ICD 9 codes from cohort
All ICD 9 codes contributing patients to any severe chronic cohort were identified.

4. Pregnancy status
Inferred from pregnancy and delivery ICD 9 codes (V22, V23, V27, V30-V37, 633-637, 640-645, 647-676), Clalit-specific codes and child birth data. When the exact gestation period is unknown, a 42 week gestation period is assumed.

5. Compute health status – For each patient (age between 20 and 100) mark every month (30 days) whether this patient was not diagnosed with any of the non-healthy ICD 9 codes in the past or in the next 6 months, and was not pregnant in the last 30 days.
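The monthly health-status marking of step 5 can be illustrated with a minimal Python sketch. The function name, the data layout, and the reading of "next 6 months" as 6 x 30 days are assumptions made for illustration only and are not taken from the described implementation.

```python
from datetime import date, timedelta

MONTH = timedelta(days=30)        # the text defines a month as 30 days
LOOKAHEAD = timedelta(days=180)   # assumption: "next 6 months" read as 6 x 30 days


def monthly_health_status(start, end, dx_dates, pregnancy_intervals):
    """Return a list of (month_start, is_healthy) tuples (hypothetical helper).

    dx_dates: dates of diagnoses carrying a non-healthy ICD 9 code.
    pregnancy_intervals: list of (first_day, last_day) pregnancy periods.
    """
    first_dx = min(dx_dates) if dx_dates else None
    status = []
    t = start
    while t < end:
        # Non-healthy once any flagged diagnosis lies in the past or within the look-ahead.
        diagnosed = first_dx is not None and first_dx <= t + LOOKAHEAD
        # Pregnant at any point during the 30 days ending at t.
        pregnant = any(a <= t and b >= t - MONTH for a, b in pregnancy_intervals)
        status.append((t, not diagnosed and not pregnant))
        t += MONTH
    return status
```

Lab tests would then be kept for the healthy reference only when they fall in a month flagged as healthy.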
Detection and filtering of drug-lab pairing

For each lab and for each ATC-5 medication code with at least 100 patients, the average percentile of the test value is extracted per patient in the 30 days prior to the first medication prescription fulfillment ("before") and 3 months after ("after") for all non-pregnant patients between the age 20 and 100. If the number of patients with non-missing "before" and "after" values > 10, the mean absolute difference between "before" and "after" > 0.05, and the corrected (Holm-Bonferroni) Wilcoxon test p-value < 0.01, then a pairing is set between drug and lab (Table S3). Medication filtered lab tests are all lab tests conducted on patients that were not prescribed any lab-paired medication in the time period of 6 months before and 1 month after the lab test. Some examples include METFORMIN medication in GLUCOSE-BLOOD test and SIMVASTATIN medication in CHOLESTEROL. It is noted that the pairing of drug-lab does not necessarily imply causality; for example, PEMETREXED, a chemotherapy medication, was found to be associated with HGB.
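The three-part pairing criterion above can be sketched as follows. The raw Wilcoxon p-values are taken as given inputs, the function and key names are hypothetical, and applying the Holm-Bonferroni correction jointly over all candidate drug-lab pairs is an assumption, since the text does not state the family of tests that was corrected.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k) and enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def paired_drug_labs(candidates, min_patients=10, min_shift=0.05, alpha=0.01):
    """candidates: dicts with keys drug, lab, n, mean_abs_diff, p (raw p-value)."""
    adj = holm_adjust([c["p"] for c in candidates])
    return [
        (c["drug"], c["lab"])
        for c, p_adj in zip(candidates, adj)
        if c["n"] > min_patients and c["mean_abs_diff"] > min_shift and p_adj < alpha
    ]
```

Here `mean_abs_diff` is expressed in percentile units (0 to 1), matching the 0.05 threshold in the text.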
Raw lab tests normalization

For each lab, age between 20 and 100 (1 year resolution), and sex (male or female), the cumulative distribution of test values was computed, normalizing raw values to percentiles, excluding tests that were conducted on non-healthy patients or pregnant women (see above), medication filtered tests, and tests conducted outside community lab working hours (7am – 2pm).
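A minimal sketch of this normalization is an empirical cumulative distribution per lab, age (in years) and sex stratum. The function names and the tuple layout are hypothetical, and the input is assumed to be already restricted to the healthy, medication-filtered, working-hours records.

```python
from bisect import bisect_right
from collections import defaultdict


def build_reference(healthy_tests):
    """healthy_tests: iterable of (lab, age, sex, value) from the filtered records."""
    ref = defaultdict(list)
    for lab, age, sex, value in healthy_tests:
        ref[(lab, age, sex)].append(value)
    for values in ref.values():
        values.sort()
    return ref


def to_percentile(ref, lab, age, sex, value):
    """Empirical CDF: fraction of reference values <= value in the matching stratum."""
    values = ref.get((lab, age, sex), [])
    if not values:
        raise KeyError("no reference data for this lab/age/sex stratum")
    return bisect_right(values, value) / len(values)
```

A raw value can then be interpreted relative to healthy peers of the same age and sex rather than against a fixed normal range.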
Clustering of median lab age-sex trends

For each lab, the median age trends were computed for ages 20-90 (1 year resolution) for males and females separately. The age trend was normalized by subtracting the value at age 20 and dividing by the general population spread (interquartile range, quantile 75 – quantile 25) per sex. K-means clustering was applied for each sex independently.
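The trend normalization that precedes clustering can be sketched as below. The function name is hypothetical, and the quartile estimator (the default of Python's `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
from statistics import quantiles


def normalize_age_trend(median_by_age, population_values, base_age=20):
    """Subtract the median at base_age and divide by the population interquartile range.

    median_by_age: dict mapping age -> median lab value for one sex.
    population_values: raw lab values of the general population for the same sex.
    """
    q1, _, q3 = quantiles(population_values, n=4)
    iqr = q3 - q1
    base = median_by_age[base_age]
    return {age: (m - base) / iqr for age, m in sorted(median_by_age.items())}
```

The resulting per-lab vectors (one value per age) would be the rows passed to a k-means routine, separately for each sex.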
Personalization index

For each lab, sample a normalized test (see above) by age (1 year resolution). Considering only patients between the age 20 and 90, compute the correlation in normalized lab values stratified by the difference in years (2, 5, 10) between the two tests, age and sex.

h2 estimation

For each lab, compute the linear regression coefficient of mid-parent and offspring average quantile normalized values within a specific age range (20-90 for full range, 20-50 for young range, 50-80 for old range).
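Both quantities reduce to simple statistics on paired normalized values. The sketch below uses hypothetical function names and plain lists: the personalization index is a Pearson correlation between a patient's normalized value at one age and the value sampled 2, 5 or 10 years later, and the h2 estimate is the least-squares slope of offspring values on mid-parent values.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def midparent_slope(father, mother, offspring):
    """Least-squares slope of offspring value on the mean of the two parents."""
    mid = [(f + m) / 2 for f, m in zip(father, mother)]
    n = len(mid)
    mx, my = sum(mid) / n, sum(offspring) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(mid, offspring))
    sxx = sum((a - mx) ** 2 for a in mid)
    return sxy / sxx
```

In use, `pearson` would be called once per lab, age range, sex and time gap, and `midparent_slope` once per lab and age range.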
Lab Regression models

Sample healthy population for model training and testing

For each different lab test, consider only tests that were conducted on patients between the age 20 and 90, excluding tests that were conducted on non-healthy patients or pregnant women (see above), medication filtered tests, and tests conducted outside community lab working hours (7am – 2pm). For each patient, sample a single time at which the test was conducted and for which a prior test value was measured in the prior 2 to 3 year time window (for 2 year prediction) or 6 months to 2 years prior to the sampled lab test (for 6 month prediction). The sampled population was down-sampled for uniform age distribution (5 year resolution) and divided into 5 folds controlling for age and sex distribution across all folds.
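One possible reading of the down-sampling and fold assignment is sketched below. Down-sampling every 5-year age bin to the size of the smallest bin, and assigning folds round-robin within each age-bin/sex stratum, are assumptions made for illustration; the function name and tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def downsample_and_fold(patients, n_folds=5, bin_width=5, seed=0):
    """patients: list of (patient_id, age, sex). Returns {patient_id: fold}."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for pid, age, sex in patients:
        by_bin[age // bin_width].append((pid, sex))
    # Assumption: a uniform age distribution means equal counts in every age bin.
    target = min(len(members) for members in by_bin.values())
    folds = {}
    for age_bin in sorted(by_bin):
        kept = rng.sample(by_bin[age_bin], target)
        for sex in sorted({s for _, s in kept}):
            same_sex = [pid for pid, s in kept if s == sex]
            for i, pid in enumerate(same_sex):
                # Round-robin spreads each age/sex stratum across the folds.
                folds[pid] = i % n_folds
    return folds
```

Patients dropped by the down-sampling simply do not appear in the returned mapping.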
Age/biological sex prediction model (m1)

5-fold cross validation of age (1 year resolution) and biological sex stratified cumulative distribution models trained on sampled lab populations. Predictions were sampled according to age and biological sex.

Single lab single time point model (m2)

5-fold cross validation gradient boosting tree models were used with 3 features: age, biological sex and average lab value in the prior 2-3 year time window (for 2 year prediction) or 6 month to 2 years' time window (for 6 month prediction). Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=750
• subsample=1
• max_depth=2
• colsample_bytree=1
• eta=0.025
• eval_metric=rmse
• min_child_weight=3

Multi lab single time point model (m3)

5-fold cross validation gradient boosting tree models were used with features including age, biological sex and average lab values in the prior 2-3 year time window (for 2 year prediction) or 6 month to 2 years window (for 6 month prediction) for 92 different lab tests. Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=1250
• subsample=0.75
• max_depth=5
• colsample_bytree=0.8
• eta=0.01
• eval_metric=rmse
• min_child_weight=2

Lab imputation via 6-month regression model

The 6 month regression model was used to predict missing lab values every 6 months at times when an actual lab test was not available in the time window (-6 month, 0) but either imputed or actual lab test data existed in the time window (-2 years, -6 month), which is used as an input feature (along with age and sex) to the single lab single time point regression model m2. Via an iterative process, all possible values are imputed by applying the appropriate cross validation model (each patient was not used for training in at least one of the models) and inferring the predicted lab value.
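The iterative imputation can be pictured as a forward pass over a 6-month grid. In the sketch below, `predict` is a hypothetical stand-in for the 6-month m2 regression model (age and sex inputs are omitted for brevity), and the grid indexing, in which the window (-2 years, -6 month) corresponds to the three preceding grid steps, is an assumption about how the windows are discretized.

```python
def impute_grid(observed, n_steps, predict):
    """Fill a 6-month grid forward in time.

    observed: dict mapping grid index -> measured normalized lab value.
    n_steps:  number of consecutive 6-month windows to cover (indices 0..n_steps-1).
    predict:  callable(mean_of_history) -> predicted value; stands in for model m2.
    """
    filled = dict(observed)
    for k in range(n_steps):
        if k in filled:
            continue  # an actual measurement exists in this 6-month window
        # Actual or previously imputed values in the three preceding windows.
        history = [filled[j] for j in range(k - 3, k) if j in filled]
        if history:
            filled[k] = predict(sum(history) / len(history))
    return filled
```

Because imputed values are written back into `filled`, a single measurement can seed predictions for every later window, which is the iterative behaviour described above.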
Multi lab multi time point model (m4)

Similar to multi lab single time point model m3, 5-fold cross validation gradient boosting tree models were used on imputed data, averaging normalized lab values every 6 months in the 6 years to 2 years prior to prediction. For each fold, the top 15 labs were selected according to the corresponding m3 model mean absolute Shapley values of that fold. Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=1950
• subsample=0.5
• max_depth=6
• colsample_bytree=0.6
• eta=0.01
• eval_metric=rmse
• min_child_weight=3
• gamma=0.7

Abnormal lab classification models

Sample healthy population for model training and testing

For each lab test, consider only tests that were conducted on patients between the age 20 and 90, excluding tests that were conducted on non-healthy patients or pregnant women (see above), medication filtered tests, and tests conducted outside community lab working hours (7am – 2pm). For each patient, sample a single time at which the test was conducted and for which all prior test values measured 2 to 3 years prior, 3 to 4.5 years prior and 4.5 to 6 years prior were in normal range. The sampled population was down-sampled for matching normal/abnormal age/biological sex distribution (at 5 year resolution) and divided into 5 folds controlling for age and biological sex distribution across all folds.

Simple age/sex abnormal lab classification model (m1)

5-fold cross validation gradient boosting tree models were used with 2 features: age and biological sex.
Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "binary:logistic" objective using the following hyper-parameters:
• nrounds=50
• subsample=0.7
• max_depth=3
• colsample_bytree=1
• eta=0.05
• eval_metric=auc
• min_child_weight=1
• gamma=10
• max_delta_step=1

Single lab single time point abnormal lab classification model (m2)

5-fold cross validation gradient boosting tree models were used with 3 features: age, biological sex and a prior lab value in the 2-3 years' time window prior to prediction. Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=1000
• subsample=0.7
• max_depth=4
• colsample_bytree=1
• eta=0.1
• eval_metric=auc
• min_child_weight=1
• gamma=1

Multi lab single time point abnormal lab classification model (m3)

5-fold cross validation gradient boosting tree models were used with features including age, biological sex and average prior lab values in the 2-3 years' time window prior to prediction for different lab tests. Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=1000
• subsample=0.5
• max_depth=4
• colsample_bytree=0.9
• eta=0.05
• eval_metric=auc
• min_child_weight=1
• gamma=1

Multi lab multi time point abnormal lab classification model (m4)

Similar to the multi lab single time point abnormal lab classification model (m3), 5-fold cross validation gradient boosting tree models were used on imputed data, averaging normalized lab values in 6-month time windows 6 years to 2 years prior to prediction. For each fold, the top labs were selected according to the corresponding m3 mean absolute Shapley values of that fold.
Training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "reg:linear" objective using the following hyper-parameters:
• nrounds=2500
• subsample=0.5
• max_depth=4
• colsample_bytree=0.9
• eta=0.01
• eval_metric=rmse
• min_child_weight=1
• gamma=10

Abnormal lab classification models relative risk comparison

All 4 different models, m1, m2, m3, and m4, were applied to the same population. For each model, the cutoff score was set according to 0.2 sensitivity for each 10-year age resolution and biological sex strata. Relative risk (rr) was computed for each age-group/sex according to:

$$rr = \frac{TP_{norm} / (TP_{norm} + FP_{norm})}{pop_{pos} / (pop_{pos} + pop_{neg})}$$

Where:
$pop_{pos}$ denotes number of positive cases found in general population (before down sampling)
$pop_{neg}$ denotes number of negative cases found in general population (before down sampling)
$TP_{norm} = TP \cdot pop_{pos} / (TP + FN)$
$FP_{norm} = FP \cdot pop_{neg} / (FP + TN)$

Abnormal glucose cumulative incidence of Diabetes Mellitus

Two multi lab multi time point abnormal lab classification models were computed for abnormal glucose levels. Both models required patients to have FG<100 in the 2-6 years period prior to prediction. One model considered abnormal glucose (positive) to be FG >= 100 while the prediabetes model considered FG >= 110 to be positive. Both models were applied to all patients between 50yo and 60yo at index date 1.1.2011. Patients were excluded if there was no recent glucose test (in the past year) or there were insufficient data in the previous 6-2y time frame to make a model prediction (required availability of 10 of the top 20 features used for model prediction). The model cutoff was set at 25% sensitivity by cross validation, stratified for each 10-year age group/biological sex. In addition to the model classification, patients were classified according to their latest measured FG.
Kaplan-Meier survival curves were computed using survminer R package 0.4.3.
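The relative risk comparison of the abnormal lab classification models can be sketched in a few lines. The sketch assumes the reading in which confusion-matrix counts from the down-sampled evaluation set are rescaled to the general-population counts, and the resulting precision is divided by the population prevalence; the function and argument names are hypothetical.

```python
def relative_risk(tp, fp, fn, tn, pop_pos, pop_neg):
    """Relative risk of a positive prediction, corrected for down-sampling.

    tp, fp, fn, tn: confusion-matrix counts on the down-sampled evaluation set.
    pop_pos, pop_neg: positive and negative case counts before down-sampling.
    """
    tp_norm = tp * pop_pos / (tp + fn)   # true positives rescaled to the population
    fp_norm = fp * pop_neg / (fp + tn)   # false positives rescaled to the population
    precision = tp_norm / (tp_norm + fp_norm)
    prevalence = pop_pos / (pop_pos + pop_neg)
    return precision / prevalence
```

Under this reading, a classifier that flags positives and negatives at the same rate yields a relative risk of 1, regardless of how strongly the evaluation set was down-sampled.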
Abnormal creatinine cumulative incidence of Chronic Renal Failure

A multi lab multi time point abnormal lab classification model was computed for abnormal creatinine levels (creatinine > 1.3 for males, creatinine > 1.1 for females). The model required patients to have normal creatinine in the 2-6 years period prior to prediction. The predictive model was applied to all patients between 60yo and 70yo at index date 1.1.2011. Patients were excluded if there was no recent creatinine test (in the past year) or there were insufficient data in the previous 6-2y time frame to make a model prediction (required availability of 10 of the top features used for model prediction). The model cutoff was set at 25% sensitivity by cross validation, stratified for each 10-year age group/biological sex. In addition to the model classification, patients were classified according to their latest measured eGFR. Kaplan-Meier survival curves were computed using survminer R package 0.4.3.
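Setting a cutoff at a fixed sensitivity per stratum, as used for the glucose and creatinine models, can be illustrated as follows. The function names and record layout are hypothetical, and the choice of the highest threshold whose sensitivity is at least the target is an assumption about how ties and rounding are handled.

```python
import math


def cutoff_for_sensitivity(scores, labels, sensitivity=0.25):
    """Highest score threshold at which at least `sensitivity` of positives score >= it."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    k = max(1, math.ceil(sensitivity * len(positives)))
    return positives[k - 1]  # raises IndexError if the stratum has no positives


def stratified_cutoffs(records, sensitivity=0.25):
    """records: (age, sex, score, label); one cutoff per 10-year age group and sex."""
    groups = {}
    for age, sex, score, label in records:
        groups.setdefault((age // 10 * 10, sex), []).append((score, label))
    return {
        key: cutoff_for_sensitivity(*zip(*rows), sensitivity)
        for key, rows in groups.items()
    }
```

Patients scoring at or above the cutoff of their own stratum would be the ones classified as predicted abnormal.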
Prediction of Colon Cancer in patients with microcytic anemia

Patients were considered to have microcytic anemia if the measured hemoglobin (HGB) lab test was below normal (<12 for females (excluding pregnancy), and <14 for males) and mean corpuscular volume (MCV) was less than 80. Colon cancer onset was defined as the first time a patient was diagnosed with colon cancer with no prior cancer diagnosis. A 3-6 month colon cancer prediction model was computed for patients with microcytic anemia at least 3 months prior to cancer onset, without prior ICD 9 code for THALASSEMIAS (282.4) or IDIOPATHIC PROCTOCOLITIS (556). Predictor model features included all 92 labs sampled every months from 3 years prior to prediction. 5-fold cross validation training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "binary:logistic" objective using the following hyper-parameters:
• nrounds=5000
• subsample=0.7
• max_depth=6
• colsample_bytree=0.7
• eta=0.001
• eval_metric=auc
• min_child_weight=2
• gamma=0.1
• lambda=0.01
• alpha=0.01

The colon cancer prediction model was applied to all patients between the age 60-70 with microcytic anemia, no thalassemia and no documented proctocolitis in the Clalit database at 1.1.2011, and patients were classified according to score quantile into low score (score quantile < 0.4), medium score (score quantile >= 0.4, score quantile < 0.8), and high score (score quantile >= 0.8). Colon cancer incidence in microcytic anemia patients was compared to incidence in the general population via Kaplan-Meier survival computed using survminer R package 0.4.3.
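The score-quantile stratification into low, medium and high groups can be sketched as below. Defining a patient's score quantile as the fraction of the cohort with a strictly lower score is an assumption, and the function name is hypothetical; the default thresholds are those given for the colon cancer model.

```python
from bisect import bisect_left


def score_quantile_classes(scores, low=0.4, high=0.8):
    """Label each score 'low', 'medium' or 'high' by its quantile within the cohort."""
    ranked = sorted(scores)
    n = len(ranked)
    classes = []
    for s in scores:
        q = bisect_left(ranked, s) / n  # fraction of patients scoring strictly lower
        classes.append("low" if q < low else "medium" if q < high else "high")
    return classes
```

Each resulting group would then be followed separately when computing cumulative incidence curves.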
Prediction of MDS in patients with macrocytic anemia

Patients were considered to have macrocytic anemia if the measured hemoglobin (HGB) lab test was below normal (<12 for females (excluding pregnancy), and <14 for males) and mean corpuscular volume (MCV) was larger than 100. MDS onset was defined as the first time a patient was diagnosed with myelodysplastic syndrome with no prior cancer diagnosis. A 3-6 month MDS prediction model was computed for patients with macrocytic anemia at least 3 months prior to cancer onset. Predictor model features included all 92 labs sampled every 12 months from years prior to prediction. 5-fold cross validation training was done via xgboost 0.81.0.1 R package with "gbtree" booster, and "binary:logistic" objective using the following hyper-parameters:
• nrounds=5000
• subsample=0.7
• max_depth=6
• colsample_bytree=0.7
• eta=0.001
• eval_metric=auc
• min_child_weight=2
• gamma=0.1
• lambda=0.01
• alpha=0.01

The MDS prediction model was applied to all patients between the age 60-70 with macrocytic anemia in the Clalit database at 1.1.2011, and patients were classified according to score quantile into low score (score quantile < 0.25), medium score (score quantile >= 0.25, score quantile < 0.9), and high score (score quantile >= 0.9). MDS incidence in macrocytic anemia patients was compared to incidence in the general population via Kaplan-Meier survival computed using survminer R package 0.4.3.
UMAP projection

For each lab test, consider only tests that were conducted on patients between the age 20 and 90, excluding tests that were conducted on non-healthy patients or pregnant women (see above), medication filtered tests, and tests conducted outside community lab working hours (7am – 2pm). For each patient, a single age (1 year) was sampled, in which all labs considered were measured at least once (N=266,170), and normalized test values (age/biological sex quantiles) were transformed by ln(q/(1-q)) and projected using uniform manifold approximation and projection (UMAP) via uwot R package version 0.1.9 with the following parameters:
• n_neighbors=20
• a=2
• b=1.5

Socio-economical stratification

1) Associate each patient with the clinic they visited most times during the healthy period (number of visits > 3).
2) For each clinic with over 500 patients compute:
• The average number of children (quantile normalized by biological sex and year of birth) for people born between 1950 and 1970
• The average age of parent when first child was born (quantile normalized by biological sex and year of birth) for all people born between 1950 and 1970
• For each patient, the median BMI measured between the age 50 and 75 was considered. Median BMI for all people between the age 50 and 75 (quantile normalized by biological sex and year of birth) excluding outliers (bmi < 5 and bmi > 100)
• Median age at death – computed by Kaplan-Meier survival fit for each clinic.
• Socio-economic index of the clinic city according to the central bureau of statistics in Israel
3) Clinics were clustered according to the above features using k-means clustering into different socio-economic clusters.
4) Patients were assigned to a socio-economic cluster according to their associated clinic.
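As a small illustration of the ln(q/(1-q)) transform applied to the normalized quantiles before the UMAP projection described above, the following sketch maps a quantile to the logit scale. The clipping constant `eps` is an added assumption to keep the transform finite when an empirical quantile equals exactly 0 or 1; the text does not say how such values were handled.

```python
import math


def logit_quantile(q, eps=1e-6):
    """Map a quantile in [0, 1] to ln(q / (1 - q)).

    Clipping to [eps, 1 - eps] is an illustrative assumption, not part of the source.
    """
    q = min(max(q, eps), 1.0 - eps)
    return math.log(q / (1.0 - q))
```

The transform stretches the tails of the quantile scale, so that differences between, say, the 98th and 99.9th percentile are not compressed relative to differences near the median.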
RESULTS

VISOR-MD for modelling healthy untreated lab test ranges

Inventors developed a computational methodology for running a virtual survey on observational retrospective big medical data (VISOR-MD) (see 402 of FIG. 4, and the Methods section). The VISOR-MD pipeline stratifies lab test trends in generally healthy individuals. This is achieved by automated large-scale filtering and annotation of data acquired from patients at times when they were neither diagnosed with a condition that affects their age- and biological sex-matched prognosis nor known to be treated by a drug directly affecting the test results, for creating the filtered dataset, as described herein (e.g., see Methods section). In some cases, the systematic exclusion of records acquired under the effect of a major disease or drug cannot completely control for missing data, sampling bias and errors. This approach also does not compensate for multiple hidden factors (e.g., demographic effects) that are only indirectly recorded in the system. Nevertheless, VISOR-MD massively corrects for over-representation of patients with chronic diseases, while enabling analysis at a very large scale compared to prospective survey strategies, which enables at least some implementations described herein to be used while providing sufficiently statistically accurate outcomes. Inventors applied VISOR-MD to the de-identified clinical histories of 2.8M adults (age >=20) with total follow-up time of 43M person-years in the Clalit system. Inventors observed that while the overall rate of lab testing in the population increases with age, the total data for bona fide healthy individuals in the survey is stable for ages 20-59 (range 324,147 – 1,390,114 samples per lab test per 10 years age group) and declines for ages >60, when progressively more patients are under the effect of at least one chronic condition or medication (see 404 of FIG. 4).
Even for ages >60 Inventors obtained sufficient lab data on individuals that were not identified with chronic disease, allowing systematic inference of rich multivariate models across the age spectrum of 20 year old (yo) to yo.

Inventors studied the empirical ranges inferred by VISOR-MD for six representative common tests (see 406 of FIG. 4). The data show how the standard normal ranges (fixed for age > 20, marked as dashed lines in 406 of FIG. 4) [24–27] poorly describe patients' empirical distributions. Variation in the distribution within and outside of the norm, for each lab test, was correlated with age and biological sex, providing quantitation for many known qualitative effects. For example, age-related anemia is quantified for males (52% at 80yo with hemoglobin levels below the standard norm of 14g/dl) and females (25% at 80yo below 12g/dl). Similarly, transitions linked with perimenopause are quantified for women (e.g., 16% at 40yo with RDW above 15). The data also characterize the degree of conformance with standard ranges that are set as clinical goals, for example for cholesterol (54% of the females at 60yo above the 2threshold) and glucose (38% of the males at 60yo above 100 mg/dl). In other cases, Inventors' observations suggest transient age-linked trends that are only partially documented in the literature, as exemplified by age-linked alanine aminotransferase (ALT) trends (peaking in males at 35yo with 21% having ALT higher than 40u/L, with a slow decrease in older ages [28]). These examples provide evidence that by using VISOR-MD, age-related lab test distributions may be systematically re-evaluated, their dynamics classified, and new adjusted and quantitative reference values provided, for example, as described herein with reference to at least some embodiments.
The Clalit-reference model for age-adjusted lab test ranges

Global biological sex- and age-trends in common laboratory tests have been previously described [26,29–32], and efforts have been made to associate these effects with physiological and pathological processes [33,34]. However, this previous analysis differs from at least some implementations described herein, which use data of healthy individuals and exclude data of non-healthy individuals. Using the extensive Clalit data and the VISOR-MD pipeline, Inventors estimated age-controlled distributions for 92 tests, providing a new high-resolution and consistent reference model for use in multiple applications. As shown in 408 of FIG. 4, Inventors normalized age-trends given the estimated empirical ranges of young adults (20yo) and organized all common lab test age-trends into clusters sharing common age-linked trajectories. Interestingly, significant age-related variation was observed for 71 (81) out of 92 tests for males (females) (KS test, Holm-adjusted p-value < 0.01, D > 0.05, comparing young age 25-30 to old age 70-75). As expected, many tests showed monotonic age-dependent decrease (male clusters 1-4) or increase (male clusters 6-8). In females, trends representing alterations linked with the menopause period dominated multiple clusters (post-menopause increase in clusters 8-10 and decrease in clusters 2-3). More surprisingly, age trends of transient reduction or transient elevation (clusters 3 and 8, respectively) were also observed for males. For example, bilirubin and TSH levels showed a decrease toward minimal levels at ages 35-50. GGT and triglycerides, on the other hand, both showed a transient increase peaking at ages 45-50.

To test how much of the above age-related dynamics can be explained by additional patient characteristics, Inventors analyzed BMI and socio-demographic factors.
Inventors note that even after adjustment for BMI or socio-demographic classification, age-related variation remained significant for 69 out of 92 tests. Thus, Inventors conclude that the definition of normal lab test ranges does not represent the empirical population distribution even when considering only individuals that are classified as generally healthy. The normal ranges might still reflect some absolute physiological goal for lab values, but the lack of concordance between these ranges and the empirical distributions derived from the cohort of healthy population suggests an opportunity for improving patient evaluation by quantitative interpretation of test values given a richer reference model.

Clinical impact of age-linked empirical lab test distributions

Normal ranges for lab tests represent clinical goals and thresholds for triggering diagnostic and therapeutic intervention, but evidence-based optimization of such ranges in a patient-specific fashion is hard to define. Inventors therefore sought to evaluate the clinical relevance of age-dependent lab test distributions as refinements of absolute thresholds. Inventors classified healthy patients into bins defining age- and biological sex-normalized lab test percentiles and used 8 years of follow-up to compute all-cause mortality Cox hazard ratios in each bin (see 502 of FIG. 5). This was compared to hazard ratios computed for patients with tests above or below the absolute norms (see 502 of FIG. 5, red and blue dots). While patients were included in this analysis only when classified as generally healthy, Inventors discovered the surprising finding of remarkable and pervasive elevated risk given either high or low, but within-norm, lab test percentiles for multiple test types. For example, RDW levels span a continuum of hazard ratios across the entire empirical distribution, suggesting a quantitative effect generalizing previous observations [35,36].
In another example, Inventors corroborate (on a far larger sample size) that ALT levels in the lower 30 percentiles of the empirical distribution are associated with elevated hazard ratios [37]. When multi-variate analysis of normalized tests is introduced, as demonstrated by UMAP analysis in 504 of FIG. 5, the potency of patient modelling given seemingly normal lab readouts is further enhanced. These analyses suggest that quantitative interpretation of age-controlled lab tests can lead to more precise risk stratification, diagnostic protocols and even therapeutic decisions.
Conservation of personalized test values

Within-norm variation, as shown in FIG. 5, diversifies patients within their age groups and can be correlated with clinical outcome. Inventors expect patients to conserve some of their age-normalized lab readouts longitudinally, such that individualized changes rather than absolute levels may attain clinical relevance. To begin modelling such effects, Inventors defined the test personalization indices 650 of FIG. 6 as the longitudinal correlation of age-normalized tests for individuals resampled over 2, 5 or 10 years. Inventors hypothesized that personalization will differ when estimated from younger or older patient groups, and therefore calculated these indices separately, stratified by age decade (see 604 of FIG. 6). Personalization index data demonstrated clearly that for most tests, patients conserve their normalized test levels across prolonged periods of time. As expected, lower personalization is observed for tests with small absolute variation (e.g., electrolytes, 606 of FIG. 6), but high personalized conservation is observed for tests in which inter-individual variation is pronounced and changing with age (e.g., MCH, MCV). Comparison of personalization indices that were computed for different tests for a cohort of 40yo males and an independent cohort of 70yo males (see trends in 606 of FIG. 6) showed that conservation of inter-individual variation increases with age for most tests. This can be explained by chronic aging processes that accumulate and affect clinical trajectories in older ages, even in patients who were classified as generally healthy. Analysis of matching trends in females showed similar trends in addition to noticeable perimenopause-related effects.
The decline of personalization indices for some tests at ages >70 may be explained by a larger part of the population leaving the healthy cohort toward defined disease and/or poorer survival, as also suggested by the stratified hazard analysis above. Taken together, the data suggest that within-norm test level conservation is already apparent in young adults (where genetic and heritable effects are likely to dominate it), but increases alongside possible chronic age-related processes at older ages. Both genetic and age-related effects can be informative for understanding the potential for disease.
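The personalization index described above (the longitudinal correlation of age-normalized values for individuals resampled after a fixed gap, stratified by age decade) can be illustrated with a minimal sketch. The function names, the tuple layout of `measurements`, the `tolerance` parameter and the choice of the earliest in-decade measurement as baseline are assumptions made for illustration only; they are not taken from the Methods.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def personalization_index(measurements, decade_start, gap_years, tolerance=0.5):
    """Correlate baseline and follow-up age-normalized values of ONE lab test.

    measurements: iterable of (patient_id, age_years, normalized_value).
    Baseline (assumed rule): earliest measurement inside the age decade.
    Follow-up: measurement closest to baseline age + gap_years, within tolerance.
    Returns None when fewer than three patients contribute a pair.
    """
    by_patient = defaultdict(list)
    for patient_id, age, value in measurements:
        by_patient[patient_id].append((age, value))

    baseline, follow_up = [], []
    for history in by_patient.values():
        history.sort()
        in_decade = [(a, v) for a, v in history
                     if decade_start <= a < decade_start + 10]
        if not in_decade:
            continue
        base_age, base_value = in_decade[0]
        candidates = [(abs(a - base_age - gap_years), v) for a, v in history
                      if abs(a - base_age - gap_years) <= tolerance]
        if candidates:
            baseline.append(base_value)
            follow_up.append(min(candidates)[1])

    if len(baseline) < 3:
        return None
    return pearson(baseline, follow_up)

# Toy data (hypothetical): four patients in their forties, resampled ~5 years later.
toy = [
    ("a", 41.0, 0.20), ("a", 46.1, 0.25),
    ("b", 43.0, 0.80), ("b", 48.0, 0.75),
    ("c", 45.5, 0.50), ("c", 50.4, 0.55),
    ("d", 47.0, 0.10), ("d", 52.2, 0.15),
]
index_5y = personalization_index(toy, decade_start=40, gap_years=5)
```

In this sketch a value near 1 indicates that patients keep their rank within the age-normalized distribution over the gap, which is the sense of "conservation" used in this section.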
Heritability of age-normalized lab test trends. Inventors next sought to delineate the contribution of genetic factors to test personalization trends. To that end, Inventors calculated narrow-sense heritability (h2) for the age- and sex-normalized common labs in healthy individuals (i.e., probands 652 of 602 of FIG. 6) for whom test results were also available for both parents (N ranging from 634 to 610,227; see Methods). h2 values ranged from 0.06 (LH) to 0.65 (MCH and Amylase). Inventors compared their h2 estimates to two published databases, RIFTEHR [38] and UK Biobank (UKBB) [39,40], with overall Pearson correlations of 0.5 and 0.37, respectively. Interestingly, Inventors' h2 estimates were generally higher than the RIFTEHR and UK Biobank values. This might arise from the different methodologies used to estimate h2 (LD score regression for GWAS data in the UKBB [41], and indirect inference of family pedigrees in RIFTEHR [38]), or from differences in the sampled population characteristics (e.g., the UKBB is enriched for older and healthier volunteers [42], in RIFTEHR the phenotypes were obtained from inpatient settings, and the Clalit cohort includes several genetically coherent groups). Inventors note that the estimations are based on phenotypes that are age- and sex-adjusted, which allows both age and sex to be omitted as covariates in the linear model, a difference from the standard practice in the field. Inventors compared h2 values to personalization indices, showing a high degree of concordance that was more noticeable for indices computed for younger individuals (r=0.8 for indices at age 25-30, r=0.68 for indices at age 70-75). In order to further elucidate the contribution of aging to the heritability of lab values, Inventors compared h2 values of pedigrees of individuals at age 20- with those aged 50-80 (see 608 of FIG. 6).
Inventors observed a general trend of decreased estimated heritability with aging, which might arise from the contribution of accumulated pathological processes. Heritability analysis thereby provides additional support for the need for precision modelling of lab-test levels in older individuals, for distinguishing the genetic contribution to lab-test personalization at younger ages from age-related test diversification, and for determining lab value trajectories prior to formal diagnosis of a disease.
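The heritability section states only that h2 was estimated with a linear model on age- and sex-normalized values of probands with both parents tested; the exact estimator is described in the Methods and is not reproduced here. As an illustration of the general idea, the sketch below uses the classical midparent-offspring regression, in which the slope of offspring values on the mean of the two parental values serves as an estimate of narrow-sense heritability. The choice of this estimator, the function name and the toy data are assumptions.

```python
def midparent_heritability(trios):
    """Estimate h2 as the least-squares slope of offspring on midparent value.

    trios: list of (proband_value, mother_value, father_value), each value
    already age- and sex-normalized, so no age or sex covariate is included.
    NOTE: midparent regression is an assumed, textbook estimator; it is not
    claimed to be the estimator used by the Inventors.
    """
    midparent = [(mother + father) / 2 for _, mother, father in trios]
    offspring = [proband for proband, _, _ in trios]
    n = len(offspring)
    mean_x = sum(midparent) / n
    mean_y = sum(offspring) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparent, offspring))
    sxx = sum((x - mean_x) ** 2 for x in midparent)
    if sxx == 0:
        raise ValueError("midparent values have no variance")
    return sxy / sxx

# Toy trios (hypothetical) built so that offspring = 0.5 * midparent.
parents = [(1.0, 1.0), (-1.0, 1.0), (2.0, 0.0), (-2.0, -2.0), (0.0, 3.0)]
toy_trios = [(0.5 * (m + f) / 2, m, f) for m, f in parents]
h2_estimate = midparent_heritability(toy_trios)
```

Because the inputs are already normalized for age and sex, the regression has a single predictor, which mirrors the remark above that these covariates can be omitted from the linear model.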
Multivariate predictive models and personalized lab ranges. Inventors developed a set of computational predictors to compute a patient's expected age-linked test trajectories. Such calculators may quantitatively formalize the standard practice of examining patient lab histories prior to diagnosis or further testing. Inventors considered three regression strategies with increasing personalization for predicting new test results. In the simple single-lab/single-time model, predictions are based merely on the age- and sex-adjusted lab test value sampled 2-3 years before the current test (e.g., predict a patient’s HGB level at 1/1/20based on his/her HGB level sampled between 1/1/2008 and 1/1/2009). The multi-lab/single-time model could select additional predictive features from any other test for regressing the current test level, but again considered only tests obtained 2-3 years earlier. Finally, for inferring the multi-lab/multi-time model, Inventors performed imputation of missing data at 6-month resolution using a novel forward regression strategy (see Methods section), followed by construction of a regression model allowing unsupervised selection of features from the pool of all common lab tests from eight time periods (2-6 years prior to the current test) (see 702 of FIG. 7). Inventors used cross validation to test the outcome of the three regression strategies in different age groups (see 704 of FIG. 7). Overall, Inventors derived R2 values between 0.03 and 0.78, with tests showing over 40% and 27 tests showing over 60% of the variance explained in the multi-lab/multi-time model. As already suggested by analysis of the personalization indices, the models described herein showed variable performance as a function of age. Multivariate and temporal modelling generally improved model accuracy (e.g., HGB, RDW, cholesterol), but for some tests (e.g., ALT) the use of a single historical lab test was sufficient.
The multivariate nature of the model was further illustrated by analysis of the features contributing to prediction of each lab test value. Successful regression of within-norm lab test values opens the way for quantitative interpretation of new test results given patients’ lab histories. In addition to the standard comparison to absolute normal ranges, this advanced analysis strategy would allow highlighting cases where a new test result is significantly higher or lower than what would have been anticipated given the patient's complete testing history, as highlighted in the proposed patient analytics report, for example, as described with reference to FIG. 3.
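The simplest of the three strategies above, the single-lab/single-time model, together with the idea of flagging a new result that departs from what the patient's history predicts, can be sketched as follows. This is a minimal illustration under stated assumptions: ordinary least squares on one normalized historical value, a personal range defined as the prediction plus or minus a multiple of the residual standard deviation, and an in-sample R2 (the source reports cross-validated results). All function names and the `width` parameter are hypothetical.

```python
import math

def fit_single_lab_model(past, current):
    """Least-squares fit of the current normalized value on the value 2-3 years earlier."""
    n = len(past)
    mean_x, mean_y = sum(past) / n, sum(current) / n
    sxx = sum((x - mean_x) ** 2 for x in past)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(past, current)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(past, current)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in current)
    return {
        "slope": slope,
        "intercept": intercept,
        "resid_sd": math.sqrt(ss_res / (n - 2)),  # residual standard deviation
        "r2": 1 - ss_res / ss_tot,                # in-sample, not cross-validated
    }

def personal_range(model, past_value, width=2.0):
    """Assumed definition: prediction +/- width * residual SD."""
    center = model["intercept"] + model["slope"] * past_value
    half = width * model["resid_sd"]
    return center - half, center + half

def is_unexpected(model, past_value, new_value, width=2.0):
    """True when the new result falls outside the patient's personal range."""
    low, high = personal_range(model, past_value, width)
    return not (low <= new_value <= high)

# Toy training pairs (hypothetical normalized values).
past = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
current = [-0.7, -0.5, 0.1, 0.3, 0.9, 1.1]
model = fit_single_lab_model(past, current)
flag_small_change = is_unexpected(model, past_value=0.0, new_value=0.1)
flag_large_change = is_unexpected(model, past_value=0.0, new_value=1.0)
```

The multi-lab variants would replace the single predictor with features selected from other tests and time periods; that extension is not shown here.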
Quantitative prediction of future lab abnormalities. To further support the utility of personalized lab test models, Inventors adapted the above-described experimental regression models for predicting future out-of-norm test results 850 (see 802 of FIG. 8). Inventors applied this predictive approach to the population of generally healthy individuals with a within-norm current lab test, aiming to predict whether an individual will present with an out-of-norm (high or low, depending on the test) lab test in two to three years. Inventors estimated model performance by calculating the relative risk (RR) of presenting with an out-of-range lab test given positive model classification (see 804 of FIG. 8), while controlling for possible testing biases (see Methods). This analysis confirmed the utility of using multivariate and/or multi-time models for identifying potential patient deterioration in multiple physiological indices (e.g., ALT/AST/GGT, lipids and more). It shows that the power for predicting future abnormalities can exceed the power inferred for regressing within-norm levels. This suggests that quantitative modelling can guide decisions on future follow-up and direct further testing to individuals with increased deterioration risk.
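The relative risk used above to evaluate the classifier can be written out explicitly. The sketch below computes the RR of a future out-of-norm result for individuals classified positive versus those classified negative, using the standard two-by-two definition. It does not implement the control for testing biases mentioned in the text, and the function name and toy data are hypothetical.

```python
def relative_risk(predicted_positive, became_abnormal):
    """RR = P(abnormal | model positive) / P(abnormal | model negative).

    Both arguments are equal-length sequences of booleans, one entry per
    individual whose current lab test is within the normal range.
    """
    pos_total = pos_events = neg_total = neg_events = 0
    for flagged, abnormal in zip(predicted_positive, became_abnormal):
        if flagged:
            pos_total += 1
            pos_events += int(abnormal)
        else:
            neg_total += 1
            neg_events += int(abnormal)
    if pos_total == 0 or neg_total == 0 or neg_events == 0:
        raise ValueError("relative risk is undefined for this sample")
    return (pos_events / pos_total) / (neg_events / neg_total)

# Toy cohort (hypothetical): 4 flagged individuals, 8 not flagged.
flagged = [True] * 4 + [False] * 8
abnormal_later = [True, True, True, False] + [True] + [False] * 7
rr = relative_risk(flagged, abnormal_later)  # (3/4) / (1/8)
```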

Claims (27)

WHAT IS CLAIMED IS:
1. A computerized method for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprising:
accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab tests results;
selecting at least one value of a medical field stored in the EMR indicating a known pathology;
screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded; and
training a ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the medical field.
2. The method of claim 1, wherein selecting at least one value of the medical field stored in the EMR indicating a known pathology comprises selecting at least one value of the medical field stored in the EMR correlated with a decreased survival of a subset of the plurality of sample individuals over a time interval relative to survival of the plurality of sample individuals.
3. The method of claim 2, wherein screening comprises excluding EMRs matching values of the medical field that are occurring at time stamps later than that of another field reporting a code of an International Classification of Disease (ICD) indicating a severe chronic medical condition maintained over the time interval.
4. The method of claim 3, further comprising excluding from the filtered dataset sample EMR that match at least one issued medication in the EMR associated with a relative risk above a threshold, of a first timestamp of the severe chronic condition appearing within a time range after a timestamp indicating initiation of the at least one issued medication, wherein the at least one issued medication is located at least in a threshold percentage of EMRs of subjects matching the severe chronic condition.
5. The method of claim 1, further comprising:
selecting at least one value of a second medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a time stamp of the at least one value of the second medical field relative to a second time after the timestamp of the at least one value of the second medical field;
wherein screening comprises screening the plurality of sample EMRs using the at least one value of the medical field and a combination including the at least one value of the second medical field and the at least one lab test, to obtain the filtered dataset.
6. The method of claim 5, wherein the at least one value of the second medical field includes an indication of at least one medication issued to the respective sample individual.
7. The method of claim 1, further comprising:
generating a plurality of pairs covering a plurality of combinations, each pair including a combination of a respective historical lab test result and a respective medication;
for each respective pair denoting a respective combination, assigning in association with the respective historical lab test result of the respective combination, an indication of whether the respective historical lab test result was obtained before or after a timestamp indicating initiation of administration of the medication of the respective combination;
analyzing the plurality of pairs to identify a statistically significant difference in historical lab test results before and after the timestamp indicating initiation of issued medication;
identifying a subset of pairs for which a statistically significant difference is identified;
wherein screening comprises screening the plurality of sample EMRs using the at least one value of the medical field and the subset of pairs to obtain the filtered dataset.
8. The method of claim 1, wherein the target range is predicted for a current time interval, and further comprising:
receiving a current value of the at least one lab test for the target individual obtained during the current time interval; and
generating an alert when the current value of the at least one lab test is external to the target range of the at least one lab test of the target individual.
9. The method of claim 8, wherein the prediction indicative of the target range for the at least one lab test of the target individual is within a clinically defined normal range.
10. The method of claim 8, further comprising:
generating instructions for presenting a graphical user interface (GUI) on a display, the GUI including a graph plot of the plurality of historical values of the at least one lab test on a time scale, the target range predicted for the at least one lab test presented as a range on the time scale, the current value of the at least one lab test presented at a same time on the time scale as the target range.
11. The method of claim 8, wherein the target range is within the normal range, and the current value of the at least one lab test is within the normal range.
12. The method of claim 8, wherein the historical lab test results of the filtered dataset are within a clinically defined normal range.
13. The method of claim 8, when the current value of the at least one lab test is external to the target range is selected from a group consisting of:
(i) fasting blood glucose level and/or cholesterol level exceeding the target range, further comprising treating the target individual by at least one member selected from a group consisting of: dietary consultation, physical exercise, medication treatment,
(ii) hemoglobin level lower than the target range, with mean corpuscular volume (MCV) above or below a second target range, further comprising prioritizing colonoscopy, and/or bone marrow biopsy, and/or iron or b12 supplement,
(iii) calcium and/or vitamin D below the target range, further comprising treating the target individual with calcium and/or vitamin D supplements, and
(iv) Creatinine levels out of the target range, further comprising initiating chronic kidney disease consultation.
14. The method of claim 1, wherein the at least one lab test comprises a combination of a plurality of lab tests, wherein the ML model generates a prediction for each of the plurality of lab tests of the combination, in response to an input of the plurality of historical values for the combination of the plurality of lab tests.
15. The method of claim 1, wherein the target range is predicted for a future time interval, and further comprising:
generating an alert when at least a portion of the target range is predicted to be external to a clinically normal range defined for at least one lab test at the future time interval.
16. The method of claim 1, wherein the at least one value of the medical field includes an indication of at least one chronic medical condition diagnosis for the respective sample individual.
17. The method of claim 1, wherein selecting at least one value of the medical field, further comprises identifying that the at least one value of the medical field is continually present in the EMR for at least a time interval greater than a threshold.
18. The method of claim 1, wherein screening further comprises screening the plurality of sample EMR for an indication of pregnancy.
19. The method of claim 1, further comprising identifying sample EMRs storing the at least one value of the medical field during the time interval, and non-storing the at least one value of the medical field during a second time interval,
wherein screening comprises screening the plurality of sample EMRs matching the at least one value of the medical field by excluding historical lab test results obtained during the time interval from the filtered dataset, and retaining in the filtered dataset historical lab test results obtained during the second time interval.
20. The method of claim 1, wherein each respective sample EMR is labelled with an indication of a birth date of the respective sample individual, and each of the plurality of historical lab test results is labelled with an indication of a date on which the at least one lab test was conducted; and further comprising:
computing a respective age-normalized value of the respective sample individual for each of the plurality of historical lab test results according to the birth date of the respective sample individual and date of the respective historical lab test,
wherein training the ML model comprises training the ML model on the filtered dataset for generating the target range for the at least one lab test of the target individual, in response to an input of a plurality of historical age-normalized values of the at least one lab test, and for each of the plurality of historical values.
21. The method of claim 1, wherein the plurality of sample EMRs are selected from a set of sample EMRs matching at least one demographic parameter;
wherein the input of the target EMR into the ML model matches the indication of at least one demographic parameter.
22. The method of claim 1, wherein each of the plurality of historical lab test results is labelled with a timestamp during which the at least one lab test was performed; and
wherein screening further comprises excluding from the filtered dataset sample EMRs having the timestamp within a timestamp range indicating that the at least one lab test was performed outside of normal working hours.
23. A device for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprising:
at least one hardware processor executing a code for:
accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab tests results;
selecting at least one value of a medical field stored in the EMR indicating a known pathology;
screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded; and
training a ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the medical field.
24. A computer program product for training a machine learning (ML) model for predicting a target value range of a laboratory (lab) test for a target individual, comprising a non-transitory medium storing a computer program which, when executed by at least one hardware processor, cause the at least one hardware processor to perform:
accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab tests results;
selecting at least one value of a medical field stored in the EMR indicating a known pathology;
screening the plurality of sample EMR matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value that represent lab tests taken from individuals over time intervals when no known indication of pathology was recorded; and
training a ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the medical field.
25. A computerized method for training a machine learning (ML) model for predicting a target value range of a lab test for a target individual, comprising:
accessing a plurality of sample EMRs of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab tests results for at least one lab test;
selecting at least one value of a medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a time stamp of the at least one value of the medical field relative to a second time after the timestamp of the at least one value of the medical field;
screening the plurality of sample EMRs matching the at least one value of the medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value and includes a sub-set of the plurality of sample EMR non-matching the at least one value; and
training a ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test that excludes the at least one value of the medical field.
26. The method of claim 25, wherein the at least one value of the medical field includes an indication of at least one medication administered to the respective sample individual.
27. A method for training a machine learning (ML) model for predicting a target value range of a lab test for a target individual, comprising:
accessing a plurality of sample electronic medical records (EMR) of a plurality of sample individuals, each respective sample EMR of a respective sample individual including a plurality of historical lab tests results for at least one lab test;
selecting at least one of: (i) at least one value of a first medical field stored in the EMR correlated with decreased survival of a subset of the plurality of sample individuals over a time interval, and (ii) selecting at least one value of a second medical field stored in the EMR correlated with a statistically significant change of the at least one lab test at a first time before a time stamp of the at least one value of the medical field relative to a second time after the timestamp of the at least one value of the medical field;
screening the plurality of sample EMR matching the at least one value of the first medical field and/or second medical field to obtain a filtered dataset that excludes the plurality of sample EMR matching the at least one value of the first medical field and/or second medical field and includes a sub-set of the plurality of sample EMR non-matching the at least one value of the first medical field and/or second medical field; and
training a ML model on the filtered dataset for generating a prediction indicative of a target range for the at least one lab test of a target individual, in response to an input of a target EMR of the target individual including a plurality of historical values of the at least one lab test and that excludes the at least one value of the first medical field and/or the second medical field.

Roy S. Melzer, Adv.
Patent Attorney
G.E. Ehrlich (1995) Ltd.
11 Menachem Begin Road
5268104 Ramat Gan
IL280496A 2021-01-28 2021-01-28 Machine learning models for predicting laboratory test results IL280496A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
IL280496A IL280496A (en) 2021-01-28 2021-01-28 Machine learning models for predicting laboratory test results
PCT/IL2022/050115 WO2022162660A1 (en) 2021-01-28 2022-01-26 Machine learning models for predicting laboratory test results

Publications (1)

Publication Number Publication Date
IL280496A (en) 2022-08-01

Family

ID=80447213

Country Status (2)

Country Link
IL (1) IL280496A (en)
WO (1) WO2022162660A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283686A1 (en) * 2015-03-23 2016-09-29 International Business Machines Corporation Identifying And Ranking Individual-Level Risk Factors Using Personalized Predictive Models
US20200395129A1 (en) * 2017-08-15 2020-12-17 Medial Research Ltd. Systems and methods for identification of clinically similar individuals, and interpretations to a target individual
US10902953B2 (en) * 2013-10-08 2021-01-26 COTA, Inc. Clinical outcome tracking and analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117747100A (en) * 2023-12-11 2024-03-22 南方医科大学南方医院 System for predicting occurrence risk of obstructive sleep apnea
CN117747100B (en) * 2023-12-11 2024-05-14 南方医科大学南方医院 System for predicting occurrence risk of obstructive sleep apnea

Also Published As

Publication number Publication date
WO2022162660A1 (en) 2022-08-04
