US20180285526A1 - System and method for phenotype vector manipulation of medical data - Google Patents
- Publication number
- US20180285526A1 (application US15/478,282)
- Authority
- US
- United States
- Prior art keywords
- phenotype
- patient
- cohort
- vector
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/20—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
- G06F19/324—
- G06F19/322—
- G06F19/363—
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Definitions
- Embodiments of the present invention generally relate to observational testing, and, in particular, to a system and method for post-selection variable construction for data in an observational test.
- Observational studies are an important category of study designs. For some kinds of investigative questions (e.g., related to plastic surgery), randomized controlled trials may not always be indicated or ethical to conduct. Instead, observational studies may be the next best method to address these types of questions. Well-designed observational studies may provide results similar to randomized controlled trials, challenging the belief that observational studies are second-rate. Cohort studies and case-control studies are two primary types of observational studies that aid in evaluating associations between diseases and exposures.
- RCTs: randomized controlled trials
- EBM: evidence-based medicine
- RCT methodology, which was first developed for drug trials, can be difficult to conduct for some investigations (e.g., surgical cases).
- Well-designed observational studies, recognized as level II or III evidence, can play an important role in deriving evidence for such investigations.
- Results from observational studies are often criticized for being vulnerable to influences by unpredictable confounding factors.
- Comparable results between observational studies and RCTs are achievable. Observational studies can also complement RCTs in hypothesis generation, establishing questions for future RCTs, and defining clinical conditions.
- Observational studies fall under the category of analytic study designs, which are further sub-classified as observational or experimental study designs.
- The goal of analytic studies is to identify and evaluate causes or risk factors of diseases or health-related events.
- The differentiating characteristic between observational and experimental study designs is that in the latter, the presence or absence of undergoing an intervention defines the groups.
- In an observational study, the investigator does not intervene and instead simply “observes” and assesses the strength of the relationship between an exposure and a disease variable.
- Three types of observational studies include cohort studies, case-control studies, and cross-sectional studies. Case-control and cohort studies offer specific advantages by measuring disease occurrence and its association with an exposure by offering a temporal dimension (i.e., prospective or retrospective study design).
- Cross-sectional studies, also known as prevalence studies, examine the data on disease and exposure at one particular time point. Because the temporal relationship between disease occurrence and exposure cannot be established, cross-sectional studies cannot assess the cause-and-effect relationship.
- The term “cohort” is used in epidemiology to define a set of people followed over a period of time.
- A cohort refers to a group of people with defined characteristics who are followed up to determine incidence of, or mortality from, some specific disease, all causes of death, or some other outcome.
- A well-designed cohort study can provide powerful results.
- In a cohort study, an outcome-free or disease-free study population is first identified by the exposure or event of interest and is then followed in time until the disease or outcome of interest occurs. Because exposure is identified before the outcome, cohort studies have a temporal framework to assess causality and thus have the potential to provide the strongest scientific evidence.
- A cohort study is particularly advantageous for examining rare exposures because subjects are selected by their exposure status, and rates of disease may be calculated in exposed and unexposed individuals over time (e.g., incidence, relative risk). Additionally, an investigator can examine multiple outcomes simultaneously. However, the cohort study may be susceptible to selection bias.
- A cohort study may need to be large, particularly when studying rare exposures. The large sample size and potentially long follow-up duration can make the study design a costly endeavor.
- Cohort studies may be prospective or retrospective. Prospective studies are carried out from the present time into the future. Because prospective studies are designed with specific data collection methods, they have the advantage of being tailored to collect specific exposure data, and the data may be more complete. A disadvantage of a prospective cohort study is the long follow-up period while waiting for events or diseases to occur. Thus, this study design is inefficient for investigating diseases with long latency periods and is vulnerable to a high loss to follow-up rate.
- Retrospective cohort studies are better indicated when a timely and inexpensive study design is needed.
- Retrospective cohort studies, also known as historical cohort studies, are carried out at the present time and look to the past to examine medical events or outcomes.
- A cohort of subjects, selected based on exposure status, is chosen at the present time, and outcome data (i.e., disease status, event status), which were measured in the past, are reconstructed for analysis.
- An advantage of the retrospective study design analysis is the immediate access to the data.
- The study design is comparatively less costly and shorter than that of prospective cohort studies.
- Disadvantages of the retrospective study design include the limited control the investigator has over data collection. The existing data may be incomplete, inaccurate, or inconsistently measured between subjects, for example, by not being uniformly recorded for all subjects.
- A cohort study defines the selected group of subjects by predetermined criteria (e.g., exposure to a substance, or having a particular medical condition, etc.) at the start of the investigation.
- A critical characteristic of subject selection is that both the exposed and unexposed groups are selected from the same source population.
- Subjects who are not at risk for developing the outcome should be excluded from the study.
- The source population is determined by practical considerations, such as sampling. Subjects may be sampled from a hospital, from the members of a community, or from a doctor's individual practice. A subset of these subjects will be eligible for the study.
- Multiple variables describe a person, e.g., age, gender, body mass index (BMI), whether or not the patient has diabetes, etc.
- The multiple variables effectively describe criteria that are used as inputs to analysis processes to establish assertions about the statistical nature of the patients in a cohort study.
- The multiple variables may be represented as a patient vector, which describes the patient's various medical, geographical and demographic variables.
- The variables are generally produced from the raw data of the previously described population, and are often created using covariates.
- A problem with this scenario is that the patient cohort definition and the output patient vector are produced in very different ways. Both require a deep understanding of the underlying data and of how to construct clinical criteria in that data, both for data selection and for analytical variable creation. This requires full, unfettered access to the data to produce the necessary criteria. This activity would normally be undertaken using scripts and code on a per-study, per-data-set basis.
- Embodiments in accordance with the present disclosure define phenotypes as a fundamental atomic building block that enables both data subset creation and vector creation, with phenotype vectors being the primary raw material of EMR-based data science.
- Embodiments provide a systematic process to determine the most significant factors that can be used to approximate a patient population group.
- Embodiments in accordance with the present disclosure provide a cohort definition and selection system for a computer having a memory, a central processing unit and a display, the system including: a cohort definition module to configure the memory according to a phenotype vector.
- The phenotype vector includes a patient ID to uniquely associate the phenotype vector with a patient; a plurality of demographic dimension fields, each describing a respective demographic aspect of the patient; a plurality of calculated dimension fields to describe calculated information related to the patient; and a plurality of potentially recursively defined phenotype-based dimension fields, each indicating the relevance of the respective phenotype-based dimension field to the patient.
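- The field layout described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `PhenotypeVector`, the helper `bmi`, the field names and the sample values are all assumptions made here, not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Number = Union[int, float]


@dataclass
class PhenotypeVector:
    """Hypothetical in-memory layout of one patient's phenotype vector."""
    patient_id: str                                                # ties the vector to one patient
    demographic: Dict[str, Number] = field(default_factory=dict)  # e.g. age, coded sex
    calculated: Dict[str, Number] = field(default_factory=dict)   # e.g. BMI
    phenotype: Dict[str, bool] = field(default_factory=dict)      # relevance flags per phenotype

    def as_list(self, order: List[str]) -> List[float]:
        """Flatten the named dimensions into a numeric list in the given order."""
        merged = {
            **self.demographic,
            **self.calculated,
            **{name: float(flag) for name, flag in self.phenotype.items()},
        }
        return [float(merged[name]) for name in order]


def bmi(weight_kg: float, height_m: float) -> float:
    """Example of a calculated dimension: body mass index."""
    return weight_kg / (height_m ** 2)


# Illustrative patient; every value here is made up.
vec = PhenotypeVector(
    patient_id="P001",
    demographic={"age": 54, "sex": 1},
    calculated={"bmi": bmi(81.0, 1.80)},
    phenotype={"diabetes": True, "asthma": False},
)
print(vec.as_list(["age", "sex", "bmi", "diabetes", "asthma"]))
```

- Keeping the three groups of dimensions separate, and flattening them only on request, is one possible design choice; it lets the same record feed different analyses that need different dimension orders.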
- FIGS. 1A, 1B, 1C illustrate vector representations of patient data
- FIG. 2 illustrates an exemplary format for patient data as a phenotype, in accordance with an embodiment of the present disclosure
- FIG. 3 illustrates an exemplary recursive phenotype definition specific to diabetes in accordance with an embodiment of the present disclosure
- FIG. 4 depicts at a high level of abstraction a system in accordance with an embodiment of the present disclosure
- FIG. 5 illustrates a process flow in accordance with an embodiment of the present disclosure
- FIG. 6 illustrates a process for a second stage of processing, in accordance with an embodiment of the present disclosure
- FIG. 7 illustrates components of a computing terminal, in accordance with an embodiment of the present disclosure
- FIG. 8A illustrates a simplified set of EMR records for persons and events as known in the art
- FIG. 8B illustrates an example of the output using methods known in the art
- FIG. 9A illustrates a vector-based patient definition, in accordance with an embodiment of the present disclosure.
- FIG. 9B illustrates an example of an output, in accordance with the present disclosure.
- The disclosure will be illustrated below in conjunction with an exemplary digital information system. Although well suited for use with, e.g., a system using a server(s) and/or database(s), the disclosure is not limited to use with any particular type of system or configuration of system elements. Those skilled in the art will recognize that the disclosed techniques may be used in any system or process in which multi-dimensional criteria are used to make an imperfect matching selection from among an available population that shares at least some of these criteria.
- The term “module” refers generally to a logical sequence or association of steps, processes or components.
- A software module may comprise a set of associated routines or subroutines within a computer program.
- A module may comprise a substantially self-contained hardware device.
- A module may also comprise a logical set of processes irrespective of any software or hardware implementation.
- A module that performs a function also may be referred to as being configured to perform the function, e.g., a data module that receives data also may be described as being configured to receive data.
- Configuration to perform a function may include, for example: providing and executing sets of computer code in a processor that performs the function; providing provisionable configuration parameters that control, limit, enable or disable capabilities of the module (e.g., setting a flag, setting permissions, setting threshold levels used at decision points, etc.); providing or removing a physical connection, such as a jumper to select an option, or to enable/disable an option; attaching a physical communication link; enabling a wireless communication link; providing electrical circuitry that is designed to perform the function without use of a processor, such as by use of discrete components and/or non-CPU integrated circuits; setting a value of an adjustable component (e.g., a tunable resistance or capacitance, etc.), energizing a circuit that performs the function (e.g., providing
- The term “transmitter” may generally comprise any device, circuit, or apparatus capable of transmitting a signal.
- The term “receiver” may generally comprise any device, circuit, or apparatus capable of receiving a signal.
- The term “transceiver” may generally comprise any device, circuit, or apparatus capable of transmitting and receiving a signal.
- The term “signal” may include one or more of an electrical signal, a radio signal, an optical signal, an acoustic signal, and so forth.
- Non-volatile media includes, for example, NVRAM, or magnetic or optical disks.
- Volatile media includes dynamic memory, such as main memory.
- Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, magneto-optical medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, solid state medium like a memory card, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
- A digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium.
- When the computer-readable media is configured as a database, the database may be any type of database, such as relational, hierarchical, object-oriented, and/or the like. Accordingly, the disclosure is considered to include a tangible storage medium or distribution medium and prior art-recognized equivalents and successor media, in which the software implementations of the present disclosure are stored.
- large-scale routine healthcare databases are amassed and maintained based upon data gathered by healthcare providers and healthcare insurers.
- Healthcare records may include information related to routine health care such as a yearly checkup, regularly scheduled pap smears or mammograms, or visits for acute but relatively minor problems such as an infection, stitches, or a broken bone.
- Healthcare records may also include information related to non-routine care such as emergency room visits, hospital admissions, or other serious healthcare events.
- the healthcare records may document the progress over time of chronic conditions such as cholesterol levels, high blood pressure, and the like.
- the healthcare records may also include demographic information such as age, ethnicity, height, weight, and so forth.
- such healthcare records may include sources such as the Clinical Practice Research Datalink (CPRD) primary care database (GOLD), the hospital episode statistics (HES) and the Office for National Statistics (ONS) mortality data.
- CPRD: Clinical Practice Research Datalink
- GOLD: CPRD primary care database
- HES: hospital episode statistics
- ONS: Office for National Statistics
- The CPRD, established (initially as GPRD) in the UK in 1987, is a medical records database that general practitioners (GPs) use as the primary means of tracking patient clinical information.
- The total population in the CPRD exceeds nine million patients with over 35 million person-years of follow-up between 1987 and 2002.
- The CPRD, which contains information on diagnoses and medications, was established with the intent of allowing researchers to conduct high quality epidemiologic studies and has been used in more than 200 peer-reviewed publications. All information is recorded by the GP or a member of the office staff as part of the patient's medical record.
- GPs are trained in data entry and their data are reviewed by administrators at the CPRD to ensure that they are of sufficient quality for research studies.
- Defining the relevant population under study is an important first step.
- There may be more than one relevant population, for example, a first population that has developed a particular condition, and a second population that has not developed the particular condition as of the time of selection.
- The selection criteria form an important part of protocols (i.e., population criteria and analysis plan) used for clinical trial and health outcomes studies.
- Patient data forms the basis of many statistical analysis techniques.
- A problem of the background art is that patient data is seldom available in vectorized form without significant data manipulation.
- Patient data is typically transactional and time-based (i.e., “longitudinal”).
- Patient data primarily includes two classes of data, i.e., “people” and “events”.
- People data refers to the patient or enrollee (e.g., a spouse for spousal insurance coverage) and to enrollment-related data (e.g., dates of coverage, exclusions, deductibles, employer, etc.).
- Event data refers to things that happen to patients, e.g., diagnoses, therapies, procedures, etc.
- Data manipulation tends to cover two primary activities. First, it may refer to cutting a subset of data from source databases that is relevant to a study being undertaken. Second, it may refer to creating a research-ready data format for that data, i.e., a vector-based data format that can be used as input to the processes and calculations of data science.
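- The second activity, turning longitudinal “people” and “event” records into one vector per patient, can be sketched as follows. The record layout, the event codes and the function name `vectorize` are hypothetical and chosen only for illustration; real source databases will differ.

```python
from datetime import date

# Hypothetical "people" records: one row per patient or enrollee.
people = [
    {"patient_id": "P001", "birth_year": 1960},
    {"patient_id": "P002", "birth_year": 1985},
]

# Hypothetical "event" records: things that happen to patients over time.
events = [
    {"patient_id": "P001", "date": date(2015, 3, 1), "code": "diabetes_dx"},
    {"patient_id": "P001", "date": date(2015, 4, 9), "code": "metformin_rx"},
    {"patient_id": "P002", "date": date(2016, 1, 20), "code": "asthma_dx"},
]


def vectorize(people, events, codes, index_date):
    """Build one vector per patient: [age, flag per event code].

    A flag is 1 if the patient has at least one event with that code
    on or before index_date, otherwise 0.
    """
    seen = {}
    for ev in events:
        if ev["date"] <= index_date:
            seen.setdefault(ev["patient_id"], set()).add(ev["code"])

    rows = {}
    for person in people:
        pid = person["patient_id"]
        age = index_date.year - person["birth_year"]
        flags = [1 if code in seen.get(pid, set()) else 0 for code in codes]
        rows[pid] = [age] + flags
    return rows


codes = ["diabetes_dx", "metformin_rx", "asthma_dx"]
vectors = vectorize(people, events, codes, date(2017, 1, 1))
print(vectors)
```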
- Such data wrangling is a low-level, labor-intensive, data-set-specific activity; thus a higher-level, data-set-portable, less labor-intensive method is needed.
- Embodiments improve upon the background art by recognizing that if data science processes are defined with respect to standardized vector formats, then the processes should be portable across different data sets. This positions a vector format as a central, data-set-portable pivot point for data science.
- Valuable data science processes (e.g., cohort matching, regression analysis and clustering, described in FIGS. 1A-1C, respectively) may be applied to the vector formats, which helps make the analysis and processes more portable from one analytic study to another.
- Embodiments in accordance with the present disclosure convert medical data to phenotype vectors.
- Data manipulation and vector production may be largely automated, thus enabling a dramatic increase in data science analytic output, e.g., a four-fold capacity increase may be realized.
- Embodiments help enable portability of data and analytic processes, as opposed to processes tied to a data format that is specific only to a predetermined database.
- High-level tools for phenotype vector production have the potential to drive significant gains in output and productivity of data analysis.
- Data with multiple attributes may be represented as a vector in a multidimensional space, with each dimension of the vector representing one attribute, and taking on values within an allowable range of values for the attribute.
- The vector, at least in two or three dimensions, may be represented as an arrow, with the component along each axis corresponding to the sign and magnitude of the corresponding dimension of the vector.
- Equation (1) represents the Euclidean geometric distance between the vectors. More generally, different metric functions may be used to define the distance between the vectors, taking into account the statistical properties of each dimension, for example. Most generally, the distance between X and Y is given by a metric function M, as in Equation (2).
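- Equations (1) and (2) are not reproduced in this text. As a hedged reconstruction, assuming n-dimensional vectors X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) and a distance symbol d (notation chosen here for illustration), the standard Euclidean distance and the general metric form described above can be written as:

```latex
\begin{align}
d(X, Y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} \\
d(X, Y) &= M(X, Y) \tag{2}
\end{align}
```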
- A weighting function may be applied to the difference in each dimension, and to the overall sum. Weighting functions may be useful if the respective vector dimensions have unequal importance for the purpose of patient selection.
- A distance function (i.e., an error function) may be defined as in Equation (3).
- G( ), H( ) and I( ) may be, e.g., triangle functions, exponential decay functions, step functions, etc.
- Function F( ) may be, e.g., a summation function, a multiplication, a root, a power, a ratio, or some combination thereof. Different functions may be used for some dimensions compared to other dimensions, e.g., in order to give unequal weight to different dimensions.
- Equation (3) may be extended to additional dimensions by use of additional weighting functions.
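- Equation (3) is not reproduced in this text. The sketch below assumes, for illustration only, that it has the form F(G(x1 - y1), H(x2 - y2), I(x3 - y3)), with F taken to be a plain sum. The three penalty shapes (`capped_linear`, `step`, `saturating_exp`) and the sample vectors are hypothetical stand-ins for the triangle, step and exponential functions mentioned above.

```python
import math


def step(tolerance):
    """Penalty of 0 inside the tolerance band and 1 outside it."""
    return lambda diff: 0.0 if abs(diff) <= tolerance else 1.0


def saturating_exp(scale):
    """Penalty rising from 0 toward 1 (one minus an exponential decay)."""
    return lambda diff: 1.0 - math.exp(-abs(diff) / scale)


def capped_linear(width):
    """Penalty growing linearly to 1 at `width` (an inverted triangle)."""
    return lambda diff: min(abs(diff) / width, 1.0)


def weighted_distance(x, y, weights, combine=sum):
    """Apply one weighting function per dimension, then combine with F (`combine`)."""
    return combine(w(a - b) for w, a, b in zip(weights, x, y))


# Dimensions: age (years), coded sex, BMI. All values are made up.
weights = [capped_linear(10.0), step(0.0), saturating_exp(5.0)]
exposed = [54, 1, 27.0]
candidate = [59, 1, 27.0]

d = weighted_distance(exposed, candidate, weights)
print(d)
```

- Extending the sketch to more dimensions only requires appending further weighting functions to `weights`, which mirrors the statement that Equation (3) may be extended by use of additional weighting functions.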
- A distance metric may include one or more of a Mahalanobis distance metric and a joint weighting function of more than one dimension.
- A joint function may be useful if, e.g., some dimensions are cross-correlated. For example, separate dimensions for patient weight and patient BMI may be expected to be cross-correlated.
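- As a minimal sketch of how a Mahalanobis distance accounts for cross-correlated dimensions, the two-dimensional case can be written out by hand. The function name `mahalanobis_2d` and the covariance values are assumptions for illustration; they are not estimated from any real weight or BMI data.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-D vectors.

    cov is a symmetric 2x2 covariance matrix [[a, b], [b, c]].
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # d^T * inverse(cov) * d, with the 2x2 inverse written out explicitly.
    q = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Made-up covariance for two positively correlated, standardized dimensions.
cov = [[4.0, 1.8], [1.8, 1.0]]

# A difference that follows the correlation counts for less than one that opposes it.
along = mahalanobis_2d([2.0, 1.0], [0.0, 0.0], cov)
against = mahalanobis_2d([2.0, -1.0], [0.0, 0.0], cov)
print(along, against)
```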
- Equations (1)-(3) are useful because all data science is grounded in some underlying formal mathematical theory, and that mathematical theory is almost entirely vector based.
- Patient characteristics may be represented as a multi-dimensional vector.
- Patient characteristics may include sociodemographic factors (e.g., age, sex, place of residence, etc.), clinical factors (e.g., comorbidities, medical history, genetic history, blood type, medications used in the week prior to presentation, functional status, immunization history, smoking status, drinking status, etc.), and laboratory data. Dozens of characteristics may be relevant or possibly relevant. Relevancy may be dependent upon the type of study and/or objective of the study, and may be informed by existing medical knowledge. For example, patient weight may be more relevant to a diabetes study than patient eye characteristics, but patient eye characteristics may have more relevance to a study of eye disease. In this case, the selection criteria may give greater weight to characteristics relevant to an objective of selecting the cohort.
- FIG 1A illustrates three patient vectors (i.e., e 1 , e 2 , e 3 ) for an exposed cohort, compared to three nearest possible matches in a matched cohort (i.e., m 1 , m 2 , m 3 ).
- Matched patients are found by looking for patient vectors which are nearest in space to each patient in the exposed group, with “nearest” being calculated by a relationship such as one of Equations (1)-(3).
- Additional techniques may be used for cohort matching, e.g., propensity score matching, principal component analysis, coarsened exact matching, and so forth. Even though these techniques differ in the metric they use to measure the distance between two patient vectors, they all still use vectors as their input.
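The nearest-vector matching described for FIG. 1A can be sketched as a greedy one-to-one match. This is an illustration only: the greedy strategy, the plain Euclidean metric, and the names `match_cohort` and `euclidean` are assumptions, not the method claimed in the disclosure.

```python
import math

def euclidean(a, b):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_cohort(exposed, candidates, distance=euclidean):
    """Greedy 1:1 matching: each exposed patient takes the nearest unused candidate."""
    available = dict(candidates)  # copy so the caller's dict is untouched
    matches = {}
    for pid, vec in exposed.items():
        if not available:
            break  # ran out of candidates
        best = min(available, key=lambda cid: distance(vec, available[cid]))
        matches[pid] = best
        del available[best]
    return matches
```

Passing a different `distance` callable swaps in another metric without changing the matching loop, which reflects the point that the techniques differ in metric but share vectors as input.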
- FIG. 1B illustrates usage of a regression analysis technique usable in predictive or interpolative analytics.
- p1, p2 and p3 are three patient vectors in a training data set
- p4 may represent a predicted trend given the input data.
- a mathematical model e.g., a linear trend, a polynomial trend, a seasonally adjusted trend, etc., that may be formulated with knowledge of the underlying causes of the trend
- a best fit mathematical equation may be found for the points in space represented by the input vectors and the output to be predicted. These equations then are used to predict the output or outcome for arbitrary patients as represented by their input vectors.
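To make the best-fit idea concrete, here is a small sketch of an ordinary least-squares fit of a linear trend to one input dimension. The choice of a linear model and the names `fit_line` and `predict` are assumptions for illustration; the text also allows polynomial or seasonally adjusted trends.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Evaluate the fitted line at a new input value."""
    slope, intercept = model
    return slope * x + intercept
```

The fitted model plays the role of the equation found from the training vectors, and `predict` applies it to an arbitrary new patient input.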
- FIG. 1C illustrates an application of clustering processes, which may be used regularly in predictive analytics to predict potential markets for products.
- p1, p2 and p3 may represent a first market sector or cluster of similar subjects
- p4, p5 and p6 may represent a second market sector or cluster of similar subjects.
- a distance between points in space represented by vectors is used to identify close neighbors and hence generate clusters of subjects that may be regarded as ‘alike’.
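One simple way to turn "close neighbors" into clusters is a distance threshold with single-linkage merging, sketched below. The threshold rule and the name `cluster_by_distance` are assumptions chosen for brevity; the disclosure does not name a specific clustering algorithm.

```python
import math

def cluster_by_distance(points, threshold):
    """Single-linkage grouping: a point joins (and merges) every cluster
    that already holds a member within `threshold` of it."""
    clusters = []
    for name, vec in points.items():
        near = [c for c in clusters
                if any(math.dist(vec, points[m]) <= threshold for m in c)]
        merged = {name}
        for c in near:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters
```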
- each patient characteristic over a population of patients may be expressed as a statistic that represents the population as a whole.
- the statistic may be in a form such as a histogram, a series of numeric ranges (e.g., 40-50 years old; 50-60 years old; 150-160 lbs; 160-170 lbs; etc), a series of qualitative ranges (e.g., non-drinker vs. social drinker vs. heavy drinker, etc.), and so forth.
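As a hedged illustration of expressing a characteristic as numeric ranges, the sketch below buckets integer values into fixed-width ranges and counts them as a histogram. The half-open convention (a value of 50 falls in "50-60", not "40-50") and the names `bucket_label` and `histogram` are assumptions.

```python
from collections import Counter

def bucket_label(value, width):
    """Label the half-open range [low, low + width) that contains an integer value."""
    low = (value // width) * width
    return f"{low}-{low + width}"

def histogram(values, width):
    """Count how many values fall into each fixed-width range."""
    return Counter(bucket_label(v, width) for v in values)
```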
- Other mathematical representations of the multi-dimensional vector may be possible.
- Patient characteristics may not be independent of each other, e.g., selection of a female gender characteristic may result in a smaller and lighter population of patients compared to a selection of a male gender characteristic.
- the data is complex and highly dimensional. Researchers have to make assumptions, based upon science, intuition or other data analysis, that involve structure that is believed to exist in the data but that cannot be observed directly. The data sets are large and growing with a never-ending stream of new data.
- ICD-10 International Statistical Classification of Diseases and Related Health Problems
- WHO World Health Organization
- ICD-10 codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases.
- the code set allows more than 14,400 different codes and permits the tracking of many new diagnoses.
- the codes can be expanded to over 16,000 codes by using optional sub-classifications.
- the detail reported by ICD can be further increased, with a simplified multi-axial approach, by using codes meant to be reported in a separate data field.
- Another population coding system is the Read code, which is the standard clinical terminology system used in General Practice in the United Kingdom (UK).
- Read codes support detailed clinical encoding of multiple patient phenomena including: occupation; social circumstances; ethnicity and religion; clinical signs, symptoms and observations; laboratory tests and results; diagnoses; diagnostic, therapeutic or surgical procedures performed; and a variety of administrative items (e.g. whether a screening recall has been sent and by what communication modality, or whether an item of service fee has been claimed). It therefore includes but goes significantly beyond the expressivity of a diagnosis coding system.
- synthesis of population selection rules also must be performed manually by such an expert. Synthesis is a process of reducing potentially hundreds of patient population codes to a much smaller set of medical factors, the factors being referred to as inclusion factors or exclusion factors. For example, for a predetermined asthma population (e.g., patients that were initially diagnosed between 12 and 17 years of age) a medical researcher may decide to look at only patients who were treated with either of two drugs: inhaled corticosteroids (ICS) or fluticasone (i.e., an example of an inclusion criterion). Each of those drugs will have a specific code, which is usually less recognizable to medical researchers than the drug name itself.
- a medical researcher may also set another rule to study only patients who were treated in a primary care setting.
- a rule to narrow a study only to patients who were treated in a primary care setting may not be significant because virtually all asthma patients are treated in a primary care setting and thus fails to narrow the population much in practice.
- Manual synthesis may fail to recognize that such a rule is not significantly meaningful.
- manual synthesis may include such a criterion whereas an automated method may recognize that the criterion is not significantly meaningful and thus would not include the criterion in a summary.
- this is tedious to construct and difficult to tweak as a desired analytic inquiry changes.
- Embodiments in accordance with the present disclosure provide building blocks that may be useful to construct a patient vector to describe each respective patient, and to use the patient vectors to identify patient cohorts for further study.
- Embodiments may leverage an advantage that arises from having a common vector format used by multiple scientific groups. Embodiments will speed up the research process, allowing a deeper understanding of the methods applied to the common vector format, and allow patient descriptions to be transferred easily between individuals and computer systems.
- Embodiments build, extract, and store a common phenotype vector that is based on multiple patient medical databases, is reusable across multiple projects or studies, and is formatted in a way that isolates users from the underlying data.
- Embodiments in accordance with the present disclosure address a problem of vectorizing patient data by creating a framework to define the vector forms, and a system to convert old data to the vector form, and/or enforce the vector form for new data.
- Phenotypes and phenotype vectors are a useful paradigm to create or reformat vectorized patient data, to define the dimensions of those vectors in a portable manner, and to perform data science on patient data.
- a phenotype may be defined as a set of observable characteristics of an individual resulting from the interaction of its genotype with the environment. Embodiments provide a specific implementation that enables rapid, generalized phenotype-vector production from EMR databases. More generally, a phenotype may be defined as an arbitrary Boolean combination of demographic information, code lists, or lists of values representing conditions, drugs, observations, procedures etc. Each code or value list may include some absolute or relative time (i.e., temporal) constraints, and we may additionally specify time relationships between individual lists, e.g., people who have a severe asthma diagnosis after being diagnosed with ADHD.
- FIG. 2 illustrates an exemplary format for patient data as phenotype 200 , in accordance with an embodiment of the present disclosure.
- Phenotype 200 associates a patient ID field 201 with several categories of patient-related data, such as demographic dimensions 203 , calculated dimensions 205 , and phenotype-based dimensions 207 .
- an association may include whether (either presently or in the past) the patient has been diagnosed with a predetermined condition, or whether the patient has ever been subject to a predetermined medical procedure, or whether the patient suffers from a predetermined disease or condition.
- Exemplary binary fields may include attention deficit hyperactivity disorder (ADHD) field 271 , procedure “X” field 272 , Asthma field 273 , therapy “Y” field 274 , and so forth.
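The layout described for FIG. 2 can be pictured as a small record type. The sketch below is an assumption about one possible in-memory shape; the class name `PhenotypeVector`, its field names, and the `as_row` helper are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class PhenotypeVector:
    patient_id: str                                    # cf. patient ID field 201
    demographic: dict = field(default_factory=dict)    # cf. demographic dimensions 203
    calculated: dict = field(default_factory=dict)     # cf. calculated dimensions 205
    phenotype: dict = field(default_factory=dict)      # cf. phenotype-based dimensions 207

    def as_row(self, columns):
        """Flatten into a fixed column order; any column absent for this
        patient defaults to 0."""
        merged = {**self.demographic, **self.calculated, **self.phenotype}
        return [merged.get(c, 0) for c in columns]
```

Fixing the column order outside the record is what makes rows from different patients comparable as vectors.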
- a general definition of a phenotype may be expressed in regular expression form as shown below:
- time-bound may include a specification that certain conditions or constraints apply (or do not apply) only over a limited period of time, or only before a predetermined date or event (an event including, e.g., a procedure or an observation), or only after a predetermined date or event, or only in a predetermined sequence (e.g., that a first procedure or observation occurs only before a second procedure or observation, and not after or at the same time as the second procedure or observation), and so forth.
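A time-bound constraint of the "only after a predetermined event" kind can be sketched as a check over dated events. The event representation (a list of `(code, date)` pairs), the code strings, and the function names are assumptions for illustration.

```python
from datetime import date

def first_date(events, code):
    """Earliest date on which `code` was recorded, or None if never recorded."""
    dates = [d for c, d in events if c == code]
    return min(dates) if dates else None

def occurs_after(events, later_code, earlier_code):
    """True if some `later_code` event falls strictly after the first
    `earlier_code` event."""
    start = first_date(events, earlier_code)
    if start is None:
        return False
    return any(c == later_code and d > start for c, d in events)
```

For example, `occurs_after(events, "SEVERE_ASTHMA", "ADHD")` would express "a severe asthma diagnosis after being diagnosed with ADHD" for one patient's event list, with both code strings being placeholders.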
- phenotypes provide dimensional definition to enable the conversion of EMR data to vectors. Phenotype vectors then can be used as raw material for EMR-based data analytics.
- Embodiments may include a library of phenotype definitions that provide core templates for both data selection (e.g., though use as inclusion and exclusion criteria) and for vector production (e.g., through use as dimension definitions).
- an initial, very simplistic, view of a phenotype might include a single code list—e.g. “does a patient take metformin”. This might expand to around 1,000 individual different codes, but it is a single phenotype, that will be represented eventually as a single dimension in a phenotype vector for the patient, indicating their metformin usage.
- a top-level phenotype may include a field or code to indicate that a patient suffers from diabetes, and a pointer to a diabetes child phenotype.
- FIG. 3 illustrates an exemplary recursive phenotype definition 300 specific to diabetes.
- Definition 300 may be described in a Boolean sense as shown below in Equation (3)
- Polycystic Ovary Syndrome 207 itself may be another phenotype, with subfields 217 a, 217 b, 217 c.
- the diabetes phenotype may provide an expansion of a diabetic condition (e.g., type 1, type 2, gestational, whether the patient is taking insulin, A1C level, etc.), and a pointer to a further recursed child phenotype such as a Type 2 phenotype.
- the Type 2 child phenotype in turn may provide an expansion of the type 2 condition, e.g., the presence or absence of relevant genetic conditions such as genetic defects of β-cell function, genetic defects in insulin processing or insulin action, exocrine pancreatic defects, endocrinopathies, infections, prescribed drugs, and so forth. This recursion may be repeated indefinitely.
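The parent-to-child recursion can be sketched as a tree of phenotype nodes whose code lists are expanded recursively. The class name `Phenotype`, its methods, and the placeholder code strings are hypothetical; real definitions would carry actual clinical codes and richer criteria.

```python
from dataclasses import dataclass, field

@dataclass
class Phenotype:
    name: str
    codes: frozenset = frozenset()                 # codes defined directly on this node
    children: list = field(default_factory=list)   # child phenotypes (recursive)

    def all_codes(self):
        """Union of this node's codes and those of every descendant."""
        result = set(self.codes)
        for child in self.children:
            result |= child.all_codes()
        return result

    def matches(self, patient_codes):
        """True if the patient has at least one code from this phenotype tree."""
        return bool(self.all_codes() & set(patient_codes))
```

A top-level diabetes node holding a Type 2 child would then match a patient who carries only a Type 2 code, while the Type 2 node alone would not match a patient carrying only the parent's code.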
- FIG. 4 depicts at a high level of abstraction a system 400 that may be used in the definition and analysis of cohorts using phenotype vectors, according to an embodiment of the present disclosure.
- the system 400 may include a communication network 408 that is in communication with computing terminal 412 .
- Exemplary types of external communication devices 412 include, without limitation, desktop Personal Computers (PCs), laptops, netbooks, tablets, thin clients, other smart computing devices, and the like that are accessible via a network.
- the communication link may operate by methods or protocols such as Ethernet, Wi-Fi, and so forth.
- the computing power of computing terminal 412 may be used at least in part to manage communications with other portions of system 400 described below.
- the communication network 408 may be packet-switched and/or circuit-switched.
- An exemplary communication network 408 includes, without limitation, a Wide Area Network (WAN), such as the Internet, a Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular communications network, or combinations thereof.
- the communication network 408 is a public network supporting the TCP/IP suite of protocols.
- System 400 may further include server 444 , which is coupled to communication network 408 via transceiver 446 .
- Transceiver 446 may support well-known communication or networking protocols such as Ethernet, Wi-Fi, and so forth.
- Server 444 may be capable of hosting and/or executing one or more application programs 452 (“apps” or “applications”).
- server 444 may provide a phenotype execution engine as one of application programs 452 .
- the phenotype execution engine provides a computing platform that allows data scientists to create and to share phenotype definitions, and then to execute those phenotype definitions against large data sets. By executing the phenotype definitions against large data sets, data scientists are able to: (1) rapidly cut data from databases using phenotypes as inclusion and exclusion criteria; and (2) build patient vectors for the selected data using phenotypes as dimension definitions.
- Server 444 may be a software-controlled system including a processor 454 coupled to a tangible memory 456 .
- Memory 456 may comprise random access memory (RAM), a read-only memory (ROM), or combinations of these and other types of electronic memory devices.
- Memory 456 may be used for various purposes such as to store code (e.g., application programs 452 ) and working memory used by processor 454 .
- Various other server 444 components such as communication interface modules, power management modules, etc. are known by persons of skill in the art of computer design, but are not depicted in FIG. 4 in order to avoid obscuring the main elements of system 400 .
- Server 444 may be coupled to a database 462 , either directly or through communication network 408 as illustrated in FIG. 4 .
- Database 462 may also be separate from server 444 (as illustrated in FIG. 4 ), or be incorporated into server 444 .
- Database 462 may be used to store an available universe of patient data (e.g., the GPRD).
- Database 462 may represent a plurality of physically dispersed databases that are communicatively coupled together.
- The elements of system 400 are shown in FIG. 4 for purposes of illustration only and should not be construed as limiting embodiments of the present invention to any particular arrangement of elements.
- Various other system components such as a gateway, a firewall, etc. are known by persons of skill in the art of computer networking, but are not depicted in FIG. 4 in order to avoid obscuring the main elements of system 400 .
- FIG. 5 illustrates a process flow 500 to use system 400 , in accordance with an embodiment of the present disclosure.
- Process flow 500 would be controlled by a data analyst at computing terminal 412 .
- Data may be read from source EMR databases such as database 462 .
- Database-independent phenotype definitions may be provided by the data analyst, and/or read from a memory such as database 462 .
- the data analyst at computing terminal 412 then may apply or not apply criteria for a study, by way of manipulating phenotype criteria (e.g., inclusion criteria and/or exclusion criteria) for data selection.
- the data analyst may define and apply these criteria by way of a graphical interface at computing terminal 412 to produce a cohort definition.
- FIG. 5 depicts the use of database-independent phenotype definitions to build inclusion and exclusion criteria (i.e., cohort definitions). These cohort definitions can be used across multiple source EMR databases to produce data subsets for subsequent data science-based research. However, at this point the data is not “research ready” because it is not structured as vectors. Instead, the data is still notionally structured in its natural EMR-based format.
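A cohort definition built from inclusion and exclusion criteria can be sketched as predicates applied to each patient record. The record shape (a dict per patient) and the name `select_cohort` are assumptions; this is not the phenotype execution engine itself.

```python
def select_cohort(patients, include, exclude=()):
    """Return IDs of patients who satisfy every inclusion predicate
    and no exclusion predicate."""
    return [
        pid for pid, record in patients.items()
        if all(rule(record) for rule in include)
        and not any(rule(record) for rule in exclude)
    ]
```

Because the predicates only see a generic record, the same cohort definition can in principle be run against subsets drawn from different source databases, provided each is mapped to the same record shape. Note that the result is still a list of patient IDs, not yet a set of vectors.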
- FIG. 6 illustrates a process 600 for a second stage of processing.
- FIG. 6 uses the same library of database-independent phenotype definitions as in process 500 , but this time to define vector dimensions.
- a phenotype engine executing in server 444 e.g., executing as one of application programs 452
- the data subset will be returned as phenotype vectors.
- the data subset then may be stored in a memory (e.g., in a separate portion of database 462 ) for future study.
- the subsets are converted to a “research-ready” vector format and can be used as input to data science routines.
- a set of phenotype vectors meeting the cohort definition and/or processed by the data science routines may be used to identify patients for a cohort study, including identifying patients to recruit if the study is a prospective study.
- FIG. 7 illustrates components of computing terminal 412 .
- computing terminal 412 is a typical desktop or mobile computing device having basic functions.
- Computing terminal 412 has a user input interface 751 for receiving input from a user (e.g., a keyboard, touchscreen and/or microphone), and a user output interface 753 is provided for presenting information visually or audibly to the user.
- Computing terminal 412 also includes memory 755 for storing an operating system that controls the main functionality of computing terminal 412 , along with a number of applications that are run on computing terminal 412 , and data.
- a processor 757 executes the operating system and applications.
- Computing terminal 412 may have a unique hardware identification code that permits identification of computing terminal 412 (e.g., a medium access control (MAC) address).
- a communications interface 759 permits communications with communication network 408 , e.g., by way of an Ethernet or Wi-Fi interface.
- a user may use computing terminal 412 in order to control the practice of embodiments described herein, and to receive and review results of the embodiments.
- FIG. 8A illustrates a simplified set of EMR records for persons and events, in accordance with an embodiment of the present disclosure.
- the simplified set of EMR records are useful to illustrate a process and paradigm for patient vector generation from existing user generated phenotype definitions.
- Embodiments support the production of phenotype definitions to be applied to EMR datasets, either singularly or within a time-based Boolean logic expression engine.
- These phenotype definitions can be codelists, test results and values, demographic details, derived variables and other entities available to embodiments in an EMR dataset, and may recursively include other phenotype definitions.
- each patient is assigned a unique patient key (PK), and the patient key is associated with a number of different characteristics for each patient, such as gender, age, geography, BMI, and so forth.
- each event is associated with one patient through the PK field, and each event is associated with various characteristics such as event type, and optional value fields relevant to the event type. Any one patient may have any number of associated events, including zero associated events.
- FIG. 8A may be interrogated by building simple phenotypes and combining them in Boolean expressions.
- Time-based criteria may be supplied, instead of or in addition to event-based criteria.
- Each card may represent one Boolean condition.
- phenotype_ 3 may be “BMI ⁇ 20”
- An additional phenotype may be constructed as a Boolean combination of the simple phenotypes, e.g., a Boolean AND of the simple phenotypes, or a more complex relationship including other Boolean operators (e.g., OR, XOR) and parenthetical groupings.
- the background art would apply the overall Boolean condition to the patient and event data, and export the result in one of various supported formats, e.g., as a native or single row based view for each patient event.
- This export type may be relatively large, and contain all data regardless of data science needs.
- An example of the output using methods of the background art is illustrated in FIG. 8B .
- embodiments in accordance with the present disclosure may transform data into a vector-based output, by reusing the phenotype definition paradigm and applying the same definition template structure to a population to create a patient vector.
- FIG. 9A illustrates a vector-based patient definition, with each phenotype definition representing one of a Boolean condition, a value field, and a bucketed (i.e., range) value.
- the phenotype definitions can be used to output value-based, value-bucket based or binary (“one shot”) data in the vector.
- FIG. 9B illustrates an example of the output using embodiments in accordance with the present disclosure.
- Output from the new pivoted view of the data becomes a patient vector, a smaller, more focused output for data science, containing only the values that are required for specific observational research on a population.
- FIG. 9B illustrates a new data structure that has been derived from the person EMR data, filtered by the criteria of the phenotype definitions, the new data structure being a set of patient vectors in which each element of a respective patient vector is populated by a value within a range defined for the respective element.
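The pivot from person and event rows to one vector per patient can be sketched as follows. The table shapes (a dict of persons keyed by patient key, a list of event dicts with `pk` and `type` fields), the dimension callables, and the name `build_vectors` are all assumptions for illustration.

```python
def build_vectors(persons, events, dimensions):
    """Pivot person and event rows into one fixed-order vector per patient.

    `dimensions` is an ordered list of (name, function) pairs; each function
    receives the person row and that person's events and returns one element.
    """
    by_pk = {}
    for ev in events:
        by_pk.setdefault(ev["pk"], []).append(ev)
    vectors = {}
    for pk, person in persons.items():
        evs = by_pk.get(pk, [])  # patients may have zero events
        vectors[pk] = [fn(person, evs) for _, fn in dimensions]
    return vectors

# Three kinds of dimension: Boolean, raw value, and bucketed (range) value.
example_dimensions = [
    ("asthma", lambda p, evs: int(any(e["type"] == "asthma" for e in evs))),
    ("bmi", lambda p, evs: p["bmi"]),
    ("age_band", lambda p, evs: (p["age"] // 10) * 10),
]
```

Only the listed dimensions appear in the output, which is what keeps the vector view smaller than a row-per-event export.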
- Specificity of cohort selection may be limited ultimately by the size of the set of matches that is returned. If the criteria are too narrow, not enough matches will be returned to provide a statistically meaningful sample. Options in this case may include reducing the number of criteria, adjusting error functions in one of Equations (1)-(3) to allow greater error between an ideal characteristic and an actual characteristic, eliminating some selected criteria that may be highly correlated with other selected criteria, substituting one criterion for another if the criteria are correlated but one has a larger available population than the other, and so forth.
- Embodiments in accordance with the present disclosure are usable in other fields of study besides cohort definition and selection in medical studies. Embodiments may be useful whenever multi-dimensional criteria are used to make an imperfect matching selection from among an available population that shares at least some of these criteria.
- Embodiments of the present invention include a system having one or more processing units coupled to one or more memories.
- the one or more memories may be configured to store software that, when executed by the one or more processing units, allows practice of embodiments described herein, including at least as described in the figures and related text.
- the disclosed methods may be readily implemented in software, such as by using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms.
- the disclosed system may be implemented partially or fully in hardware, such as by using standard logic circuits or VLSI design. Whether software or hardware may be used to implement the systems in accordance with various embodiments of the present invention may be dependent on various considerations, such as the speed or efficiency requirements of the system, the particular function, and the particular software or hardware systems being utilized.
Landscapes
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
Description
- Embodiments of the present invention generally relate to observational testing, and, in particular, to a system and method for post-selection variable construction for data in an observational test.
- Observational studies are an important category of study designs. For some kinds of investigative questions (e.g., related to plastic surgery), randomized controlled trials may not always be indicated or ethical to conduct. Instead, observational studies may be the next best method to address these types of questions. Well-designed observational studies may provide results similar to randomized controlled trials, challenging the belief that observational studies are second-rate. Cohort studies and case-control studies are two primary types of observational studies that aid in evaluating associations between diseases and exposures.
- Well-designed randomized controlled trials (RCTs) have held the pre-eminent position in the hierarchy of evidence-based medicine (EBM) as level I evidence. However, RCT methodology, which was first developed for drug trials, can be difficult to conduct for some investigations (e.g., surgical cases). Instead, well-designed observational studies, recognized as level II or III evidence, can play an important role in deriving evidence for such investigations. Results from observational studies are often criticized for being vulnerable to influences by unpredictable confounding factors. However, comparable results between observational studies and RCTs are achievable. Observational studies can also complement RCTs in hypothesis generation, establishing questions for future RCTs, and defining clinical conditions.
- Observational studies fall under the category of analytic study designs, which are further sub-classified as observational or experimental study designs. The goal of analytic studies is to identify and evaluate causes or risk factors of diseases or health-related events. The differentiating characteristic between observational and experimental study designs is that in the latter, the presence or absence of undergoing an intervention defines the groups. By contrast, in an observational study, the investigator does not intervene and rather simply “observes” and assesses the strength of the relationship between an exposure and disease variable. Three types of observational studies include cohort studies, case-control studies, and cross-sectional studies. Case-control and cohort studies offer specific advantages by measuring disease occurrence and its association with an exposure by offering a temporal dimension (i.e., prospective or retrospective study design). Cross-sectional studies, also known as prevalence studies, examine the data on disease and exposure at one particular time point. Because the temporal relationship between disease occurrence and exposure cannot be established, cross-sectional studies cannot assess the cause and effect relationship.
- The word “cohort” is used in epidemiology to define a set of people followed over a period of time. In particular, “cohort” refers to a group of people with defined characteristics who are followed up to determine incidence of, or mortality from, some specific disease, all causes of death, or some other outcome.
- A well-designed cohort study can provide powerful results. In a cohort study, an outcome-free or disease-free study population is first identified by the exposure or event of interest, and then is followed in time until the disease or outcome of interest occurs. Because exposure is identified before the outcome, cohort studies have a temporal framework to assess causality and thus have the potential to provide the strongest scientific evidence. A cohort study is particularly advantageous for examining rare exposures because subjects are selected by their exposure status, and rates of disease may be calculated in exposed and unexposed individuals over time (e.g. incidence, relative risk). Additionally, an investigator can examine multiple outcomes simultaneously. However, the cohort study may be susceptible to selection bias. A cohort study may be large, particularly to study rare exposures, and require a large sample size and a potentially long follow-up duration of the study design, resulting in a costly endeavor.
- Cohort studies may be prospective or retrospective. Prospective studies are carried out from the present time into the future. Because prospective studies are designed with specific data collection methods, they have the advantage of being tailored to collect specific exposure data, and the data may be more complete. A disadvantage of a prospective cohort study may include the long follow-up period while waiting for events or diseases to occur. Thus, this study design is inefficient for investigating diseases with long latency periods and is vulnerable to a high loss to follow-up rate.
- In contrast, retrospective cohort studies are better indicated for timely and inexpensive study design. Retrospective cohort studies, also known as historical cohort studies, are carried out at the present time and look to the past to examine medical events or outcomes. A cohort of subjects, selected based on exposure status, is chosen at the present time, and outcome data (i.e. disease status, event status), which were measured in the past, are reconstructed for analysis. An advantage of the retrospective study design analysis is the immediate access to the data. The study design is comparatively less costly and shorter than prospective cohort studies. However, disadvantages of retrospective study design include the limited control the investigator has over data collection. The existing data may be incomplete, inaccurate, or inconsistently measured between subjects, for example, by not being uniformly recorded for all subjects.
- Conventionally, a cohort study defines the selected group of subjects by predetermined criteria (e.g., exposure to a substance, or having a particular medical condition, etc.) at the start of the investigation. A critical characteristic of subject selection is to have both the exposed and unexposed groups be selected from the same source population. Subjects who are not at risk for developing the outcome should be excluded from the study. The source population is determined by practical considerations, such as sampling. Subjects may be effectively sampled from the hospital, be members of a community, or from a doctor's individual practice. A subset of these subjects will be eligible for the study.
- When patient data is analyzed, multiple variables describing a person (e.g., age, gender, body mass index (BMI), whether or not the patient has diabetes, etc.) are manipulated. The multiple variables effectively describe criteria that are used as inputs to analysis processes to establish assertions about the statistical nature of the patients in a cohort study. The multiple variables may be represented as a patient vector, which describes the patient's various medical, geographical and demographic variables. The variables generally are produced from the previously described population's raw data, and often are created using covariates.
- A problem with this scenario is that the patient cohort definition and the output patient vector are produced in very different ways. Both the patient cohort definition and the output patient vectors require a deep understanding of the underlying data and how to construct clinical criteria in that data, both for data selection and for analytical variable creation. This requires full unfettered access to this data to produce the necessary criteria. This activity would normally be undertaken using scripts and code on a per study, per data set basis.
- Attempts have been made, and have failed, to adequately address the calculation of inferred selection criteria and inferred analytical variable construction from an observed population. Attempts in the background art generally involve the use of set theory visualization to compare populations across two attributes or data variables. However, when population selection may involve as many as 20-40 attributes, a set theory approach lacks scalability. Known solutions only allow comparison of two variables at a time and do not perform a population synthesis. Manual efforts to expand the analysis beyond two variables have many drawbacks, such as requiring costly expert labor to synthesize queries, being relatively slow, and not being adaptable to allow non-technical business users themselves to derive insights from large healthcare datasets.
- The demand for data science in health is increasing dramatically and is highlighted as one of the top growth areas across the entire global technology sector. Data scientists are highly skilled individuals with a rare combination of expertise that spans both advanced statistics and computer science. Paradoxically though, a drawback of the background art is that a significant proportion of data science activity is constantly reported as low-level data manipulation (i.e., “data wrangling”). This data manipulation is driven by the necessity to transform native data formats into a vector-based format required by the mathematics underlying data science theory.
- However, such manual selection methods for a retrospective cohort study may suffer from limited sample size or selection bias, or excessive cost. Therefore, what is needed is to combine the advantages of a retrospective cohort study without the disadvantages of difficult-to-use tools to define, find, and manipulate a cohort.
- Within the realm of EMR-based data science, and in order to overcome drawbacks of the background art, embodiments in accordance with the present disclosure define phenotypes in order to define a fundamental atomic building block to enable both data subset creation and vector creation, with phenotype vectors being the primary raw material of EMR-based data science. Embodiments provide a systematic process to determine the most significant factors that can be used to approximate a patient population group.
- Embodiments in accordance with the present disclosure provide a cohort definition and selection system for a computer having a memory, a central processing unit and a display, the system including: a cohort definition module to configure the memory according to a phenotype vector. The phenotype vector includes a patient ID to uniquely associate the phenotype vector to a patient, a plurality of demographic dimension fields, each demographic dimension field to describe a respective demographic aspect of the patient, a plurality of calculated dimension fields to describe calculated information related to the patient, and a plurality of potentially recursively defined phenotype-based dimension fields, each phenotype-based dimension field to indicate relevance of the respective phenotype-based dimension field to the patient.
- The preceding is a simplified summary of embodiments of the disclosure to provide an understanding of some aspects of the disclosure. This summary is neither an extensive nor exhaustive overview of the disclosure and its various embodiments. It is intended neither to identify key or critical elements of the disclosure nor to delineate the scope of the disclosure but to present selected concepts of the disclosure in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the disclosure are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.
- The above and still further features and advantages of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings wherein like reference numerals in the various figures are utilized to designate like components, and wherein:
-
FIGS. 1A, 1B, 1C illustrate vector representations of patient data; -
FIG. 2 illustrates an exemplary format for patient data as a phenotype, in accordance with an embodiment of the present disclosure; -
FIG. 3 illustrates an exemplary recursive phenotype definition specific to diabetes in accordance with an embodiment of the present disclosure; -
FIG. 4 depicts at a high level of abstraction a system in accordance with an embodiment of the present disclosure; -
FIG. 5 illustrates a process flow in accordance with an embodiment of the present disclosure; -
FIG. 6 illustrates a process for a second stage of processing, in accordance with an embodiment of the present disclosure; -
FIG. 7 illustrates components of a computing terminal, in accordance with an embodiment of the present disclosure; -
FIG. 8A illustrates a simplified set of EMR records for persons and events as known in the art; -
FIG. 8B illustrates an example of the output using methods known in the art; -
FIG. 9A illustrates a vector-based patient definition, in accordance with an embodiment of the present disclosure; and -
FIG. 9B illustrates an example of an output, in accordance with the present disclosure. - The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.
- The disclosure will be illustrated below in conjunction with an exemplary digital information system. Although well suited for use with, e.g., a system using a server(s) and/or database(s), the disclosure is not limited to use with any particular type of system or configuration of system elements. Those skilled in the art will recognize that the disclosed techniques may be used in any system or process in which multi-dimensional criteria are used to make an imperfect matching selection from among an available population that shares at least some of these criteria.
- The exemplary systems and methods of this disclosure will also be described in relation to software, modules, and associated hardware. However, to avoid unnecessarily obscuring the present disclosure, the following description omits well-known structures, components and devices that may be shown in block diagram form, are well known, or are otherwise summarized.
- In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments or other examples described herein. In some instances, well-known methods, procedures, components and circuits have not been described in detail, so as to not obscure the following description. Further, the examples disclosed are for exemplary purposes only and other examples may be employed in lieu of, or in combination with, the examples disclosed. It should also be noted the examples presented herein should not be construed as limiting of the scope of embodiments of the present invention, as other equally effective examples are possible and likely.
- As used herein, the term “module” refers generally to a logical sequence or association of steps, processes or components. For example, a software module may comprise a set of associated routines or subroutines within a computer program. Alternatively, a module may comprise a substantially self-contained hardware device. A module may also comprise a logical set of processes irrespective of any software or hardware implementation.
- A module that performs a function also may be referred to as being configured to perform the function, e.g., a data module that receives data also may be described as being configured to receive data. Configuration to perform a function may include, for example: providing and executing sets of computer code in a processor that performs the function; providing provisionable configuration parameters that control, limit, enable or disable capabilities of the module (e.g., setting a flag, setting permissions, setting threshold levels used at decision points, etc.); providing or removing a physical connection, such as a jumper to select an option, or to enable/disable an option; attaching a physical communication link; enabling a wireless communication link; providing electrical circuitry that is designed to perform the function without use of a processor, such as by use of discrete components and/or non-CPU integrated circuits; setting a value of an adjustable component (e.g., a tunable resistance or capacitance, etc.), energizing a circuit that performs the function (e.g., providing power to a transceiver circuit in order to receive data); providing the module in a physical size that inherently performs the function (e.g., an RF antenna whose gain and operating frequency range is determined or constrained by the physical size of the RF antenna, etc.), and so forth.
- As used herein, the term “transmitter” may generally comprise any device, circuit, or apparatus capable of transmitting a signal. As used herein, the term “receiver” may generally comprise any device, circuit, or apparatus capable of receiving a signal. As used herein, the term “transceiver” may generally comprise any device, circuit, or apparatus capable of transmitting and receiving a signal. As used herein, the term “signal” may include one or more of an electrical signal, a radio signal, an optical signal, an acoustic signal, and so forth.
- The term “computer-readable medium” as used herein refers to any tangible storage and/or transmission medium that participates in storing and/or providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, NVRAM, or magnetic or optical disks. Volatile media includes dynamic memory, such as main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, magneto-optical medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, solid state medium like a memory card, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read. A digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. When the computer-readable media is configured as a database, it is to be understood that the database may be any type of database, such as relational, hierarchical, object-oriented, and/or the like. Accordingly, the disclosure is considered to include a tangible storage medium or distribution medium and prior art-recognized equivalents and successor media, in which the software implementations of the present disclosure are stored.
- At the present time, large-scale routine healthcare databases are amassed and maintained based upon data gathered by healthcare providers and healthcare insurers. For example, a patient who submits to routine health care such as a yearly checkup, regularly-scheduled pap smears or mammograms, or visits for acute but relatively minor problems such as an infection, stitches, or broken bone, will have associated with them a series of healthcare records over time. Healthcare records may also include information related to non-routine care such as emergency room visits, hospital admissions, or other serious healthcare events. The healthcare records may document the progress over time of chronic conditions such as cholesterol levels, high blood pressure, and the like. The healthcare records may also include demographic information such as age, ethnicity, height, weight, and so forth. Because a large portion of the population has access to and uses health care, and the portion may grow in future years due to the Affordable Care Act or its successor, such data is a vast source of information over a large portion or cross-section of the population, representing persons of many different characteristics, risk factors, and so forth. The data for any individual patient may also be available over an extended period of time such as a period of years, so that changes in slowly-progressing medical conditions or slowly-changing patient characteristics may be captured by the data.
- In the United Kingdom (UK), such healthcare records may include sources such as the Clinical Practice Research Datalink (CPRD) primary care database (GOLD), the hospital episode statistics (HES) and the Office for National Statistics (ONS) mortality data.
- For example, the CPRD, established (initially as GPRD) in the UK in 1987, is a medical records database that general practitioners (GPs) use as the primary means of tracking patient clinical information. The total population in the CPRD exceeds nine million patients with over 35 million person-years of follow-up between 1987 and 2002. About 5% of the UK population is in the CPRD, which is broadly representative of the general UK population in terms of age, sex and geographic distribution. The CPRD, which contains information on diagnoses and medications, was established with the intent of allowing researchers to conduct high quality epidemiologic studies and has been used in more than 200 peer-reviewed publications. All information is recorded by the GP or a member of the office staff as part of the patient's medical record. Approximately 1,500 general practitioners representing 500 practices across the UK participated in the CPRD between 1987 and 2001. GPs are trained in data entry and their data are reviewed by administrators at the CPRD to ensure that they are of sufficient quality for research studies.
- Healthcare analysis and research increasingly may rely upon the use of such large-scale routine healthcare databases, in particular for retrospective cohort studies. Such databases, because of the coverage over a large portion or cross-section of the population, representing persons of many different characteristics, risk factors, and so forth, may reduce the drawbacks of traditional retrospective cohort studies such as existing data being incomplete, inaccurate, or inconsistently measured between subjects, for example, by not being uniformly recorded for all subjects. Standardized tests for blood work, pap smear, and other routine procedures encourage uniformity and completeness of monitored healthcare parameters.
- To work with large-scale routine healthcare databases for any use, the definition of the relevant population under study is the first step and an important step. There may be more than one relevant population, for example, a first population that has developed a particular condition, and a second population that has not developed the particular condition as of the time of selection. The selection criteria form an important part of protocols (i.e., population criteria and analysis plan) used for clinical trial and health outcomes studies.
- In observational studies, vectorized data forms the basis of many statistical analysis techniques. A problem of the background art is that patient data is seldom available in vectorized form without significant data manipulation. Patient data is typically transactional and time-based (i.e., “longitudinal”). Patient data primarily includes two classes of data, i.e., “people” and “events”. People data refers to the patient or enrollee (e.g., a spouse for spousal insurance coverage), enrollment related data (e.g., dates of coverage, exclusions, deductibles, employer, etc.). Event data refers to things that happen to patients, e.g., diagnoses, therapies, procedures, etc.
- Significant computational and intellectual data manipulation is required to convert a transactional electronic medical record (EMR) data structure into a research-ready, vector-based structure. In the background art, this intellectual and computational data manipulation is specific to a native EMR data structure and hence is not readily portable from one data set to another.
- The data manipulation (sometimes termed “data wrangling”) tends to cover two primary activities. First, it may refer to cutting a subset of data from source databases that are relevant to a study being undertaken. Second, it may refer to creating a research-ready data format for that data, i.e., a vector-based data format that can be used as input to the processes and calculations of data science. Conventionally, data wrangling is a low-level, labor-intensive, data-set-specific activity; thus, a higher-level, data-set-portable, less labor-intensive method is needed.
- However, embodiments improve upon the background art by recognizing that if the data science processes had been defined with respect to standardized vector formats, then the processes should be portable across different data-sets. This positions a vector format as a central, data-set portable pivot point for data science. In vector form, valuable data science processes (e.g., cohort matching, regression analysis and clustering, described in FIGS. 1A-1C, respectively) may be applied to the vector formats, and help enable the analysis and processes to be more portable from one analytic study to another.
- Embodiments in accordance with the present disclosure convert medical data to phenotype vectors. With processes and systems designed around phenotype definitions, data manipulation and vector-production may be largely automated, thus enabling a dramatic increase in data science analytic output, e.g., a four-fold capacity increase may be realized. Furthermore, embodiments help enable portability of data and analytic processes, as opposed to processes tied to a data format that is specific only to a predetermined database. High-level tools for phenotype vector production have a potential to drive significant gains in output and productivity of data analysis.
- Data with multiple attributes may be represented as a vector in a multidimensional space, with each dimension of the vector representing one attribute, and taking on values within an allowable range of values for the attribute. Geometrically, the vector at least in two or three dimensions may be represented as an arrow, with a magnitude and direction in an axis corresponding to the sign and magnitude of a corresponding dimension of the vector.
- Two vectors can then be thought of as “close” if the distance between their end-points is small. An error may be calculated as a function of the distance between the vectors. For example, for a vector X=(x1, x2, x3) and a vector Y=(y1, y2, y3), one measure of the difference between X and Y is given by Equation (1) below.
-
\lvert X - Y \rvert = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2} \qquad (1)
- Equation (1) represents the Euclidean geometric distance between the vectors. More generally, different metric functions may be used to define the distance between the vectors, taking into account, for example, the statistical properties of each dimension. Most generally, the distance between X and Y is given by a metric function M as shown below in Equation (2):
-
distance = \lvert X - Y \rvert = M(X, Y) \qquad (2)
- In some embodiments, a weighting function may be applied to the difference in each dimension, and for an overall sum. Weighting functions may be useful if the respective vector dimensions have unequal importance for the purpose of patient selection. For example, a distance function (i.e., error function) may take the form shown in Equation (3) for a three-dimensional vector, in which the dimensional weighting functions G( ), H( ) and I( ) may be, e.g., triangle functions, exponential decay functions, step functions, etc., and function F( ) may be, e.g., a summation function, a multiplication, a root, a power, a ratio, or some combination thereof. Different weighting functions may be used for some dimensions compared to other dimensions, e.g., in order to give unequal weight to different dimensions. However, not all dimensional weighting functions need to be different from other dimensional weighting functions. Equation (3) may be extended to additional dimensions by use of additional weighting functions.
-
\lvert X - Y \rvert = F(G(x_1 - y_1), H(x_2 - y_2), I(x_3 - y_3)) \qquad (3)
- Other distance metrics may be used instead of the embodiment shown in Equation (3), as known by persons of skill in the art. For example, a distance metric may include one or more of a Mahalanobis distance metric and a joint weighting function of more than one dimension. A joint function may be useful if, e.g., some dimensions are cross-correlated. For example, separate dimensions for patient weight and patient BMI may be expected to be cross-correlated.
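- The distance measures of Equations (1) and (3) can be sketched in a few lines of Python. This is a minimal illustration only: the patient vectors, the three weighting functions standing in for G( ), H( ) and I( ), and the choice of a summation for F( ) are all assumptions made for the example and are not taken from the disclosure.

```python
import math
from typing import Callable, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (1): Euclidean distance between two vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def weighted_distance(
    x: Sequence[float],
    y: Sequence[float],
    weights: Sequence[Callable[[float], float]],
    combine: Callable = sum,
) -> float:
    """Equation (3): apply one weighting function per dimension, then combine with F."""
    return combine(w(xi - yi) for w, xi, yi in zip(weights, x, y))


# Hypothetical weighting functions (assumptions for illustration only).
def g_age(diff: float) -> float:
    return abs(diff) / 10.0  # age difference, scaled down


def h_bmi(diff: float) -> float:
    return abs(diff)  # BMI difference, unscaled


def i_flag(diff: float) -> float:
    return 0.0 if diff == 0 else 5.0  # step penalty for a mismatched flag


# Hypothetical patient vectors: (age in years, BMI, diabetes flag).
x = (50.0, 27.0, 1.0)
y = (47.0, 23.0, 1.0)

print(euclidean(x, y))
print(weighted_distance(x, y, (g_age, h_bmi, i_flag)))
```

Passing a different `combine` callable (for example `max`) changes F( ) without touching the per-dimension weights, which is one way to keep the two design decisions separate.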
- The representations of Equations (1)-(3) are useful because all data science is grounded in some underlying formal mathematical theory, and that mathematical theory is almost entirely vector based.
- As applied to analysis of patient data, embodiments may manipulate multiple variables for a single person. Patient characteristics may be represented as a multi-dimensional vector. Patient characteristics may include sociodemographic factors (e.g., age, sex, place of residence, etc.), clinical factors (e.g., comorbidities, medical history, genetic history, blood type, medications used in the week prior to presentation, functional status, immunization history, smoking status, drinking status, etc.), and laboratory data. Dozens of characteristics may be relevant or possibly relevant. Relevancy may be dependent upon the type of study and/or objective of the study, and may be informed by existing medical knowledge. For example, patient weight may be more relevant to a diabetes study than patient eye characteristics, but patient eye characteristics may have more relevance to a study of eye disease. In this case, the selection criteria may give greater weight to characteristics relevant to an objective of selecting the cohort.
-
FIG. 1A illustrates three patient vectors (i.e., e1, e2, e3) for an exposed cohort, compared to three nearest possible matches in a matched cohort (i.e., m1, m2, m3). Matched patients are found by looking for patient vectors which are nearest in space to each patient in the exposed group, with “nearest” being calculated by a relationship such as one of Equations (1)-(3). There are many additional techniques used for cohort matching (e.g., propensity score matching, principal component analysis, coarsened exact matching, and so forth). Even though these techniques differ in the metric they use to measure the distance between two patient vectors, they all still use vectors as their input. -
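- As a rough sketch of the nearest-vector matching described for FIG. 1A, the following Python pairs each exposed patient with the closest remaining candidate. The greedy one-to-one strategy, the two-dimensional vectors, and the patient identifiers are assumptions for illustration; the disclosure does not prescribe a particular matching algorithm.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, ...]


def match_cohort(
    exposed: Dict[str, Vector], candidates: Dict[str, Vector]
) -> Dict[str, str]:
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Uses Euclidean distance (Equation (1)) via math.dist (Python 3.8+).
    Assumes there are at least as many candidates as exposed patients.
    """
    available = dict(candidates)
    matches = {}
    for patient_id, vector in exposed.items():
        # Pick the unmatched candidate whose vector is nearest in space.
        best = min(available, key=lambda cid: math.dist(vector, available[cid]))
        matches[patient_id] = best
        del available[best]
    return matches


# Hypothetical two-dimensional patient vectors.
exposed = {"e1": (1.0, 1.0), "e2": (4.0, 4.0), "e3": (8.0, 1.0)}
candidates = {
    "m1": (1.5, 1.0),
    "m2": (4.0, 5.0),
    "m3": (7.0, 1.0),
    "m4": (20.0, 20.0),
}

print(match_cohort(exposed, candidates))
```

A greedy pass depends on the order in which exposed patients are visited; an optimal assignment would need a different algorithm, which is outside the scope of this sketch.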
FIG. 1B illustrates usage of a regression analysis technique usable in predictive or interpolative analytics. For example, if p1, p2 and p3 are three patient vectors in a training data set, then p4 may represent a predicted trend given the input data. Given a mathematical model (e.g., a linear trend, a polynomial trend, a seasonally adjusted trend, etc., that may be formulated with knowledge of the underlying causes of the trend), a best fit mathematical equation may be found for the points in space represented by the input vectors and the output to be predicted. These equations then are used to predict the output or outcome for arbitrary patients as represented by their input vectors. -
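- The best-fit idea described for FIG. 1B can be sketched with an ordinary least-squares line in one input dimension. The linear model and the three training points are assumptions chosen so the arithmetic is easy to follow; a real study would use many dimensions and a model informed by the underlying causes of the trend.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training points p1, p2, p3 as (input, outcome) pairs.
train_x = [1.0, 2.0, 3.0]
train_y = [2.0, 4.0, 6.0]

slope, intercept = fit_line(train_x, train_y)

# Predict the outcome for a new input, playing the role of p4.
p4 = slope * 4.0 + intercept
print(slope, intercept, p4)  # 2.0 0.0 8.0
```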
FIG. 1C illustrates an application of clustering processes, which may be used regularly in predictive analytics to predict potential markets for products. For example, p1, p2 and p3 may represent a first market sector or cluster of similar subjects, and p4, p5 and p6 may represent a second market sector or cluster of similar subjects. A distance between points in space represented by vectors is used to identify close neighbors and hence generate clusters of subjects that may be regarded as ‘alike’. - In some embodiments, each patient characteristic over a population of patients may be expressed as a statistic that represents the population as a whole. For example, the statistic may be in a form such as a histogram, a series of numeric ranges (e.g., 40-50 years old; 50-60 years old; 150-160 lbs; 160-170 lbs; etc), a series of qualitative ranges (e.g., non-drinker vs. social drinker vs. heavy drinker, etc.), and so forth. Other mathematical representations of the multi-dimensional vector may be possible. Patient characteristics may not be independent of each other, e.g., selection of a female gender characteristic may result in a smaller and lighter population of patients compared to a selection of a male gender characteristic. The data is complex and highly dimensional. Researchers have to make assumptions, based upon science, intuition or other data analysis, that involve structure that is believed to exist in the data but that cannot be observed directly. The data sets are large and growing with a never-ending stream of new data.
- Some patients may be classified by use of one or more population codes. The population codes, in turn, represent characteristics of interest to a retrospective cohort study. For example, one population coding system is ICD-10, which is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO). ICD-10 codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. The code set allows more than 14,400 different codes and permits the tracking of many new diagnoses. The codes can be expanded to over 16,000 codes by using optional sub-classifications. The detail reported by ICD can be further increased, with a simplified multi-axial approach, by using codes meant to be reported in a separate data field.
- Another population coding system is the Read code, which is the standard clinical terminology system used in General Practice in the United Kingdom (UK). Read codes support detailed clinical encoding of multiple patient phenomena including: occupation; social circumstances; ethnicity and religion; clinical signs, symptoms and observations; laboratory tests and results; diagnoses; diagnostic, therapeutic or surgical procedures performed; and a variety of administrative items (e.g. whether a screening recall has been sent and by what communication modality, or whether an item of service fee has been claimed). It therefore includes but goes significantly beyond the expressivity of a diagnosis coding system.
- Conventionally, synthesis of population selection rules also must be performed manually by such an expert. Synthesis is known as a process of reducing from potentially hundreds of patient population codes to a much smaller set of medical factors, the factors being referred to as inclusion factors or exclusion factors. For example, for a predetermined asthma population (e.g., patients that were initially diagnosed between 12-17 years of age) a medical researcher may decide to look at only patients who were treated with either of two drugs: inhaled corticosteroids (ICS) or fluticasone (i.e., an example of an inclusion criterion). Each of those drugs will have a specific code, which is usually less recognizable to medical researchers than the drug name itself. In addition to looking at these drugs, a medical researcher may also set another rule to study only patients who were treated in a primary care setting. However, in practice a rule to narrow a study only to patients who were treated in a primary care setting may not be significant, because virtually all asthma patients are treated in a primary care setting and thus the rule fails to narrow the population much. Manual synthesis may fail to recognize that such a rule is not significantly meaningful. Thus, manual synthesis may include such a criterion whereas an automated method may recognize that the criterion is not significantly meaningful and thus would not include the criterion in a summary.
- In the background art, synthesis of population selection rules is accomplished by constructing detailed queries in a structured language such as SQL. A query may have a large number (e.g., dozens) of components, and be in the form of: ((field_1=“value_1”) OR (field_1=“value_2”) AND (field_2=“value_3”) AND NOT (field_3=“value_4”) AND . . . OR . . . ). As can be appreciated, this is tedious to construct and difficult to tweak as a desired analytic inquiry changes.
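- To make the shape of such a query concrete, the sketch below expresses the same kind of Boolean selection as small composable predicates in Python rather than SQL. The helper names (`eq`, `any_of`, `all_of`, `negate`) are hypothetical, and the explicit grouping shown is only one possible reading of the parenthesized form above; the field and value names are the placeholders used in that form.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def eq(field_name: str, value: str) -> Predicate:
    """Leaf criterion: the named field equals the given value."""
    return lambda record: record.get(field_name) == value


def any_of(*preds: Predicate) -> Predicate:
    """OR across criteria."""
    return lambda record: any(p(record) for p in preds)


def all_of(*preds: Predicate) -> Predicate:
    """AND across criteria."""
    return lambda record: all(p(record) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """NOT of a criterion."""
    return lambda record: not pred(record)


# (field_1 = value_1 OR field_1 = value_2) AND field_2 = value_3 AND NOT field_3 = value_4
selection = all_of(
    any_of(eq("field_1", "value_1"), eq("field_1", "value_2")),
    eq("field_2", "value_3"),
    negate(eq("field_3", "value_4")),
)

records = [
    {"field_1": "value_2", "field_2": "value_3", "field_3": "other"},
    {"field_1": "value_1", "field_2": "value_3", "field_3": "value_4"},
]
selected = [selection(r) for r in records]
print(selected)  # [True, False]
```

Even in this compact form, every change to the inquiry means editing the nested expression by hand, which is the maintenance burden the paragraph above describes.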
- Embodiments in accordance with the present disclosure provide building blocks that may be useful to construct a patient vector to describe each respective patient, and to use the patient vectors to identify patient cohorts for further study. Embodiments may leverage an advantage that arises from having a common vector format used by multiple scientific groups. Embodiments will speed up the research process, allowing a deeper understanding of the methods applied to the common vector format, and allow patient descriptions to be transferred easily between individuals and computer systems.
- Embodiments build, extract, and store a common phenotype vector that is based on multiple patient medical databases, is reusable across multiple projects or studies, and is formatted in a way that isolates users from the underlying data.
- Embodiments in accordance with the present disclosure address a problem of vectorizing patient data by creating a framework to define the vector forms, and a system to convert old data to the vector form, and/or enforce the vector form for new data. Phenotypes and phenotype vectors are a useful paradigm to create or reformat vectorized patient data, to define the dimensions of those vectors in a portable manner, and to perform data science on patient data.
- A phenotype may be defined as a set of observable characteristics of an individual resulting from the interaction of its genotype with the environment. Embodiments provide a specific implementation that enables rapid, generalized phenotype-vector production from EMR databases. More generally, a phenotype may be defined as an arbitrary Boolean combination of demographic information, code lists, or lists of values representing conditions, drugs, observations, procedures etc. Each code or value list may include some absolute or relative time (i.e., temporal) constraints, and we may additionally specify time relationships between individual lists, e.g., people who have a severe asthma diagnosis after being diagnosed with ADHD.
-
FIG. 2 illustrates an exemplary format for patient data as phenotype 200, in accordance with an embodiment of the present disclosure. Phenotype 200 associates a patient ID field 201 with several categories of patient-related data, such as demographic dimensions 203, calculated dimensions 205, and phenotype-based dimensions 207. For example, demographic dimensions 203 may include a binary gender field 231 (e.g., 1=“male”, 0=“female”) and a numeric age field 233; calculated dimensions 205 may include an age at index date field 251 and a duration of therapy field 253 (e.g., number of days); phenotype-based dimensions 207 may include a plurality of binary fields, each of which indicates whether the patient indicated by patient ID field 201 is associated with the condition indicated by the respective binary field.
- For example, an association may include whether (either presently or in the past) the patient has been diagnosed with a predetermined condition, or whether the patient has ever been subject to a predetermined medical procedure, or whether the patient suffers from a predetermined disease or condition. Each binary field may be indicated with 1=“true” and 0=“false”. Exemplary binary fields may include attention deficit hyperactivity disorder (ADHD) field 271, procedure “X” field 272, Asthma field 273, therapy “Y” field 274, and so forth. An exemplary phenotype vector may be V_p(700333xx) = (1, 18, 14, 177, 0, 1, 1, 0, . . . ).
- A general definition of a phenotype may be expressed in regular expression form as shown below:
-
- Phenotype = Boolean and time-related combination of:
- [Lists of conditions (optionally time-bound) |
- Lists of drugs (optionally time-bound) |
- Lists of observations (optionally time-bound) |
- List of procedures (optionally time-bound) |
- Phenotypes (optionally time-bound)]
- Phenotype =Boolean and time related Combination of:
- Examples of “time-bound” may include a specification that certain conditions or constraints apply (or do not apply) only over a limited period of time, or only before a predetermined date or event (an event including, e.g., a procedure or an observation), or only after a predetermined date or event, or only in a predetermined sequence (e.g., that a first procedure or observation occurs only before a second procedure or observation, and not after or at the same time as the second procedure or observation), and so forth.
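- One way to read the time-bound constraints above is as predicates over dated events. The following Python sketch is illustrative only and is not taken from the disclosure; the `Event` record, the helper names `has_code_between` and `occurs_after`, and the code strings are hypothetical. It shows a window constraint and a sequence constraint (a severe asthma diagnosis strictly after an ADHD diagnosis).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical EMR event: one coded occurrence for one patient."""
    patient_id: str
    code: str
    when: date


def has_code_between(events: Sequence[Event], codes: Set[str],
                     start: Optional[date] = None,
                     end: Optional[date] = None) -> bool:
    """True if any event carries a listed code inside the optional window."""
    for ev in events:
        if ev.code not in codes:
            continue
        if start is not None and ev.when < start:
            continue
        if end is not None and ev.when > end:
            continue
        return True
    return False


def first_date(events: Sequence[Event], codes: Set[str]) -> Optional[date]:
    """Earliest date on which any listed code occurs, or None."""
    dates = [ev.when for ev in events if ev.code in codes]
    return min(dates) if dates else None


def occurs_after(events: Sequence[Event], later_codes: Set[str],
                 earlier_codes: Set[str]) -> bool:
    """True if a `later_codes` event falls strictly after the first
    `earlier_codes` event (same-day events do not count)."""
    anchor = first_date(events, earlier_codes)
    if anchor is None:
        return False
    return any(ev.code in later_codes and ev.when > anchor for ev in events)


# Made-up history for one patient; the code strings are placeholders.
history = [
    Event("p1", "ADHD", date(2015, 3, 1)),
    Event("p1", "ASTHMA_SEVERE", date(2016, 7, 9)),
]
asthma_after_adhd = occurs_after(history, {"ASTHMA_SEVERE"}, {"ADHD"})
```

- Treating each constraint as a small predicate keeps the time logic separate from the code lists, so the same code list can be reused with different windows.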
- As is evident, phenotypes provide dimensional definition to enable the conversion of EMR data to vectors. Phenotype vectors then can be used as raw material for EMR-based data analytics. Embodiments may include a library of phenotype definitions that provide core templates for both data selection (e.g., through use as inclusion and exclusion criteria) and for vector production (e.g., through use as dimension definitions).
- For EMR data, an initial, very simplistic view of a phenotype might include a single code list, e.g., “does a patient take metformin”. This might expand to around 1,000 individual codes, but it is a single phenotype that will eventually be represented as a single dimension in the patient's phenotype vector, indicating their metformin usage.
- The last clause of the general definition of a phenotype provides a recursive definition. The recursive definition allows arbitrarily complex phenotypes to be defined by consuming and combining definitions of other, child phenotypes to substantially any level of depth. For example, a top-level phenotype may include a field or code to indicate that a patient suffers from diabetes, and a pointer to a diabetes child phenotype.
- FIG. 3 illustrates an exemplary recursive phenotype definition 300 specific to diabetes. Definition 300 may be described in a Boolean sense as shown below in Equation (3):
- [Diabetes code list 201] OR [Metformin NDC codelist 203] OR [Insulin NDC codelist 205] AND NOT [Polycystic Ovary Syndrome 207] (3)
- Polycystic Ovary Syndrome 207 itself may be another phenotype, with subfields
- Alternate but similar recursive phenotype definitions may be provided in addition to phenotype definition 300. For example, the diabetes phenotype may provide an expansion of a diabetic condition (e.g., type 1, type 2, gestational, whether the patient is taking insulin, A1C level, etc.), and a pointer to a further recursed child phenotype such as a Type 2 phenotype. The Type 2 child phenotype in turn may provide an expansion of the type 2 condition, e.g., the presence or absence of relevant genetic conditions such as genetic defects of β-cell function, genetic defects in insulin processing or insulin action, exocrine pancreatic defects, endocrinopathies, infections, prescribed drugs, and so forth. This recursion may be repeated indefinitely.
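- The recursive structure of Equation (3) can be sketched as a small expression tree in Python. This is an illustration under stated assumptions, not the disclosed implementation: the class names (`CodeList`, `AnyOf`, `AllOf`, `Not`) and the code strings are hypothetical, and the grouping `(diabetes OR metformin OR insulin) AND NOT PCOS` is one assumed reading of Equation (3), which does not state operator precedence.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


class Phenotype:
    """Base type: a phenotype answers yes/no for a patient's set of codes."""
    def matches(self, patient_codes: Set[str]) -> bool:
        raise NotImplementedError


@dataclass(frozen=True)
class CodeList(Phenotype):
    name: str
    codes: FrozenSet[str]

    def matches(self, patient_codes: Set[str]) -> bool:
        # True if the patient has at least one code from the list.
        return bool(self.codes & patient_codes)


@dataclass(frozen=True)
class AnyOf(Phenotype):
    children: Tuple[Phenotype, ...]

    def matches(self, patient_codes: Set[str]) -> bool:
        return any(c.matches(patient_codes) for c in self.children)


@dataclass(frozen=True)
class AllOf(Phenotype):
    children: Tuple[Phenotype, ...]

    def matches(self, patient_codes: Set[str]) -> bool:
        return all(c.matches(patient_codes) for c in self.children)


@dataclass(frozen=True)
class Not(Phenotype):
    child: Phenotype

    def matches(self, patient_codes: Set[str]) -> bool:
        return not self.child.matches(patient_codes)


# Placeholder code lists; real lists would hold many diagnosis or NDC codes.
diabetes_dx = CodeList("diabetes_dx", frozenset({"DIABETES_DX"}))
metformin = CodeList("metformin_ndc", frozenset({"METFORMIN"}))
insulin = CodeList("insulin_ndc", frozenset({"INSULIN"}))

# PCOS is itself a phenotype built from a child, showing the recursion.
pcos = AnyOf((CodeList("pcos_dx", frozenset({"PCOS_DX"})),))

# Assumed grouping of Equation (3).
diabetes = AllOf((AnyOf((diabetes_dx, metformin, insulin)), Not(pcos)))
```

- Because every node exposes the same `matches` method, a phenotype can be nested inside another to any depth without the evaluator changing.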
- FIG. 4 depicts at a high level of abstraction a system 400 that may be used in the definition and analysis of cohorts using phenotype vectors, according to an embodiment of the present disclosure. The system 400 may include a communication network 408 that is in communication with computing terminal 412. Exemplary types of external communication devices 412 include, without limitation, desktop Personal Computers (PCs), laptops, netbooks, tablets, thin clients, other smart computing devices, and the like that are accessible via a network. The communication link may operate by methods or protocols such as Ethernet, Wi-Fi, and so forth. The computing power of computing terminal 412 may be used at least in part to manage communications with other portions of system 400 described below.
- The communication network 408 may be packet-switched and/or circuit-switched. An exemplary communication network 408 includes, without limitation, a Wide Area Network (WAN), such as the Internet, a Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular communications network, or combinations thereof. In one configuration, the communication network 408 is a public network supporting the TCP/IP suite of protocols.
- System 400 may further include server 444, which is coupled to communication network 408 via transceiver 446. Transceiver 446 may support well-known communication or networking protocols such as Ethernet, Wi-Fi, and so forth. Server 444 may be capable of hosting and/or executing one or more application programs 452 (“apps” or “applications”). For example, server 444 may provide a phenotype execution engine as one of application programs 452. The phenotype execution engine provides a computing platform that allows data scientists to create and to share phenotype definitions, and then to execute those phenotype definitions against large data sets. By executing the phenotype definitions against large data sets, data scientists are able to: (1) rapidly cut data from databases using phenotypes as inclusion and exclusion criteria; and (2) build patient vectors for the selected data using phenotypes as dimension definitions.
- Server 444 may be a software-controlled system including a processor 454 coupled to a tangible memory 456. Memory 456 may comprise random access memory (RAM), a read-only memory (ROM), or combinations of these and other types of electronic memory devices. Memory 456 may be used for various purposes such as to store code (e.g., application programs 452) and working memory used by processor 454. Various other server 444 components such as communication interface modules, power management modules, etc. are known by persons of skill in the art of computer design, but are not depicted in FIG. 4 in order to avoid obscuring the main elements of system 400.
- Server 444 may be coupled to a database 462, either directly or through communication network 408 as illustrated in FIG. 4. Database 462 may also be separate from server 444 (as illustrated in FIG. 4), or be incorporated into server 444. Database 462 may be used to store an available universe of patient data (e.g., the GPRD). Database 462 may represent a plurality of physically dispersed databases that are communicatively coupled together.
- The elements of system 400 are shown in FIG. 4 for purposes of illustration only and should not be construed as limiting embodiments of the present invention to any particular arrangement of elements. Various other system components such as a gateway, a firewall, etc. are known by persons of skill in the art of computer networking, but are not depicted in FIG. 4 in order to avoid obscuring the main elements of system 400.
- FIG. 5 illustrates a process flow 500 to use system 400, in accordance with an embodiment of the present disclosure. Process flow 500 would be controlled by a data analyst at computing terminal 412. Data may be read from source EMR databases such as database 462. Database-independent phenotype definitions may be provided by the data analyst, and/or read from a memory such as database 462. The data analyst at computing terminal 412 then may apply or not apply criteria for a study, by way of manipulating phenotype criteria (e.g., inclusion criteria and/or exclusion criteria) for data selection. The data analyst may define and apply these criteria by way of a graphical interface at computing terminal 412 to produce a cohort definition.
- FIG. 5 depicts the use of database-independent phenotype definitions to build inclusion and exclusion criteria (i.e., cohort definitions). These cohort definitions can be used across multiple source EMR databases to produce data subsets for subsequent data science-based research. However, at this point the data is not “research ready” because it is not structured as vectors. Instead, the data is still notionally structured in its natural EMR-based format.
- FIG. 6 illustrates a process 600 for a second stage of processing. FIG. 6 uses the same library of database-independent phenotype definitions as in process 500, but this time to define vector dimensions. After the cohort definition is determined in process 500, a phenotype engine executing in server 444 (e.g., executing as one of application programs 452) will apply the cohort definition and produce a data subset that meets the cohort definition. The data subset will be returned as phenotype vectors. The data subset then may be stored in a memory (e.g., in a separate portion of database 462) for future study. Once the EMR data subsets are passed through the process of FIG. 6, the subsets are converted to a “research-ready” vector format and can be used as input to data science routines. For example, a set of phenotype vectors meeting the cohort definition and/or processed by the data science routines may be used to identify patients for a cohort study, including identifying patients to recruit if the study is a prospective study.
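- The two stages described for FIGS. 5 and 6 (select a cohort with inclusion and exclusion criteria, then reuse phenotype definitions as vector dimensions) can be sketched in Python as follows. This is a simplified illustration, not the phenotype engine itself: the dictionary-based patient records, the function names `select_cohort`, `to_vectors`, and `has_code`, and the sample codes are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Patient = Dict[str, object]
Criterion = Callable[[Patient], bool]
Dimension = Tuple[str, Callable[[Patient], object]]


def has_code(code: str) -> Criterion:
    """Build a phenotype-like predicate that checks for one code."""
    def check(patient: Patient) -> bool:
        return code in patient["codes"]
    return check


def select_cohort(patients: Sequence[Patient],
                  include: Sequence[Criterion],
                  exclude: Sequence[Criterion]) -> List[Patient]:
    """Stage 1: keep patients meeting every inclusion and no exclusion criterion."""
    return [p for p in patients
            if all(c(p) for c in include) and not any(c(p) for c in exclude)]


def to_vectors(cohort: Sequence[Patient],
               dimensions: Sequence[Dimension]) -> Dict[str, Tuple[int, ...]]:
    """Stage 2: evaluate each dimension definition once per cohort member."""
    return {p["id"]: tuple(int(fn(p)) for _, fn in dimensions) for p in cohort}


# Made-up records standing in for a source EMR database.
patients = [
    {"id": "a", "age": 42, "codes": {"DIABETES"}},
    {"id": "b", "age": 35, "codes": {"DIABETES", "PCOS"}},
    {"id": "c", "age": 67, "codes": {"ASTHMA"}},
]

cohort = select_cohort(patients,
                       include=[has_code("DIABETES")],
                       exclude=[has_code("PCOS")])
vectors = to_vectors(cohort, [("age", lambda p: p["age"]),
                              ("asthma", has_code("ASTHMA"))])
```

- In this sketch the same `has_code` predicate serves both as a selection criterion and as a dimension definition, mirroring the reuse of one phenotype library across both stages.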
- FIG. 7 illustrates components of computing terminal 412. As illustrated, in this embodiment, computing terminal 412 is a typical desktop or mobile computing device having basic functions. Computing terminal 412 has a user input interface 751 for receiving input from a user (e.g., a keyboard, touchscreen and/or microphone), and a user output interface 753 is provided for presenting information visually or audibly to the user. Computing terminal 412 also includes memory 755 for storing an operating system that controls the main functionality of computing terminal 412, along with a number of applications that are run on computing terminal 412, and data. A processor 757 executes the operating system and applications. Computing terminal 412 may have a unique hardware identification code that permits identification of computing terminal 412 (e.g., a medium access control (MAC) address). At least a portion of memory 755 may be encrypted. A communications interface 759 permits communications with communication network 408, e.g., by way of an Ethernet or Wi-Fi interface. A user may use computing terminal 412 in order to control the practice of embodiments described herein, and to receive and review results of the embodiments.
- FIG. 8A illustrates a simplified set of EMR records for persons and events, in accordance with an embodiment of the present disclosure. The simplified set of EMR records is useful to illustrate a process and paradigm for patient vector generation from existing user-generated phenotype definitions. Embodiments support the production of phenotype definitions to be applied to EMR datasets, either singly or within a time-based Boolean logic expression engine. These phenotype definitions can be codelists, test results and values, demographic details, derived variables and other entities available to embodiments in an EMR dataset, and may recursively include other phenotype definitions.
- In the exemplary EMR person structure of FIG. 8A, each patient is assigned a unique patient key (PK), and the patient key is associated with a number of different characteristics for each patient, such as gender, age, geography, BMI, and so forth. In the exemplary EMR event structure, each event is associated with one patient through the PK field, and each event is associated with various characteristics such as event type, and optional value fields relevant to the event type. Any one patient may have any number of associated events, including zero associated events.
- In the background art, the structure of FIG. 8A may be interrogated by building simple phenotypes and combining them in Boolean expressions. Time-based criteria may be supplied, instead of or in addition to event-based criteria. Each card may represent one Boolean condition. For example, phenotype_1 may be “Gender=M”, phenotype_2 may be “Age=30-50”, phenotype_3 may be “BMI≥20” and phenotype_4 may be “EventType=Diabetes”. An additional phenotype may be constructed as a Boolean combination of the simple phenotypes, e.g., a Boolean AND of the simple phenotypes, or a more complex relationship including other Boolean operators (e.g., OR, XOR) and parenthetical groupings.
- Next, the background art would apply the overall Boolean condition to the patient and event data, and export the result in one of various supported formats, e.g., as a native or single row based view for each patient event. This export type may be relatively large, and contain all data regardless of data science needs. An example of the output using methods of the background art is illustrated in FIG. 8B.
- In contrast, embodiments in accordance with the present disclosure may transform data into a vector-based output, by reusing the phenotype definition paradigm and applying the same definition template structure to a population to create a patient vector.
- FIG. 9A illustrates a vector-based patient definition, with each phenotype definition representing one of a Boolean condition, a value field, and a bucketed (i.e., range) value. The phenotype definitions can be used to output value-based, value-bucket based or binary (“one shot”) data in the vector.
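- The three kinds of dimension named for FIG. 9A (binary, value, and bucketed) can be illustrated with a short Python sketch that pivots person and event rows into one vector per patient. The table contents, the `BMI_EDGES` boundaries, and the function names `bucket` and `patient_vector` are hypothetical and are not taken from the figures.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# Hypothetical person and event tables in the spirit of FIG. 8A.
persons = {
    1: {"gender": "M", "age": 44, "bmi": 27.5},
    2: {"gender": "F", "age": 31, "bmi": 19.0},
}
events = [
    {"pk": 1, "event_type": "Diabetes"},
    {"pk": 1, "event_type": "Asthma"},
]

# Assumed bucket boundaries; index 0 is below 18.5, index 3 is 30.0 and above.
BMI_EDGES = (18.5, 25.0, 30.0)


def bucket(value: float, edges: Sequence[float]) -> int:
    """Index of the range holding `value`; `edges` must be sorted ascending."""
    return bisect_right(edges, value)


def patient_vector(pk: int) -> Tuple[int, int, int, int]:
    """Pivot one patient's person row and event rows into a fixed-length vector."""
    person = persons[pk]
    event_types = {e["event_type"] for e in events if e["pk"] == pk}
    return (
        1 if person["gender"] == "M" else 0,      # binary dimension
        person["age"],                            # value dimension
        bucket(person["bmi"], BMI_EDGES),         # bucketed dimension
        1 if "Diabetes" in event_types else 0,    # binary dimension from events
    )


vectors = {pk: patient_vector(pk) for pk in persons}
```

- A patient with no events, such as patient 2 here, still receives a full-length vector, with event-derived dimensions set to 0.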
- FIG. 9B illustrates an example of the output using embodiments in accordance with the present disclosure. Output from the new pivoted view of the data becomes a patient vector, a smaller, more focused output for data science, containing only the values that are required for specific observational research on a population. FIG. 9B illustrates a new data structure that has been derived from the person EMR data, filtered by the criteria of the phenotype definitions, the new data structure being a set of patient vectors in which each element of a respective patient vector is populated by a value within a range defined for the respective element.
- Specificity of cohort selection may be limited ultimately by the size of the set of matches that is returned. If the criteria are too narrow, not enough matches will be returned to provide a statistically meaningful sample. Options in this case may include reducing the number of criteria, adjusting error functions in one of Equations (1)-(3) to allow greater error between an ideal characteristic and an actual characteristic, eliminating some selected criteria that may be highly correlated with other selected criteria, substituting one criterion for another if the criteria are correlated but one has a larger available population than the other, and so forth.
- Embodiments in accordance with the present disclosure are usable in other fields of study besides cohort definition and selection in medical studies. Embodiments may be useful whenever multi-dimensional criteria are used to make an imperfect matching selection from among an available population that shares at least some of these criteria.
- Embodiments of the present invention include a system having one or more processing units coupled to one or more memories. The one or more memories may be configured to store software that, when executed by the one or more processing units, allows practice of embodiments described herein, including at least as described in the figures and related text.
- The disclosed methods may be readily implemented in software, such as by using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware, such as by using standard logic circuits or VLSI design. Whether software or hardware may be used to implement the systems in accordance with various embodiments of the present invention may be dependent on various considerations, such as the speed or efficiency requirements of the system, the particular function, and the particular software or hardware systems being utilized.
- While the foregoing is directed to embodiments of the present invention, other and further embodiments of the present invention may be devised without departing from the basic scope thereof. It is understood that various embodiments described herein may be utilized in combination with any other embodiment described, without departing from the scope contained herein. Further, the foregoing description is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. Certain exemplary embodiments may be identified by use of an open-ended list that includes wording to indicate that the list items are representative of the embodiments and that the list is not intended to represent a closed list exclusive of further embodiments. Such wording may include “e.g.,” “etc.,” “such as,” “for example,” “and so forth,” “and the like,” etc., and other wording as will be apparent from the surrounding context.
- No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Further, the terms “any of” followed by a listing of a plurality of items and/or a plurality of categories of items, as used herein, are intended to include “any of,” “any combination of,” “any multiple of,” and/or “any combination of multiples of” the items and/or the categories of items, individually or in conjunction with other items and/or other categories of items.
- Moreover, the claims should not be read as limited to the described order or elements unless stated to that effect. In addition, use of the term “means” in any claim is intended to invoke 35 U.S.C. § 112(f), and any claim without the word “means” is not so intended.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/478,282 US11574707B2 (en) | 2017-04-04 | 2017-04-04 | System and method for phenotype vector manipulation of medical data |
US18/105,004 US20230178189A1 (en) | 2017-04-04 | 2023-02-02 | System and method for phenotype vector manipulation of medical data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/478,282 US11574707B2 (en) | 2017-04-04 | 2017-04-04 | System and method for phenotype vector manipulation of medical data |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/105,004 Continuation US20230178189A1 (en) | 2017-04-04 | 2023-02-02 | System and method for phenotype vector manipulation of medical data |
Publications (2)
Publication Number | Publication Date |
---|---|
US20180285526A1 (en) | 2018-10-04 |
US11574707B2 (en) | 2023-02-07 |
Family
ID=63670610
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/478,282 Active 2039-02-06 US11574707B2 (en) | 2017-04-04 | 2017-04-04 | System and method for phenotype vector manipulation of medical data |
US18/105,004 Pending US20230178189A1 (en) | 2017-04-04 | 2023-02-02 | System and method for phenotype vector manipulation of medical data |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/105,004 Pending US20230178189A1 (en) | 2017-04-04 | 2023-02-02 | System and method for phenotype vector manipulation of medical data |
Country Status (1)
Country | Link |
---|---|
US (2) | US11574707B2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112967817A (en) * | 2021-02-02 | 2021-06-15 | 武汉大学 | Epidemiological research population screening method based on medical big data and storage medium |
US20220044826A1 (en) * | 2018-12-31 | 2022-02-10 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US20230154628A1 (en) * | 2020-04-07 | 2023-05-18 | Nippon Telegraph And Telephone Corporation | Analysis apparatus, analysis method and program |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11493585B2 (en) * | 2018-06-29 | 2022-11-08 | Canon Medical Systems Corporation | Medical information processing apparatus and medical information processing method |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6216249B1 (en) * | 1999-03-03 | 2001-04-10 | Cirrus Logic, Inc. | Simplified branch metric for reducing the cost of a trellis sequence detector in a sampled amplitude read channel |
US20020065805A1 (en) * | 2000-11-30 | 2002-05-30 | Beals Thomas P. | Method for organizing laboratory information in a database |
US6417802B1 (en) * | 2000-04-26 | 2002-07-09 | Litton Systems, Inc. | Integrated inertial/GPS navigation system |
US20020119451A1 (en) * | 2000-12-15 | 2002-08-29 | Usuka Jonathan A. | System and method for predicting chromosomal regions that control phenotypic traits |
US20040172267A1 (en) * | 2002-08-19 | 2004-09-02 | Jayendu Patel | Statistical personalized recommendation system |
US20060052945A1 (en) * | 2004-09-07 | 2006-03-09 | Gene Security Network | System and method for improving clinical decisions by aggregating, validating and analysing genetic and phenotypic data |
US20070192139A1 (en) * | 2003-04-22 | 2007-08-16 | Ammon Cookson | Systems and methods for patient re-identification |
US20070239551A1 (en) * | 2006-03-30 | 2007-10-11 | Zeller Michelle G | Method and apparatus for a product ordering system |
US20070260492A1 (en) * | 2006-03-09 | 2007-11-08 | Microsoft Corporation | Master patient index |
US20110206246A1 (en) * | 2008-04-21 | 2011-08-25 | Mts Investments Inc. | System and method for statistical mapping between genetic information and facial image data |
US20150142331A1 (en) * | 2012-12-14 | 2015-05-21 | Celmatix, Inc. | Methods and devices for assessing risk of female infertility |
US20160275259A1 (en) * | 2013-11-01 | 2016-09-22 | Koninklijke Philips N.V. | Patient feedback for uses of therapeutic device |
US20180004803A1 (en) * | 2015-02-20 | 2018-01-04 | Hewlett-Packard Development Company, L.P. | Iterative visualization of a cohort for weighted high-dimensional categorical data |
US20190057182A1 (en) * | 2015-05-22 | 2019-02-21 | Csts Health Care Inc. | Biomarker-driven molecularly targeted combination therapies based on knowledge representation pathway analysis |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2732171C (en) * | 2008-08-28 | 2019-01-15 | Aureon Laboratories, Inc. | Systems and methods for treating, diagnosing and predicting the occurrence of a medical condition |
US20140257045A1 (en) * | 2013-03-08 | 2014-09-11 | International Business Machines Corporation | Hierarchical exploration of longitudinal medical events |
US20160283686A1 (en) * | 2015-03-23 | 2016-09-29 | International Business Machines Corporation | Identifying And Ranking Individual-Level Risk Factors Using Personalized Predictive Models |
CA2957002A1 (en) * | 2015-04-21 | 2016-10-27 | Medaware Ltd. | Medical system and method for predicting future outcomes of patient care |
WO2018175970A1 (en) * | 2017-03-24 | 2018-09-27 | The Brigham And Women's Hospitla, Inc. | Systems and methods for automated treatment recommendation based on pathophenotype identification |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220044826A1 (en) * | 2018-12-31 | 2022-02-10 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11875903B2 (en) * | 2018-12-31 | 2024-01-16 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US20230154628A1 (en) * | 2020-04-07 | 2023-05-18 | Nippon Telegraph And Telephone Corporation | Analysis apparatus, analysis method and program |
CN112967817A (en) * | 2021-02-02 | 2021-06-15 | 武汉大学 | Epidemiological research population screening method based on medical big data and storage medium |
Also Published As
Publication number | Publication date |
---|---|
US20230178189A1 (en) | 2023-06-08 |
US11574707B2 (en) | 2023-02-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: QUINTILESIMS INCORPORATED, CONNECTICUT Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WICKSON, JONATHAN;MURRAY, ROBIN;SIGNING DATES FROM 20170331 TO 20170402;REEL/FRAME:041839/0836 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: IQVIA INC., NEW JERSEY Free format text: CHANGE OF NAME;ASSIGNOR:QUINTILES IMS INCORPORATED;REEL/FRAME:047207/0276 Effective date: 20171106 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
AS | Assignment |
Owner name: U.S. BANK TRUST COMPANY, NATIONAL ASSOCIATION, MINNESOTA Free format text: SECURITY INTEREST;ASSIGNORS:IQVIA INC.;IQVIA RDS INC.;IMS SOFTWARE SERVICES LTD.;AND OTHERS;REEL/FRAME:063745/0279 Effective date: 20230523 |
|
AS | Assignment |
Owner name: BANK OF AMERICA, N.A., AS ADMINISTRATIVE AGENT, NORTH CAROLINA Free format text: SECURITY INTEREST;ASSIGNORS:IQVIA INC.;IMS SOFTWARE SERVICES, LTD.;REEL/FRAME:064258/0577 Effective date: 20230711 |
|
AS | Assignment |
Owner name: U.S. BANK TRUST COMPANY, NATIONAL ASSOCIATION, MINNESOTA Free format text: SECURITY INTEREST;ASSIGNOR:IQVIA INC.;REEL/FRAME:065709/0618 Effective date: 20231128 Owner name: U.S. BANK TRUST COMPANY, NATIONAL ASSOCIATION, MINNESOTA Free format text: SECURITY INTEREST;ASSIGNORS:IQVIA INC.;IQVIA RDS INC.;IMS SOFTWARE SERVICES LTD.;AND OTHERS;REEL/FRAME:065710/0253 Effective date: 20231128 |
|
AS | Assignment |
Owner name: U.S. BANK TRUST COMPANY, NATIONAL ASSOCIATION, MINNESOTA Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE CONVEYING PARTIES INADVERTENTLY NOT INCLUDED IN FILING PREVIOUSLY RECORDED AT REEL: 065709 FRAME: 618. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY AGREEMENT;ASSIGNORS:IQVIA INC.;IQVIA RDS INC.;IMS SOFTWARE SERVICES LTD.;AND OTHERS;REEL/FRAME:065790/0781 Effective date: 20231128 |