EP3446245A1 - Hospital matching of anonymized healthcare databases without obvious quasi-identifiers - Google Patents

Hospital matching of anonymized healthcare databases without obvious quasi-identifiers

Info

Publication number
EP3446245A1
Authority
EP
European Patent Office
Prior art keywords
databases
anonymized
pair
patient
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP17720392.4A
Other languages
German (de)
English (en)
Inventor
Reza SHARIFI SEDEH
Daniel Robert ELGORT
Roel Truyen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips NV
Publication of EP3446245A1
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • HIPAA Health Insurance Portability and Accountability Act
  • PII patient-identifying information
  • Information that needs to be anonymized includes patient name and/or medical identification number (suitably replaced by a randomly assigned number or the like), address, or so forth.
  • Other anonymization measures may include removing "rare" patients who might be identifiable by a combination of unusual characteristics - for example, a patient who is 102 years old with a particular illness might be identified on the basis of that information alone.
  • In addition to rare patients, a patient might be identifiable based on timestamp information for events recorded in the patient record. For example, if a patient is admitted to the hospital on a certain date with a certain condition, that information may be sufficient to narrow the number of possible patient identifications to a small number.
  • Longitudinal information, that is, the time sequence of events and the time intervals between various events, is sometimes useful in healthcare data analytics. For example, the time interval between admission and discharge may be useful or even critical for analyzing hospital efficiency and/or effectiveness of a certain treatment.
  • the timestamps are shifted by some random amount (generally different for each patient), using a rigid shift for all timestamped events of a given patient.
  • the random rigid time shift in timestamps makes patient identification via timestamp more difficult, while the use particularly of a rigid time shift retains the longitudinal information, i.e. the time interval information between events.
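The rigid shift described above can be illustrated with a minimal Python sketch. The function name `rigid_shift`, the `max_days` bound, and the example dates are assumptions made for illustration only; the source does not specify how the random offset is drawn.

```python
import random
from datetime import datetime, timedelta

def rigid_shift(events, max_days=365, rng=random):
    """Shift every timestamp of ONE patient by the same random offset.

    `events` is a list of (event_name, timestamp) pairs. The offset range
    (max_days) is a hypothetical choice, not taken from the source.
    """
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [(name, ts + offset) for name, ts in events]

# Hypothetical patient record: admission followed by discharge.
events = [
    ("admission", datetime(2015, 3, 1)),
    ("discharge", datetime(2015, 3, 9)),
]
shifted = rigid_shift(events, rng=random.Random(0))

# The absolute dates change, but the interval between events does not.
interval_before = events[1][1] - events[0][1]
interval_after = shifted[1][1] - shifted[0][1]
```

Because a single offset is applied to all of a patient's events, `interval_before` and `interval_after` are equal, which is the property that later allows matching on time intervals.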
  • an anonymized healthcare data source device comprises at least one electronic processor programmed to integrate N anonymized healthcare databases (10) where N is a positive integer having a value of at least three by performing a database integration process including the operations of: for a pair of databases (i,j) of the N anonymized healthcare databases, identifying a set of features each contained in both databases i and j of the pair of databases (i,j) and generating a conversion table matching patients of the pair of databases based on patient similarity measured by the set of features; repeating the identifying and generating operations for each unique pair of databases of the N anonymized healthcare databases to generate N(N-1)/2 conversion tables.
  • the at least one electronic processor is further programmed to perform a patient data retrieval process including the operation of retrieving patient data for one or more anonymized patients contained in the N anonymized healthcare databases using the N(N-1)/2 conversion tables.
  • an anonymized healthcare data source device comprises at least one electronic processor programmed to integrate a healthcare database i and a healthcare database j by performing a database integration process including the operations of: for the pair of databases (i,j), identifying a set of features each contained in both databases i and j of the pair of databases (i,j) including at least one longitudinal feature defined by a pair of timestamped events separated by a time interval Δt between the timestamps of the events and generating a conversion table matching patients of the pair of databases (i,j) based on patient similarity measured by the set of features including comparison of the time interval Δt for patients in the two databases (i,j).
  • the at least one electronic processor is further programmed to perform a patient data retrieval process including the operation of retrieving patient data for one or more anonymized patients contained in both anonymized healthcare databases (i,j) using the conversion table matching patients of the pair of databases (i,j).
  • a non-transitory storage medium stores instructions readable and executable by a computer to perform an anonymized population image reconstruction method to reconstruct an anonymized population image from N anonymized healthcare databases where N is a positive integer having a value of at least two.
  • the anonymized population image reconstruction method comprises: for a pair of databases (i,j) of the N anonymized healthcare databases, identifying a set of features each contained in both databases i and j of the pair of databases (i,j) and generating a conversion table matching patients of the pair of databases based on patient similarity measured by the set of features.
  • the identifying and generating operations are repeated for each unique pair of databases of the N anonymized healthcare databases to generate the anonymized population image comprising contents of the N anonymized healthcare databases integrated by the N(N-1)/2 conversion tables.
  • One advantage resides in providing for integration of two, three, four, or more anonymized healthcare databases to leverage the combined data contained in the databases for healthcare data analytic tasks.
  • Another advantage resides in providing for the foregoing in which one or more anonymized healthcare databases is an unstructured healthcare database.
  • Another advantage resides in providing the foregoing in which longitudinal information, that is, time intervals between events, is leveraged in matching anonymized patients in different anonymized healthcare databases.
  • a given embodiment may provide none, one, two, more, or all of the foregoing advantages, and/or may provide other advantages as will become apparent to one of ordinary skill in the art upon reading and understanding the present disclosure.
  • the invention may take form in various components and arrangements of components, and in various steps and arrangements of steps.
  • the drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
  • FIGURE 1 diagrammatically illustrates a medical analytics device that leverages anonymized patient data integrated from two or more anonymized healthcare databases.
  • FIGURE 2 diagrammatically illustrates an embodiment of the databases integration process performed by the device of FIGURE 1 configured to integrate three or more anonymized healthcare databases.
  • FIGURE 3 shows a table diagrammatically demonstrating criteria for selecting the features used to integrate different anonymized healthcare databases.
  • FIGURE 4 diagrammatically shows operation of the refinement component of the databases integration process embodiment of FIGURE 2.
  • FIGURE 5 diagrammatically shows an embodiment of the databases integration process of FIGURE 1 that leverages longitudinal information.
  • an “anonymized healthcare database” may be (by way of illustration): a medical records database, such as an anonymized database extracted from a comprehensive Electronic Medical Record (EMR) or a domain-specific medical database such as a cardiovascular information system (CVIS) or an intensive care unit (ICU) information system; an anonymized database extracted from a hospital billing department database; an anonymized database extracted from a medical insurance company database; an anonymized database extracted from a hospital admissions departmental database; or so forth.
  • An anonymized database extracted from a CVIS can be expected to contain medical records pertaining to diagnosis and treatment of cardiovascular disease, but may not include information on insurance coverage for those diagnoses/treatments.
  • an anonymized database extracted from the hospital billing department can be expected to contain insurance reimbursement information but not medical diagnosis/treatment data. Combining these databases could provide a more holistic image of the patient population; but the limited content overlap between the two databases which provides motivation for the integration also makes such integration challenging.
  • these problems are overcome by leveraging the integration of multiple (three or more) healthcare databases.
  • This can provide a greater degree of overlap overall, which motivates toward performing integration of the N databases in a single process; however, paradoxically it is disclosed herein that a more efficient and reliable approach for performing the integration is to first integrate each pair of anonymized healthcare databases, so as to generate a conversion table for each pair, and then refine the resulting N(N-1)/2 conversion tables based on consistency of patient matching between the N(N-1)/2 conversion tables.
  • This approach recognizes that the overlap of features between the N databases is likely to be small, and moreover even where overlap is present certain features may be unreliable in some databases.
  • a set of features can be chosen for each such pairwise integration that is well-chosen for that pair of anonymized healthcare databases.
  • the additional information provided by the multiple (N>2) databases is then leveraged in the subsequent refinement step, which in some embodiments does not rely upon the features.
  • a longitudinal feature is defined by a pair of timestamped events for a single anonymized patient in an anonymized healthcare database which are separated by a time interval Δt between the timestamps of the events.
  • Such longitudinal features are well-defined even in an anonymized healthcare database in which the anonymization process introduces a random, but rigid, shift of all timestamps for each patient, since the rigid time shift does not affect the time intervals Δt between events.
  • N anonymized healthcare databases 10 are denoted "Database 1", "Database 2", ..., "Database N", respectively.
  • the anonymized healthcare databases 10 are generated by a suitable anonymization process (not shown), which is preferably automated (e.g. computer-implemented, with the computers programmed to remove certain classes or types of data) in order to anonymize large databases, e.g. a million patient entries or more in some embodiments.
  • the anonymization may also include some manual processing, for example to remove certain rare patients or to address other unusual situations.
  • the anonymization processes employed to generate the N anonymized databases may in general be different, and/or may or may not anonymize the same information.
  • Each anonymization process preferably anonymizes personally identifying information (PII) that can immediately identify a patient, such as patient names, patient addresses, social security numbers, or so forth, as well as information that could potentially be PII in combination with other information, such as hospital name, zip code, or so forth.
  • PII personally identifying information
  • Where information may be PII in combination with other information, it may be sufficient to anonymize only a portion of the combination. For example, the combination of zip code, gender, and date of birth may be personally identifying - but by anonymizing only the zip code information, acceptable patient anonymity may be achieved.
  • the anonymization process(es) may optionally also remove rare information that could be identifying for certain patients, such as any age over a certain maximum, e.g. 90 years old, and/or diagnoses that are not among a list of common diagnoses, and/or so forth.
  • the anonymization of a particular datum can be done by removing the data (redaction) or by replacing the data with a placeholder, the latter being preferable in situations where correlations with that particular type of information are desirably retained, albeit with anonymization.
  • medical care unit e.g. hospital or care unit
  • entries may be replaced by placeholders that are internally consistent within a given database, but vary essentially randomly between databases.
  • the hospital "Blackacre General Hospital" may be always replaced by the placeholder, e.g. "8243", while "Whiteacre Community Medical Center" may be always replaced by the placeholder "1238".
  • every instance of medical care unit "Blackacre General Hospital" in Database 1 is replaced by the (same) placeholder medical care unit "8243" and every instance of medical care unit "Whiteacre Community Medical Center" in Database 1 is replaced by the (same) placeholder medical care unit "1238".
  • each instance of medical care unit "Blackacre General Hospital" in Database 2 may be replaced by the same placeholder medical care unit "EADF" (which is different from the placeholder "8243" used for Blackacre in anonymized Database 1), and each instance of "Whiteacre Community Medical Center" may be replaced by the same placeholder medical care unit "JSDF" (which again is different from the placeholder "1238" used for Whiteacre in anonymized Database 1).
  • EDF placeholder medical care unit
  • JSDF placeholder medical care unit
  • Such anonymization of medical care units by medical care unit placeholders that are internally consistent within the anonymized database enables a healthcare data analytic process operating on a database to identify correlations with a particular medical care unit while maintaining patient anonymity. For example, if Blackacre has a statistically significantly higher success rate for heart transplants than the average hospital, this will show up in Database 1 (assuming it stores heart transplant outcome data) as a statistically significantly higher success rate for heart transplants performed at anonymized hospital "8243".
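A minimal Python sketch of placeholder anonymization that is internally consistent within one database follows. The function `anonymize_units`, the `care_unit` field name, and the use of random 4-character hexadecimal tokens are assumptions for illustration; the source does not describe how placeholders are generated.

```python
import secrets

def anonymize_units(records, field="care_unit"):
    """Replace each care-unit name with a random placeholder.

    The same name always maps to the same placeholder within this call
    (i.e. within one database); a separate call for another database
    draws fresh, unrelated placeholders.
    """
    mapping = {}   # real name -> placeholder (kept private in practice)
    used = set()   # placeholders already handed out
    anonymized = []
    for rec in records:
        name = rec[field]
        if name not in mapping:
            token = secrets.token_hex(2).upper()
            while token in used:          # avoid two names sharing a token
                token = secrets.token_hex(2).upper()
            used.add(token)
            mapping[name] = token
        anonymized.append({**rec, field: mapping[name]})
    return anonymized, mapping

# Hypothetical records for one database.
database_1 = [
    {"patient": 1, "care_unit": "Blackacre General Hospital"},
    {"patient": 2, "care_unit": "Whiteacre Community Medical Center"},
    {"patient": 3, "care_unit": "Blackacre General Hospital"},
]
anon_1, mapping_1 = anonymize_units(database_1)
```

Patients 1 and 3 receive the same placeholder, so per-hospital statistics remain computable on the anonymized data, while the hospital name itself is no longer present.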
  • some information may be anonymized by redaction, that is, removal.
  • residential address information may be redacted entirely, as this is highly identifying and useful correlations with residential address may not be expected for a typical healthcare data analytic process.
  • address anonymization may be performed by replacing each residential address by a broader geographical area, e.g. the residential city if this city has a sufficiently large population to assure an acceptable level of anonymity.
  • a residential city or county with sufficiently small population may be redacted entirely to avoid retaining "rare" data that could be personally identifying, or may be replaced by a suitably larger geographical unit such as the residential state.
  • the anonymized healthcare databases 10 are generally expected to each be formatted in some structured format, for example in a relational database format or other structured database format, as spreadsheets, searchable column-delimited rich text files, or so forth.
  • one or more of the databases 10 may be an unstructured database, for example storing written text reports on patients, or may have limited structure, e.g. a structured heading providing information such as patient name and demographic information followed by unstructured text reports.
  • natural language processing may be employed to extract structured representations of the database contents, such as bag-of-words representations of text documents.
  • a medical data analytics device includes an anonymized healthcare data source device 12 implemented on a computer 14 (or, more generally, an electronic processor 14), which may for example be a network-based server computer, a cloud computing resource, a server cluster, or so forth.
  • the computer 14 is programmed to perform a database integration process 16 and a patient data retrieval process 18, the latter making use of a set of N(N-1)/2 conversion tables 20.
  • each conversion table is an m×2 conversion table for a pair of databases of the N databases 10.
  • the databases of the pair are denoted as database i and database j, respectively, collectively forming the pair of databases (i,j).
  • Each conversion table is an m×2 table having rows (or, alternatively, columns) for m patients matched in the pair of databases (i,j) by the database integration process 16, and two columns (or, alternatively, rows), one listing the anonymized patient ID in anonymized database i and the other listing the anonymized patient ID in anonymized database j.
  • For N = 2 there is a single pair of databases (i,j).
  • It is also contemplated for the N(N-1)/2 conversion tables 20 to be embodied as a single table, e.g. a concatenation of the N(N-1)/2 tables each of dimension m×2 to form a single m×[N(N-1)] table.
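The count of N(N-1)/2 conversion tables is simply the number of unordered database pairs. A short Python sketch, using hypothetical database names and an assumed dictionary layout for holding one table per pair:

```python
from itertools import combinations

def unique_pairs(database_names):
    """Return every unordered pair (i, j) of distinct databases."""
    return list(combinations(database_names, 2))

# Hypothetical example with N = 4 databases.
names = ["Database 1", "Database 2", "Database 3", "Database 4"]
pairs = unique_pairs(names)

# One (initially empty) m x 2 conversion table per pair; each row would
# hold (patient ID in database i, patient ID in database j).
conversion_tables = {pair: [] for pair in pairs}

n = len(names)
expected_tables = n * (n - 1) // 2   # 6 for N = 4
```

Keying the tables by the pair itself is only one possible layout; the single concatenated table mentioned above would serve the same purpose.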
  • the computer 14 is also programmed to perform the patient data retrieval process 18 to retrieve anonymized patient data from the N anonymized healthcare databases 10 using the N(N-1)/2 conversion tables 20.
  • a query may be submitted to the patient data retrieval process 18 to acquire the value of a query feature for a given patient identified by an anonymized patient ID used in Database 1.
  • the query feature may not be contained in all N databases. If the query feature is contained in only one of the N anonymized healthcare databases then the query feature is retrieved from the (single) anonymized healthcare database containing the query feature. On the other hand, if the query feature is contained in two or more of the N anonymized healthcare databases, then a retrieved value is generated for the query feature from the values of the query feature in the two or more of the N anonymized healthcare databases containing the query feature. This may be done, for example, using a feature accuracy metric for the query feature in the respective anonymized healthcare databases containing the query feature.
  • For example, suppose the query requests the primary diagnosis for patient 49 and Databases 1, 2, and 3 each contain a primary diagnosis field.
  • This provides three values for the primary diagnosis of patient 49 (after conversion of the anonymized patient ID 49 for Databases 2 and 3 using the appropriate m×2 conversion tables).
  • If Databases 1 and 3 are known to have accuracy rates of 97% for primary diagnosis while Database 2 has a much lower accuracy rate (e.g. 71%) for this feature, then the retrieved value is generated as the primary diagnosis obtained from Databases 1 and 3, which are most likely to be accurate.
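The accuracy-based choice in this example can be sketched as follows. The function `retrieve_value`, the tie-breaking rule (most common value among the most accurate databases), and the diagnosis strings are assumptions for illustration; the accuracy figures are those quoted above.

```python
from collections import Counter

def retrieve_value(values, accuracy):
    """Pick a value for a query feature reported by several databases.

    values:   database name -> value found for the (already ID-converted) patient
    accuracy: database name -> accuracy metric for this feature (0..1)
    Keeps only the databases sharing the highest accuracy, then returns
    the most common value among them.
    """
    best = max(accuracy[db] for db in values)
    candidates = [v for db, v in values.items() if accuracy[db] == best]
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical values for patient 49's primary diagnosis.
values = {
    "Database 1": "pneumonia",
    "Database 2": "bronchitis",
    "Database 3": "pneumonia",
}
accuracy = {"Database 1": 0.97, "Database 2": 0.71, "Database 3": 0.97}
diagnosis = retrieve_value(values, accuracy)
```

Here the value from Database 2 is ignored because its accuracy metric for this feature is lower than that of Databases 1 and 3.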
  • various approaches can be used to generate the retrieved value, such as taking the value for the database of the N databases 10 having the highest accuracy metric for that feature, or taking the most common value (e.g.
  • the queries received and processed by the patient data retrieval process 18 may vary depending upon the purpose of the query. For example, it may be desired to obtain the primary diagnosis for all male patients in the age range 30-50 years old - in this case the query might be formulated as a request for the set of primary diagnoses (with an enumeration for each different diagnosis) after appropriate filtering by age and gender.
  • the query result in this case may be the set of data pairs {(diagnosis,count)} where each element (diagnosis,count) stores a text string indicating the diagnosis and a count of the number of patients (after age/gender filtering) having that diagnosis.
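As a rough illustration of such a query result, the following Python sketch filters a small in-memory patient list and counts diagnoses. The record layout, field names, and sample patients are all hypothetical; a real deployment would more likely issue the query against the databases themselves.

```python
from collections import Counter

def diagnosis_counts(patients, gender, min_age, max_age):
    """Return the set {(diagnosis, count)} after age/gender filtering."""
    selected = (
        p["primary_diagnosis"]
        for p in patients
        if p["gender"] == gender and min_age <= p["age"] <= max_age
    )
    return set(Counter(selected).items())

# Hypothetical anonymized patient records.
patients = [
    {"id": 1, "gender": "M", "age": 34, "primary_diagnosis": "asthma"},
    {"id": 2, "gender": "M", "age": 47, "primary_diagnosis": "asthma"},
    {"id": 3, "gender": "F", "age": 41, "primary_diagnosis": "asthma"},
    {"id": 4, "gender": "M", "age": 62, "primary_diagnosis": "diabetes"},
    {"id": 5, "gender": "M", "age": 30, "primary_diagnosis": "diabetes"},
]
result = diagnosis_counts(patients, gender="M", min_age=30, max_age=50)
```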
  • the N databases 10 are relational databases
  • the patient data retrieval process 18 may be implemented as a Structured Query Language (SQL) query engine that receives SQL queries.
  • SQL Structured Query Language
  • the healthcare data analytics device further includes a healthcare data analytics tool 22 implemented on a computer 24 (or, more generally, an electronic processor 24), which may for example be a network-based server computer, a cloud computing resource, a server cluster, a desktop computer (as illustrated), or so forth.
  • the computer 24 includes or is operatively connected with one or more display components/devices 26 and one or more user input components/devices such as an illustrative keyboard 28, a mouse or other pointing device 30, a touch-sensitive overlay of the display 26, and/or so forth.
  • the healthcare data analytics tool 22 performs various healthcare analytics such as (by way of illustrative example): assessing insurance coverage for a certain medical procedure; determining survival rates for a medical procedure; assessing demographic correlations with types of medical care most commonly provided to the patient; or so forth.
  • a user operates the user input device(s) 28, 30 to configure the type of analytic to be performed; the healthcare data analytics tool 22 retrieves appropriate data from the anonymized databases 10 via the patient data retrieval process 18 of the anonymized healthcare data source device 12 and performs the chosen analytical analyses on that data; and the results are presented on the display component(s) 26 as graphical representations or the like, e.g. plotting insurance coverage for a procedure as a histogram binned by date interval, or as a pie chart showing insurance coverage for a procedure with slices corresponding to different insurance companies; or plotting survival rate as a function of geographical location; et cetera.
  • the illustrative anonymized healthcare data source device 12 is shown in FIGURE 1 as being implemented on the computer 14, while the healthcare data analytic tool 22 is shown in FIGURE 1 as being implemented on the different computer 24.
  • the anonymized healthcare data source device and the healthcare data analytic tool may be implemented on a single computer.
  • Other hardware segmentation topologies are also contemplated, e.g. the databases integration process 16 and the patient data retrieval process 18 could be implemented on different computers.
  • the disclosed functionality of the healthcare data analytics device as described herein may be embodied as a non-transitory storage medium storing instructions that are readable and executable by an electronic processor 14, 24 to perform the disclosed functionality.
  • the non-transitory storage medium may, for example, comprise a hard disk drive or other magnetic storage medium, an optical disk or other optical storage medium, a flash memory, read-only memory (ROM), or other electronic storage medium, various combinations thereof, or so forth.
  • N is at least three and more generally N could be any positive integer greater than or equal to three.
  • a (first) pair of anonymized healthcare databases (i,j) are selected from the N databases 10.
  • an illustrative example is described for matching patients in the chosen databases (i,j).
  • inclusion/exclusion criteria are applied to select the database portions to match.
  • the subsets of the two databases that are possibly related are extracted. For example, if Database i covers only the data of Medical-surgical and Burn-Trauma ICU patients, from Database j, the subset of patients who were admitted to Medical-surgical and Burn-Trauma ICU wards during their hospitalizations are extracted (i.e. included) while data from other areas that do not overlap Database i are excluded.
  • the excluded/included data is determined by the overlap for the particular database pair (i,j) and may differ for different pairs.
  • a set of features is identified for use in integrating the database pair (i,j).
  • a set of non-uniquely identifying features is selected with which Database i and Database j can be reliably integrated.
  • the selected features are each contained in both databases i and j of the pair of databases (i,j).
  • the selected features are optionally chosen based on available information on reliability. For example, if it is known that one of the databases is relatively inaccurate in terms of patients' gender records, but both Database i and Database j are accurate in terms of body weight records, then body weight is suitably chosen as a feature, and gender is suitably not chosen as a feature.
  • FIGURE 3 shows a table of features for three anonymized healthcare Databases X, Y, and Z, tabulating the accuracy as a percentage for each feature in each database.
  • the last three rows of the table shown in FIGURE 3 indicate whether each feature should be selected as the set of features for the indicated database combination i-j.
  • FIGURE 3 indicates that Databases X and Y are both accurate in the recording of race, mortality, length of stay, age and body weight, and so these five features are selected for matching Databases X and Y.
  • the set of features race, length of stay, age, primary diagnosis, and body weight is suitably chosen to integrate Database X and Database Z; and the set of features: gender, race, length of stay, age, and body weight is suitably chosen to integrate Database Y and Database Z.
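The per-pair feature selection can be sketched in Python as a threshold on per-database accuracy. The accuracy percentages below are invented for illustration (the actual values of FIGURE 3 are not reproduced in this text) and were chosen only so that the selections agree with the three feature sets listed above; the 0.9 threshold is likewise an assumption.

```python
# Hypothetical accuracy table (fraction correct per feature per database).
ACCURACY = {
    "X": {"gender": 0.70, "race": 0.95, "mortality": 0.96, "length of stay": 0.98,
          "age": 0.99, "primary diagnosis": 0.97, "body weight": 0.94},
    "Y": {"gender": 0.96, "race": 0.93, "mortality": 0.95, "length of stay": 0.97,
          "age": 0.99, "primary diagnosis": 0.71, "body weight": 0.95},
    "Z": {"gender": 0.95, "race": 0.94, "mortality": 0.65, "length of stay": 0.96,
          "age": 0.98, "primary diagnosis": 0.97, "body weight": 0.93},
}

def select_features(db_i, db_j, threshold=0.9):
    """Features present in both databases and accurate enough in both."""
    acc_i, acc_j = ACCURACY[db_i], ACCURACY[db_j]
    return [
        f for f in acc_i
        if f in acc_j and acc_i[f] >= threshold and acc_j[f] >= threshold
    ]

features_xy = select_features("X", "Y")
features_xz = select_features("X", "Z")
features_yz = select_features("Y", "Z")
```

Each pair ends up with its own feature set, which is the point of integrating pairwise rather than across all N databases at once.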
  • the set of features chosen in the operation 44 is used to match patients in Databases i and j.
  • Various approaches can be used. In a straightforward approach, a match is found between two patients in Database i and Database j, respectively, if a threshold fraction (or number) of available values for features of the set of features match.
  • the matching can apply different weights to the different features based on factors such as the likelihood of having an erroneous recorded feature value in the database, the selectivity of the feature, and so forth.
  • each patient in Database i is represented by a feature vector whose elements store the values of the set of features selected in operation 44
  • each patient in Database j is represented by a feature vector whose elements store the values of the set of features selected in operation 44.
  • Some of these values may be blank (e.g. the vector stores a <null> or other placeholder).
  • Any approach for computing the similarity of two such feature vectors can be used to compare patients and identify similar patients in the two databases. For example, if the number of features is F then a suitable similarity measure may be the distance between the two feature vectors p_i and p_j given by:
  • p_i and p_j are feature vectors representing a patient being compared in Database i and a patient being compared in Database j, respectively, and p_i(f) represents the value of the f-th feature for patient p_i and likewise p_j(f) represents the value of the f-th feature for patient p_j.
  • Any missing features can be dealt with in various ways, such as simply omitting them from the sum forming D(p_i, p_j) (and scaling 1/F accordingly), or assigning some default value for p_i(f) − p_j(f) in the case of a missing feature f. It is to be appreciated that the foregoing is merely an illustrative example and that substantially any other comparison formalism may be used to identify matching patients in the respective Databases i and j.
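One possible distance of this general kind is sketched below in Python. The exact form of D(p_i, p_j) is not reproduced in this text, so the mean absolute difference used here is an assumption, as are the function name `distance` and the restriction to numeric feature values. Both ways of handling missing features mentioned above are shown.

```python
def distance(p_i, p_j, default=None):
    """Mean absolute per-feature difference between two feature vectors.

    Missing values are represented by None. If `default` is None, features
    missing in either vector are omitted and the 1/F scaling uses only the
    features actually compared; otherwise `default` is used as the
    difference for a missing feature.
    """
    diffs = []
    for a, b in zip(p_i, p_j):
        if a is None or b is None:
            if default is not None:
                diffs.append(default)
            continue
        diffs.append(abs(a - b))
    if not diffs:
        return float("inf")   # nothing comparable: treat as maximally distant
    return sum(diffs) / len(diffs)

# Hypothetical numeric feature vectors; the second feature is missing in p_i.
d_omit = distance([1.0, None, 3.0], [2.0, 5.0, 3.0])
d_default = distance([1.0, None, 3.0], [2.0, 5.0, 3.0], default=1.0)
```

In practice features with different units would need scaling or weighting before being combined this way, in line with the weighting remarks above.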
  • the cross-database patient matches identified in the operation 46 are tabulated in a patient ID conversion table for the database pair (i,j).
  • this table may be an m×2 table such as:
  • the storage is by way of duplicate entries for Database i Patient ID 5, which has the advantage of facilitating sorting the table on either the patient IDs of Database i or the patient IDs of Database j.
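A small Python sketch of an m×2 conversion table stored as rows with duplicate entries follows. Apart from Database i Patient ID 5, which the text mentions, all patient IDs are hypothetical, and the helper name `lookup` is an assumption.

```python
# Each row is (patient ID in Database i, patient ID in Database j).
# Patient 5 of Database i matched two patients of Database j, so the
# ID 5 appears in two rows rather than in one row with a list.
table_ij = [(5, 17), (2, 40), (5, 23)]

# Because every row is a flat pair, the table sorts on either column.
by_i = sorted(table_ij, key=lambda row: row[0])
by_j = sorted(table_ij, key=lambda row: row[1])

def lookup(table, patient_id, column=0):
    """Return all IDs in the other database linked to `patient_id`."""
    other = 1 - column
    return [row[other] for row in table if row[column] == patient_id]

matches_for_5 = lookup(table_ij, 5)              # IDs in Database j
match_for_40 = lookup(table_ij, 40, column=1)    # IDs in Database i
```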
  • In a decision operation 50, the processing repeats for each unique pair of databases (i,j) in the set of N databases 10 being integrated, in order to generate a patient ID conversion table for each unique pair of databases (i,j).
  • the output of the N(N-1)/2 loop iterations is the N(N-1)/2 conversion tables for the N(N-1)/2 unique database pairs of the N databases 10. In some embodiments, this is the final output providing the N(N-1)/2 conversion tables 20 (each of dimensions m×2) used by the patient data retrieval process 18. However, if the database integration process 16 terminates at this point then information from the multiple (three or more) healthcare databases (i.e. N≥3) is not effectively leveraged to improve the individual m×2 pairwise conversion tables.
  • a refinement operation 52 is performed after the N(N-1)/2 conversion tables are constructed, which refines the N(N-1)/2 conversion tables based on consistency of patient matching between the N(N-1)/2 conversion tables.
  • the refinement operation 52 does not use the sets of features identified in the iterations of the operation 44 - rather, the refinement operation 52 is performed as diagrammatically shown in FIGURE 4, by taking into account the expected consistency between the N(N-1)/2 conversion tables.
  • each circle represents a single anonymous patient labeled with his/her anonymized patient ID (e.g.
  • Patient 1 in Database X is linked to Patient 22 in Database Y based on the X-Y conversion table. To maintain consistency, both Patient 1 in Database X and Patient 22 in Database Y should be linked to the same patient in Database Z.
  • Such consistency analysis could be performed during the iterative loop 40, 42, 44, 46, 48, 50.
  • This approach can reduce processing time for performing later loop iterations by leveraging the already-created pairwise conversion tables. For example, consider the case of N = 3 with the databases indexed X, Y, and Z, and with the iterative loop 40, 42, 44, 46, 48, 50 being performed to create the X-Y, X-Z, and Y-Z conversion tables in that order. After creation of the X-Y and X-Z conversion tables it may thereby be known that Patient 10 of Database X is linked to Patient 11 of Database Y, and that Patient 10 of Database X is also linked to Patient 15 of Database Z.
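The consistency idea for three databases can be sketched as below. The helper names are hypothetical, and treating an implied Y-Z link as a candidate to check (rather than as a confirmed match) is an assumption of this example, not a statement of how the refinement operation 52 is implemented.

```python
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[int, int]


def implied_links(xy: Iterable[Link], xz: Iterable[Link]) -> Set[Link]:
    """Y-Z links implied by sharing the same Database X patient."""
    z_by_x: Dict[int, Set[int]] = {}
    for x, z in xz:
        z_by_x.setdefault(x, set()).add(z)
    return {(y, z) for x, y in xy for z in z_by_x.get(x, ())}


def inconsistent_links(
    xy: Iterable[Link], xz: Iterable[Link], yz: Iterable[Link]
) -> Set[Link]:
    """Implied Y-Z links that the Y-Z conversion table does not contain."""
    return implied_links(xy, xz) - set(yz)


# Patient 10 (X) -> 11 (Y) and 10 (X) -> 15 (Z), as in the example above;
# the remaining rows are made-up illustration data.
xy = [(10, 11), (1, 22)]
xz = [(10, 15), (1, 7)]
yz = [(11, 15), (22, 8)]
candidates = implied_links(xy, xz)
conflicts = inconsistent_links(xy, xz, yz)
```

In this data the implied link between Patient 11 of Database Y and Patient 15 of Database Z is confirmed by the Y-Z table, while the implied link (22, 7) is flagged because the Y-Z table pairs Patient 22 with a different Database Z patient. Computing `candidates` before building the Y-Z table is one way the earlier tables could narrow the search in later loop iterations.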
  • A longitudinal feature is defined by a pair of timestamped events for a single anonymized patient in an anonymized healthcare database which are separated by a time interval Δt between the timestamps of the events.
  • Such longitudinal features are well-defined even in an anonymized healthcare database in which the anonymization process introduces a random, but rigid, shift of all timestamps for each patient, since the rigid time shift does not affect the time intervals Δt between events.
  • The longitudinal feature is defined by an event of type e followed by an event of type f which are separated by time interval Δt.
  • A Patient m in Database X has an occurrence of an event of event type e followed by an occurrence of an event of event type f which are separated by time interval Δt.
  • A Patient n in Database Y has an occurrence of an event of event type e followed by an occurrence of an event of event type f which are separated by the same time interval Δt.
  • A Patient p in Database Z has an event of event type e followed by an event of event type f; however, the time interval between the events of types e and f, respectively, is much greater than the time interval Δt.
  • The Patient m in Database X matches the Patient n in Database Y but does not match the Patient p in Database Z. In matching such longitudinal features it is contemplated to allow for some variation in Δt for the patients in different databases to account, for example, for possible errors in entry of the timestamps.
  • The allowable variation in Δt may be large enough that practically the longitudinal feature is matched if the events of types e and f occur in sequence regardless of the time interval between them (within some limit defined by the allowable variation in Δt).
  • The illustrative longitudinal features employ the time interval Δt between events, rather than comparing timestamps of events for patients in the two databases (i,j). As discussed previously, this approach relying upon time intervals between events, rather than relying on absolute timestamps of events, is robust against the possibility that the patient timeline was rigidly shifted by a random amount as part of the anonymization process.
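As a rough sketch of longitudinal feature matching, suppose each patient timeline is a list of (event type, timestamp in days) pairs. The function names, the day-based timestamps and the tolerance parameter `tol` are assumptions made for this illustration; the interval between an event of type e and a later event of type f is what the text calls Δt.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (event type, timestamp in days; any origin)


def intervals(events: List[Event], e: str, f: str) -> List[int]:
    """Intervals from each event of type e to each later event of type f."""
    return [
        t2 - t1
        for kind1, t1 in events
        for kind2, t2 in events
        if kind1 == e and kind2 == f and t2 > t1
    ]


def has_longitudinal_feature(
    events: List[Event], e: str, f: str, dt: int, tol: int
) -> bool:
    """True if some e-then-f pair is separated by dt, within +/- tol days."""
    return any(abs(d - dt) <= tol for d in intervals(events, e, f))


def shifted(events: List[Event], offset: int) -> List[Event]:
    """Rigid shift of a whole timeline, as an anonymizer might apply."""
    return [(kind, t + offset) for kind, t in events]


patient_m = [("e", 100), ("f", 130)]  # Database X
patient_n = [("e", 500), ("f", 531)]  # Database Y, differently shifted
patient_p = [("e", 10), ("f", 400)]   # Database Z, much longer interval

m_ok = has_longitudinal_feature(patient_m, "e", "f", dt=30, tol=2)
n_ok = has_longitudinal_feature(patient_n, "e", "f", dt=30, tol=2)
p_ok = has_longitudinal_feature(patient_p, "e", "f", dt=30, tol=2)
```

Because only differences of timestamps are used, `intervals(shifted(patient_m, 1000), "e", "f")` equals `intervals(patient_m, "e", "f")`, which is the robustness to rigid time shifts described above. Setting `tol` very large reduces the test to checking that e and f occur in sequence.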
  • The longitudinal features are treated like other features of the set of features identified in operation 44 and used in operation 46 (see FIGURE 2).
  • The rather high specificity of longitudinal features means they can be highly discriminatory for matching patients.
  • The patient matching operation 46 is initially performed without reliance upon longitudinal features, with the longitudinal features being computed and leveraged only for difficult matches (e.g., a patient in Database X that matches more than one patient in Database Y when only the non-longitudinal features are used).
  • The non-longitudinal feature matching is performed (or is performed in part) using a universal patient ID (or UID) for each patient.
  • The UID is constructed as a concatenation of a set of common features such as the patient's gender, race, age, and body weight.
  • For example, the UID 1518170 for a patient could be generated from the following features: Male or Gender 1 (the first digit of 1518170); Native American or Race 5 (the second digit of 1518170); Age of 18 years (the third and fourth digits of 1518170); and body weight of 170 pounds (the fifth, sixth, and seventh digits of 1518170).
  • A UID is assigned to the patient record. Since the UID is feature-based, it should be the same across different anonymized databases. Optionally, some tolerance is accepted, e.g. Age of 80 in Database II is considered to be the same as Age of 79-81 in Database I, when using the tolerance threshold of ±1 year for Age.
  • Such a UID approach for feature matching may be employed for all features of the set of features used to match the patient, or alternatively a smaller sub-set of features may be concatenated to form the UID, where the set of features forming the UID are common to all N databases 10.
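The UID construction in the example above (one digit for gender, one for race, two for age, three for weight) can be sketched as follows. The function names and the fixed-width layout are assumptions taken from that single example; in particular, a two-digit age field cannot represent ages of 100 or more, so a real scheme would need its own field widths and code tables.

```python
from typing import Tuple


def make_uid(gender_code: int, race_code: int, age_years: int, weight_lb: int) -> str:
    """Concatenate fixed-width fields: G R AA WWW (7 digits in total)."""
    return f"{gender_code:d}{race_code:d}{age_years:02d}{weight_lb:03d}"


def parse_uid(uid: str) -> Tuple[int, int, int, int]:
    """Split a 7-digit UID back into (gender, race, age, weight)."""
    return int(uid[0]), int(uid[1]), int(uid[2:4]), int(uid[4:7])


def uids_match(uid_a: str, uid_b: str, age_tol: int = 1) -> bool:
    """Exact match on gender, race and weight; age may differ by age_tol."""
    g_a, r_a, age_a, w_a = parse_uid(uid_a)
    g_b, r_b, age_b, w_b = parse_uid(uid_b)
    return (g_a, r_a, w_a) == (g_b, r_b, w_b) and abs(age_a - age_b) <= age_tol


uid = make_uid(gender_code=1, race_code=5, age_years=18, weight_lb=170)
```

With the one-year tolerance, a record with age 80 in one database matches a record with age 79, 80 or 81 in another, provided the remaining fields agree; applying a tolerance to body weight as well would be a natural extension but is not shown here.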
  • The process of integrating the N anonymized healthcare databases 10 can be viewed as an anonymized population image reconstruction method to reconstruct an anonymized population image from the N anonymized healthcare databases 10.
  • The reconstructed anonymized population image comprises contents of the N anonymized healthcare databases 10 integrated by the N(N-1)/2 conversion tables 20.
  • The anonymized population image reconstruction method reconstructs (or transforms) population imaging data in the form of the N anonymized healthcare databases 10 into the anonymized population image comprising the contents of the N anonymized healthcare databases 10 integrated by the N(N-1)/2 conversion tables 20.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

An electronic processor (14) is programmed to perform integration (16) of N anonymized healthcare databases (10). For a database pair (i, j) of the N anonymized healthcare databases, a set of features is identified (44), each contained in both databases i and j of the database pair (i, j). A conversion table is generated (46, 48) that matches patients of the database pair based on patient similarity as measured by the set of features. The identifying and generating operations are repeated (50) for each unique database pair of the N anonymized healthcare databases to generate N(N-1)/2 conversion tables (20). The electronic processor is further programmed to perform a patient data retrieval process (18) that receives a patient ID of a patient in one of the N anonymized healthcare databases and retrieves patient data for the patient contained in the N anonymized healthcare databases using the N(N-1)/2 conversion tables.
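The retrieval step summarized in the abstract can be pictured with the sketch below, which looks up a patient's linked IDs in every pairwise conversion table that involves the starting database. The dictionary layout keyed by database pair, the function name `linked_ids`, and the example IDs are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Tables = Dict[Tuple[str, str], List[Tuple[int, int]]]


def linked_ids(tables: Tables, db: str, patient_id: int) -> Dict[str, Set[int]]:
    """Map each database name to the patient IDs linked to (db, patient_id)."""
    result: Dict[str, Set[int]] = {db: {patient_id}}
    for (i, j), rows in tables.items():
        if i == db:
            hits = {b for a, b in rows if a == patient_id}
            other = j
        elif j == db:
            hits = {a for a, b in rows if b == patient_id}
            other = i
        else:
            continue  # this table does not involve the starting database
        if hits:
            result.setdefault(other, set()).update(hits)
    return result


# Hypothetical tables for N = 3 databases X, Y and Z.
tables = {
    ("X", "Y"): [(1, 22), (10, 11)],
    ("X", "Z"): [(10, 15)],
    ("Y", "Z"): [(11, 15)],
}
found = linked_ids(tables, "Y", 11)
```

Given the IDs in `found`, the patient's records could then be fetched from each database under its own local ID; how the records themselves are fetched is outside the scope of this sketch.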
EP17720392.4A 2016-04-19 2017-04-19 Hospital matching of de-identified healthcare databases without obvious quasi-identifiers Withdrawn EP3446245A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662324363P 2016-04-19 2016-04-19
PCT/EP2017/059266 WO2017182509A1 (fr) 2016-04-19 2017-04-19 Hospital matching of de-identified healthcare databases without obvious quasi-identifiers

Publications (1)

Publication Number Publication Date
EP3446245A1 (fr) 2019-02-27

Family

ID=58645023

Family Applications (1)

Application Number Title Priority Date Filing Date
EP17720392.4A Withdrawn EP3446245A1 (fr) 2016-04-19 2017-04-19 Hospital matching of de-identified healthcare databases without obvious quasi-identifiers

Country Status (5)

Country Link
US (1) US20190147988A1 (fr)
EP (1) EP3446245A1 (fr)
JP (1) JP6956107B2 (fr)
CN (1) CN109074858B (fr)
WO (1) WO2017182509A1 (fr)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200024743A (ko) * 2017-07-07 2020-03-09 파나소닉 아이피 매니지먼트 가부시키가이샤 Information providing method, information processing system, information terminal, and information processing method
WO2019189969A1 (fr) * 2018-03-30 2019-10-03 주식회사 그리즐리 Method for anonymizing big-data personal information and method for combining the anonymized data
US20200117833A1 (en) * 2018-10-10 2020-04-16 Koninklijke Philips N.V. Longitudinal data de-identification
US20220215129A1 (en) * 2019-05-21 2022-07-07 Nippon Telegraph And Telephone Corporation Information processing apparatus, information processing method and program
US12045366B2 (en) 2019-05-21 2024-07-23 Nippon Telegraph And Telephone Corporation Information processing apparatus, information processing method and program
US12067150B2 (en) 2019-05-21 2024-08-20 Nippon Telegraph And Telephone Corporation Information processing apparatus, information processing method and program for anonymizing data
US11641346B2 (en) 2019-12-30 2023-05-02 Industrial Technology Research Institute Data anonymity method and data anonymity system
US11670406B2 (en) * 2020-04-29 2023-06-06 Fujifilm Medical Systems U.S.A., Inc. Systems and methods for removing personal data from digital records
US20220351814A1 (en) * 2021-05-03 2022-11-03 Udo, LLC Stitching related healthcare data together
CN114579626B (zh) * 2022-03-09 2023-08-11 北京百度网讯科技有限公司 Data processing method, data processing apparatus, electronic device, and medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6440066B1 (en) * 1999-11-16 2002-08-27 Cardiac Intelligence Corporation Automated collection and analysis patient care system and method for ordering and prioritizing multiple health disorders to identify an index disorder
US20020073138A1 (en) * 2000-12-08 2002-06-13 Gilbert Eric S. De-identification and linkage of data records
US7519591B2 (en) * 2003-03-12 2009-04-14 Siemens Medical Solutions Usa, Inc. Systems and methods for encryption-based de-identification of protected health information
CN1759413A (zh) * 2003-03-13 2006-04-12 西门子医疗健康服务公司 System for accessing patient information
US7543149B2 (en) * 2003-04-22 2009-06-02 Ge Medical Systems Information Technologies Inc. Method, system and computer product for securing patient identity
JP4183725B2 (ja) * 2006-11-27 2008-11-19 株式会社野村総合研究所 Database utilization system and database utilization program
US9355273B2 (en) * 2006-12-18 2016-05-31 Bank Of America, N.A., As Collateral Agent System and method for the protection and de-identification of health care data
JP2009070096A (ja) * 2007-09-12 2009-04-02 Michio Kimura Integrated database system for genome information and clinical information, and method for producing the database it comprises
US8799282B2 (en) * 2007-09-28 2014-08-05 International Business Machines Corporation Analysis of a system for matching data records
US8612258B2 (en) * 2008-10-31 2013-12-17 General Electric Company Methods and system to manage patient information
US8898798B2 (en) * 2010-09-01 2014-11-25 Apixio, Inc. Systems and methods for medical information analysis with deidentification and reidentification
US10607726B2 (en) * 2013-11-27 2020-03-31 Accenture Global Services Limited System for anonymizing and aggregating protected health information
US20150193583A1 (en) * 2014-01-06 2015-07-09 Cerner Innovation, Inc. Decision Support From Disparate Clinical Sources
JP5649756B1 (ja) * 2014-08-08 2015-01-07 株式会社博報堂Dyホールディングス Information processing system and program
US20160085915A1 (en) * 2014-09-23 2016-03-24 Ims Health Incorporated System and method for the de-identification of healthcare data

Also Published As

Publication number Publication date
CN109074858A (zh) 2018-12-21
JP2019514128A (ja) 2019-05-30
WO2017182509A1 (fr) 2017-10-26
CN109074858B (zh) 2023-08-18
US20190147988A1 (en) 2019-05-16
JP6956107B2 (ja) 2021-10-27

Similar Documents

Publication Publication Date Title
US10818383B2 (en) Hospital matching of de-identified healthcare databases without obvious quasi-identifiers
US20190147988A1 (en) Hospital matching of de-identified healthcare databases without obvious quasi-identifiers
US20200265931A1 (en) Systems and methods for coding health records using weighted belief networks
JP5952835B2 (ja) Imaging protocol update and/or recommender
US20170147753A1 (en) Method for searching for similar case of multi-dimensional health data and apparatus for the same
US11361020B2 (en) Systems and methods for storing and selectively retrieving de-identified medical images from a database
US20150142821A1 (en) Database system for analysis of longitudinal data sets
US20200013491A1 (en) Interoperable Record Matching Process
US20200251196A1 (en) Systems and methods for sorting findings to medical coders
US20210202111A1 (en) Method of classifying medical records
JP2022541588A (ja) Deep learning architecture for analyzing unstructured data
WO2017081580A1 (fr) Integrating and/or adding longitudinal information to a de-identified database
US20230147366A1 (en) Systems and methods for data normalization
CN109522331B (zh) Individual-centered regionalized multi-dimensional health data processing method and medium
US20180268925A1 (en) Method for integrating diagnostic data
Mannino et al. Development and evaluation of a similarity measure for medical event sequences
WO2022101928A1 (fr) Method for improving clinical documentation using a knowledge graph
Lequertier et al. Predicting length of stay with administrative data from acute and emergency care: an embedding approach
EP3654339A1 (fr) Method of classifying medical records
Yee et al. Big data: Its implications on healthcare and future steps
Dilli Babu et al. Improved Algorithm for Proficient Storing and Retrieving of Medical Data Records in A Data Lake
Candia-Véjar et al. ML models for severity classification and length-of-stay forecasting in emergency units
PALANISAMY et al. MEDI-NET: CLOUD-BASED FRAMEWORK FOR MEDICAL DATA RETRIEVAL SYSTEM USING DEEP LEARNING
Foudeh et al. Information extraction from handwritten medical records and assigning ICD-10 codes
CN116434902A (zh) Data retrieval method, apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20181119

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: KONINKLIJKE PHILIPS N.V.

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20210331

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20211001