WO2023056547A1 - Data governance system and method

Data governance system and method

Info

Publication number
WO2023056547A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
component values
dataset
data component
key
Application number
PCT/CA2022/051436
Other languages
French (fr)
Inventor
Kenneth William Scott Morrison
Alex POPOV
Nikita RYNKEVICH
Keith Elliston
Michael Douglas Anthony WILLIAMS
Original Assignee
Fuseforward Technology Solutions Limited
Application filed by Fuseforward Technology Solutions Limited
Publication of WO2023056547A1

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/20: Information retrieval of structured data, e.g. relational data
              • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
          • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
            • G06F 21/60: Protecting data
              • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
                • G06F 21/6218: Protecting access to a system of files or objects, e.g. local or distributed file system or database
                  • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
                    • G06F 21/6254: Protecting personal data by anonymising data, e.g. decorrelating personal data from the owner's identification
          • G06F 2221/00: Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
            • G06F 2221/21: Indexing scheme relating to G06F21/00 and subgroups, addressing additional information or applications relating to security arrangements
              • G06F 2221/2141: Access rights, e.g. capability lists, access control lists, access tables, access matrices
      • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16H: HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
          • G16H 10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
            • G16H 10/60: ICT for patient-specific data, e.g. for electronic patient records

Definitions

  • the present disclosure relates to scalable, secure and policy-compliant distributed data storage systems, and, in particular, to a privacy-preserving data governance system and method.
  • Data de-identification is one technique that helps overcome legislative hurdles. Algorithms exist to anonymise a dataset so that it can be distributed to researchers without triggering a privacy violation. This might include techniques such as removing patient names, which are considered primary identifiers. However, quasi-identifiers also commonly exist in data records. These can often be used alone or in combination with other identifiers to re-identify an individual. For example, a simple record containing a birthdate and US zip code can be tied to an individual person over 50% of the time by correlating with public databases. Therefore, quasi-identifiers should be transformed to make them more ambiguous by lowering their fidelity. Substituting age for a birthdate is a common example that in most cases does not overtly degrade other underlying statistical relationships in the data.
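  • By way of non-limiting illustration of such quasi-identifier transformations, the following Python sketch (helper names and sample values are illustrative only, not part of the disclosure) generalises a birthdate into an age and truncates a US zip code to lower its fidelity while preserving coarse statistical utility:

```python
from datetime import date

def generalize_birthdate(birthdate: date, today: date) -> int:
    """Replace a birthdate (a quasi-identifier) with an age in whole years."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

def generalize_zip(zip_code: str) -> str:
    """Truncate a US zip code to its 3-digit prefix, making it more ambiguous."""
    return zip_code[:3] + "XX"

record = {"name": "Jane Doe", "birthdate": date(1988, 4, 11), "zip": "90210"}
deidentified = {
    # the primary identifier ("name") is removed entirely
    "age": generalize_birthdate(record["birthdate"], date(2022, 10, 6)),
    "zip_prefix": generalize_zip(record["zip"]),
}
print(deidentified)  # {'age': 34, 'zip_prefix': '902XX'}
```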
  • Some organisations might generate synthetic data using capabilities in a generalised health data platform.
  • certain products include features to generate synthetic data from real data so researchers can export and share a safe (though synthetic) dataset with collaborators.
  • these products are designed as generalised data analysis platforms; they do not govern distribution of synthetic data files or manage an orderly transition of a researcher through all steps of the research lifecycle from development through to validation.
  • a privacy-preserving data management system for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the system comprising a plurality of network-accessible hardware storage resources, each of the hardware storage resources being in network communication and configured for distributed storage of a source dataset, the source dataset comprising a plurality of source data objects each comprising constituent genuine data component values that are associated with a corresponding data subject.
  • the system further comprises a digital data processor for receiving and responding to the data request, the digital data processor being communicatively linked to a network via a communication bus, the digital data processor configured to generate a plurality of synthetic data component values preserving, at least in part, one or more relationships between the genuine data component values amongst at least some of the plurality of source data objects, store the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generate a context-specific dataset, wherein the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some synthetic data component values depending on, at least in part, the permitted access privilege.
  • the synthetic data component values are generated using a generative model.
  • the digital processor is further configured to generate de-identified data component values corresponding to at least some of the genuine data component values, and replace in the context-specific dataset at least some of the genuine data component values with the corresponding de-identified data component values depending on, at least in part, the permitted access privilege.
  • replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
  • the context-specific dataset comprises at least some of the genuine data component values of the source dataset.
  • the context-specific dataset is generated before the data request is received.
  • the permitted access privilege is based on an estimated likelihood a given data object in the context-specific dataset can be associated with one of the identifiable data subjects.
  • the permitted access privilege is based on one or more access permissions associated with the data requestor.
  • the term “at least some synthetic data component values” means at least a designated threshold of synthetic data component values, depending on the permitted access privilege.
  • each of the network-accessible hardware storage resources further comprises a key-value store configured to store a unique key-value logical row for each of the data objects.
  • each key-value logical row comprises a key, a metadata descriptor, and a data object identifier.
  • the key-value logical row comprises at least one of authorization information, data sensitivity information, or timestamp information.
  • the key-value logical row comprises a key-value logical row access authorisation value for restricting access to the corresponding key-value logical row, the authorisation value based at least in part on the permitted access privilege.
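  • As a non-limiting sketch of the threshold behaviour summarised in the preceding claim summaries (the privilege tiers and fractions below are hypothetical assumptions, not part of the disclosure), a context-specific view might interleave genuine and synthetic component values to a designated threshold set by the permitted access privilege:

```python
import random

# Hypothetical mapping from permitted access privilege to the fraction of
# rows that must be replaced with synthetic component values.
PRIVILEGE_THRESHOLDS = {
    "development": 1.0,  # fully synthetic view
    "validation": 0.5,   # mixed view
    "trusted": 0.0,      # genuine values permitted
}

def context_specific_dataset(genuine_rows, synthetic_rows, privilege, seed=0):
    """Interleave genuine and synthetic rows to the designated threshold."""
    threshold = PRIVILEGE_THRESHOLDS[privilege]
    rng = random.Random(seed)
    return [s if rng.random() < threshold else g
            for g, s in zip(genuine_rows, synthetic_rows)]
```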
  • a computer-implemented privacy-preserving data management method for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the method implemented on a data management system comprising a digital processor for receiving the data request and a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of the plurality of source data objects comprising constituent genuine data component values that are associated with a corresponding data subject.
  • the method comprises generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects, storing the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset.
  • the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some of the synthetic data component values depending on, at least in part, the permitted access privilege.
  • the synthetic data component values are generated using a generative model.
  • the method further comprises generating de-identified data component values corresponding to at least some of the genuine data component values, and replacing in the context-specific dataset at least some of the genuine data component values with the corresponding de-identified data component values depending on, at least in part, the permitted access privilege.
  • replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
  • the context-specific dataset is generated to comprise at least some of the genuine data component values from the source dataset.
  • generating the context-specific dataset is done before the data request is received.
  • the access privilege is based on an estimated likelihood a given data object in the context-specific dataset can be associated with one of the identifiable data subjects.
  • the permitted access privilege is based on one or more access permissions associated with the data requestor.
  • the term “at least some synthetic data component values” means at least a designated threshold of synthetic data component values, depending on the permitted access privilege.
  • the access privilege comprises a risk acceptability threshold (RAT).
  • a computer-readable medium having stored thereon instructions for execution by a computing device for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the computing device being in network communication with a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of the source data objects comprising constituent genuine data component values that are associated with a corresponding data subject.
  • the instructions are executable to automatically implement the steps of generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects, storing the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset.
  • the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some of the synthetic data component values based at least in part on the permitted access privilege.
  • the synthetic data component values are generated using a generative model.
  • the steps further comprise generating de-identified data component values corresponding to at least some of the genuine data component values, and storing the de-identified data component values, wherein the context-specific dataset is generated to further comprise at least some of the de-identified data component values depending on, at least in part, the permitted access privilege.
  • replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
  • the context-specific dataset is further generated to comprise at least some of the genuine data component values from the source dataset.
  • generating the context-specific dataset is done before the data request is received.
  • the permitted access privilege is based on an estimated likelihood a given data object in the context-specific dataset can be associated with one of the identifiable data subjects.
  • the permitted access privilege is based on one or more access permissions associated with the data requestor.
  • the term “at least some synthetic data component values” means at least a designated threshold of synthetic data component values, depending on the permitted access privilege.
  • the permitted access privilege is a risk acceptability threshold (RAT).
  • Figure 1 is a schematic diagram illustrating a privacy-preserving data access system, in accordance with one embodiment
  • Figures 2A and 2B are schematic diagrams illustrating a privacy-preserving data access method using the system of Figure 1, in accordance with one embodiment
  • Figures 3 and 4 are schematic diagrams illustrating exemplary use cases of the method of Figures 2A and 2B, in accordance with different embodiments.
  • Figure 5 is a schematic diagram illustrating exemplary machine learning approaches, in accordance with various embodiments.
  • elements may be described as “configured to” perform one or more functions or “configured for” such functions.
  • an element that is configured to perform or configured for performing a function is enabled to perform the function, or is suitable for performing the function, or is adapted to perform the function, or is operable to perform the function, or is otherwise capable of performing the function.
  • the systems and methods described herein provide, in accordance with different embodiments, different examples of a data governance platform that provides the ability to create different views of data comprising interleaved synthetic data and de-identified data that are derived from the actual or live data, as a function of the contextual requirements relating to the data and/or data consumer.
  • Such requirements often include privacy compliance but may also include other administrative, analytics, management, or use-related requirements.
  • de-identification of data is a well-established technology. Nevertheless, it is easy to do wrong, with often catastrophic results for patient privacy.
  • de-identified datasets are generated as separate, distinct files that are managed separately from the original real data from which they were derived.
  • Synthetic data generation is a maturing technology that has made its way from academic journals to trade engineering publications. There are also dramatic examples of the use of synthetic data on the Internet showing its application to various datasets, including, for instance, human faces.
  • synthetic data may be broadly understood as relating to ‘artificial’ data that is generated from real data, while maintaining or replicating underlying statistics of the real data.
  • Figure 5 shows an exemplary taxonomy of generative models 500 that may be employed to this end, in accordance with different embodiments.
  • generative models 500 may relate to explicit density models 502, which may in turn relate to tractable density models 504 or approximate density models 506.
  • tractable density models 504 often relate to fully visible belief nets 508 (e.g. NADE, MADE, PixelRNN models, or the like)
  • approximate density models 506 may comprise variational models 510 (e.g. variational autoencoders 512, or VAEs 512) or Markov chain models 514 (e.g. Boltzmann machines 516).
  • Implicit density models 518 may comprise direct generative models 520, such as generative adversarial networks (GANs) 522, or implicit density Markov chain models 524 (e.g. a GSN 526). While various embodiments herein described relate to the generation of synthetic data using GANs 522 or VAEs 512, it will be appreciated that various other embodiments relate to the generation of synthetic data using alternative machine learning (ML) or generative models 500, non-limiting examples of which are schematically shown in Figure 5. In accordance with yet other embodiments, synthetic data may be generated in accordance with or by a deep learning platform.
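  • For illustration only, the following minimal PyTorch sketch trains a toy GAN on a real-valued feature matrix standing in for genuine data component values; the layer sizes, learning rates and random stand-in data are assumptions, and any of the other generative models 500 of Figure 5 could be substituted:

```python
import torch
from torch import nn

n_features, latent_dim = 8, 16
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(256, n_features)  # stand-in for genuine component values
ones, zeros = torch.ones(256, 1), torch.zeros(256, 1)

for step in range(200):
    # discriminator: label real rows 1, generated rows 0
    fake = G(torch.randn(256, latent_dim)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator: try to make generated rows look real to the discriminator
    g_loss = bce(D(G(torch.randn(256, latent_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic = G(torch.randn(100, latent_dim)).detach()  # synthetic dataset
```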
  • a generative model such as a GAN may replicate the underlying statistics of real data in an artificial dataset
  • such techniques generate data as a file that may be distributed using conventional means, such as FTP, email, a shared file system, or the like.
  • governance, e.g. the control of data throughout its lifecycle by an administrative authority
  • access control to ensure that only the authorised party can access data, control copies of data, audit data usage, delete data after it is no longer needed, show data provenance or file versioning, or the like.
  • a data administrator can easily apply a GAN on a real dataset to create a new, synthetic dataset.
  • the result would likely be a simple file of data that mirrors the structure of the original, but contains no actual records from real people. But once this is given to a researcher, there is a loss of context and control.
  • the synthetic data passes from the control of the administrator to the control of the researcher. This handoff is the source of a number of problems, despite the fact that the data contains no actual sensitive personal information.
  • the first set of problems relates to the governance of the synthetic data.
  • Synthetic data is typically disconnected from the original dataset from which it was derived (or from a closely related dataset). The loss of this relationship creates a number of problems. Orphaned synthetic datasets have no provenance. Provenance is very important, as researchers need to be confident that there is a well-documented path of transformations, queries and filters taken by the dataset they are working on. This could include multiple steps, from original acquisition, to cleanup (aka Data Wrangling), filtering, joining with other datasets, de-identification, synthetic data generation, or the like.
  • Provenance is very important because it documents a chain of steps that should be reproducible should any questions arise around data integrity in the course of research or after publication of results. It is also common for publications to require publication of datasets (subject to privacy issues), and it is crucial that a researcher have confidence in their ability to reproduce resulting data whenever this is mandated.
  • a synthetic data file may also have nothing to identify it as synthetic data, which could inadvertently trigger a HIPAA or GDPR investigation if it is stolen. There may be no way to version the files and roll back if an error is found.
  • the researcher’s use of a casually distributed synthetic dataset is not subject to any kind of audit. They could make infinite copies and distribute these however they like. There is no way to cut off a researcher’s access if the relationship breaks down, as the data may reside on backups outside of the control of the data administrator. There is no way to delete or pull back a dataset once it has been distributed.
  • the systems, devices and methods described herein implement the abovementioned data governance of real, de-identified and synthetic datasets, and promote the relationship therebetween. They enable a binding relationship between real, de-identified, and synthetic data, as well as other related and derived works. This relationship can be used to better govern these datasets in a way that promotes use of data without violating security and privacy requirements. For example, a synthetic data file of cholesterol tests, and a real dataset from which it is derived (which would demand special handling because it contains PHI), have a direct relationship, even though they contain different data. There is a morphologic similarity between the two.
  • a system or method may provide views of either a synthetic dataset, a real dataset, or hybrid combination of the two, to an authorised researcher. These views can be managed by a data owner or administrator, who is responsible for guiding a researcher to use data that are appropriate for where they are in their research timeline. These views may reflect a researcher’s immediate relationship with the data owners.
  • system 100 is directed to providing data governance or management capabilities over the lifecycle of real, de-identified and/or synthetic data together as related elements. This provides for the management of sensitive data, balanced with accessibility for researchers to promote valid use of the data.
  • system or platform 100 generally comprises a computing device 101, the device comprising at least one digital data processor 104 communicatively linked to accessible memory 106 and a communication bus 108, the communication bus 108 itself configured to be in network communication with a data requestor 110 and a plurality of remote independent network-accessible hardware storage resources 112.
  • each of the hardware storage resources 112 has stored thereon at least one dataset.
  • these datasets will include a plurality of data objects, each having corresponding constituent data element values.
  • this may include a source dataset 114 comprising source data objects 118, each source data object 118 comprising genuine data component values 120 which correspond to or are associated with a corresponding identifiable data subject 122 (i.e. name, address, postal code, etc.).
  • these genuine data component values 120 may also include privacy-sensitive information, e.g. protected health information (PHI) such as birth dates, medical test results, medical images, or the like, which may comprise “raw” or processed values, or other types of privacy-sensitive information or data (e.g. financial information, or the like).
  • access to a source dataset 114 will be under the supervision or administration of a data owner or administrator 116, which has full control of the parameters under which the source dataset 114 may be accessed by the data requestor 110 via system 100.
  • system 100 allows data owners 116 to find their own balance between protecting sensitive data and promoting research interests via increased access to data.
  • the digital data processor 104 may be configured to respond to data storage requests received over a network and relating to the data objects 118.
  • the network communications interface communicatively interfaces one or more requesting users (e.g. data requestor 110) and a key-value store stored on one or more of the hardware storage resources 112.
  • a key-value store may be configured to store a unique key-value logical row for each constituent data object component of each data object; each such row may comprise, in accordance with some embodiments, a key, a constituent data component value, and a metadata descriptor.
  • At least one of the key-value logical rows for a given data object may be directly associated with source data, and at least one of the key-value logical rows of the given data object is derived from one or more other key-value logical rows.
  • the digital data processor 104 may generate an independent dataset from the key-value store by accessing those key-value logical rows having metadata descriptors responsive to the data access request.
  • data owner 116 may take the form of any individual and/or private or public organisation (companies, administrative bodies, governmental agencies, etc.) which has ownership of source dataset 114. Generally, data owner 116, via system 100, has full control of the access permissions or entitlements given to data requestor 110. In some embodiments, those access permissions or entitlements may be determined or allocated for a given source data usage instance (e.g. a data request from a data requestor 110 for a data object related to an identifiable data subject 122). For example, a particular data requestor 110 may be assigned an access privilege based on, for instance, the level of trust that the data owner 116 has in the requestor 110.
  • Such an access privilege may be based on, for instance, a position or role of the requestor (e.g. doctor, hospital administrator, financial auditor, or the like), a level of trust that has been otherwise established between the data owner 116 and the requestor 110, and/or a perceived or quantifiable likelihood that data access for a particular requestor may lead to reidentification of any data with an identifiable data subject 122. Additionally, or alternatively, an access permission may be assigned to a source data usage instance based on, for instance, the stage of a research lifecycle associated with a particular data usage instance.
  • a source data usage instance may comprise a query or other form of data request, or a plurality thereof.
  • a source data usage instance may comprise multiple simultaneous data requests under the same source data usage instance.
  • a data usage instance may comprise multiple subsequent data requests associated with, for instance, a particular stage of a research lifecycle.
  • a data requestor 110 may, in accordance with some embodiments, request access to data to evaluate and/or train a health science model. Upon determination of a result from a first iteration of their model, the requestor 110 may then adjust model parameters, and again request access to data to test their updated model. This process may, in accordance with some embodiments, be repeated within the same source data usage instance.
  • Such requests may be constrained by the same access permissions.
  • data requestor 110 and data owner 116 may negotiate for different access privileges or permissions, upon which a subsequent source data usage instance may be associated with these new privileges or permissions.
  • these access privileges, permissions, or entitlements may accordingly be based at least in part on a trust threshold associated with the requestor 110, or a stage of research associated with the source data usage instance.
  • an access privilege associated with a source data usage instance may, additionally or alternatively, relate to an estimated likelihood that a given data object accessed in response to a request may be associated with one of the identifiable data subjects, as further described below. Accordingly, an access privilege may, in accordance with some embodiments, relate to a degree of obfuscation (i.e. de-identification) or generation of synthetic data corresponding to genuine data accessed in response to a data request. Such a likelihood of re-identification may be based on any one or more privacy preserving processes, a non-limiting example of which may include a differential privacy process.
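  • One non-limiting way to estimate such a likelihood of re-identification (a k-anonymity-style heuristic, offered here as an assumption rather than the disclosed process) is to treat each record's risk as 1/k, where k is the size of its equivalence class over the quasi-identifiers, and compare the worst case against a risk acceptability threshold (RAT):

```python
from collections import Counter

def reidentification_risk(rows, quasi_identifiers):
    """Per-record risk estimate: 1/k for an equivalence class of size k."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]

def within_rat(rows, quasi_identifiers, rat):
    """True if the worst-case estimated risk does not exceed the RAT."""
    return max(reidentification_risk(rows, quasi_identifiers)) <= rat

rows = [{"age": 34, "zip_prefix": "902XX"}, {"age": 34, "zip_prefix": "902XX"},
        {"age": 61, "zip_prefix": "100XX"}]
print(within_rat(rows, ["age", "zip_prefix"], rat=0.5))  # False: a class of size 1
```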
  • a given source data object 118 of a source dataset 114 may correspond to, for example, a patient record, or indeed a patient.
  • at least some if not all of the available data may be ingested as individual discrete portions of data, along with a metadata descriptor of each portion.
  • a key may be associated with the entry for, in part, future identification.
  • a key-value store may comprise logical rows, wherein each logical row comprises an individual portion of the source data, a constituent data component value (the “value”), an identifier (the “key”), a metadata descriptor, a data object identifier, and optionally, in accordance with different embodiments, additional management information, such as authorisation, sensitivity, or other compliance and/or timestamp information.
  • within each logical row of the key-value store, the combination of a constituent data component value and a key identifier may also be referred to as a key-value pair.
  • the collection of all logical rows for a given data object may comprise the digital asset, which may also include the source data. However, in many embodiments, there may be a logical row associated with the source data; e.g. a patient record in a text file or PDF format.
  • a data object may, in some embodiments, be considered broader than the data asset, and may refer to all information, whether existing or potential, regarding any entity, such as a patient, hospital, doctor, bank, transaction, etc.
  • a first logical row may consist of an object ID relating to the patient, a unique identifier (the key), a metadata descriptor of “source data”, and a value corresponding to the patient record data file itself. From the source data file, additional logical rows may be created for every discrete portion of source data.
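  • A minimal sketch of this ingestion, assuming Python and an illustrative LogicalRow shape (the field and descriptor names are assumptions, not the disclosed schema), might create one row for the source file and one per discrete portion:

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass
class LogicalRow:
    key: str         # unique identifier for this logical row
    object_id: str   # data object (e.g. patient) the row belongs to
    descriptor: str  # metadata descriptor, e.g. "source data", "name"
    value: object    # constituent data component value, or a reference to it

def ingest(object_id: str, source_bytes: bytes, fields: dict) -> list:
    rows = [LogicalRow(str(uuid4()), object_id, "source data", source_bytes)]
    for descriptor, value in fields.items():  # one row per discrete portion
        rows.append(LogicalRow(str(uuid4()), object_id, descriptor, value))
    return rows

rows = ingest("patient-42", b"%PDF-...",
              {"name": "John", "birthdate": "1988-04-11"})
```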
  • Additional logical rows can then be derived from the existing logical rows, as well other applicable information.
  • derived logical rows corresponding to existing logical rows can be generated that aggregate or obfuscate existing logical rows.
  • any of the existing logical rows either imported (i.e. ingested) or derived (i.e. curated), can be provided along with - or excluded from - access requests associated with the derived logical row.
  • the value-portion of a given logical row may be the actual value (“raw” data, images, or the like), or it may be a reference, direct or indirect, to the value and/or storage location of the value.
  • a key-value store may be employed for granular governance and flexible curation of digital assets.
  • Embodiments hereof can receive unstructured or structured data as an input.
  • the input data could be acquired from a patient record, a financial record or other type of record and can come in several formats such as PDF, CSV or other types of electronic or non-electronic inputs.
  • a key-value store is a data storage structure designed for storing, retrieving, and managing associative arrays; it contains a collection of objects or records, which in turn have different fields within them, each containing data.
  • the data included in a data collection will have related attributes so that the data can be stored, retrieved and managed in an efficient manner; this data collection can be derived, generated or calculated during or after curation.
  • These records are stored and retrieved using a key identifier that uniquely identifies the record, and is used to quickly find data within a database.
  • disclosed implementations of the key-value store allow, as will be discussed below, the generation of context-specific datasets that are generated from the key-value store itself (keeping in mind that, in some embodiments, the “value” portion of a logical row can be the associated piece of data, or a reference thereto).
  • Such generated datasets may be based on further utilisation of additional descriptors and indicators, depending on the data access request.
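  • A sketch of such context-specific generation (again in Python, with an assumed row shape carrying a visibility-style "auth" tag) might filter the key-value store by both the requested metadata descriptors and the requestor's authorisations:

```python
from collections import namedtuple

Row = namedtuple("Row", "key object_id descriptor value auth")

def generate_dataset(store, wanted_descriptors, authorizations):
    """Assemble a dataset directly from key-value logical rows whose
    descriptors match the request and whose auth tags are held."""
    objects = {}
    for row in store:
        if row.descriptor in wanted_descriptors and row.auth in authorizations:
            objects.setdefault(row.object_id, {})[row.descriptor] = row.value
    return list(objects.values())

store = [Row("k1", "patient-42", "age_range", "30-35", "DE-IDENTIFIED"),
         Row("k2", "patient-42", "name", "John", "PHI")]
print(generate_dataset(store, {"age_range", "name"}, {"PUBLIC", "DE-IDENTIFIED"}))
# [{'age_range': '30-35'}]  (the PHI-tagged name row is withheld)
```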
  • the source dataset 114 may comprise any type of source data in various formats, including PDF files, text files, CSV, database information, images, or spreadsheet documents, that is extracted and stored as a data object comprising a key-value logical row, which comprises at least constituent data component values and the associated metadata descriptors.
  • the data object is associated with source dataset 114, as well as all other logical rows that have been or may be created. Multiple and separate records relating to a data object, e.g. a patient, may constitute an example where a data object may be associated with more than one source dataset.
  • metadata of the source data may be collected, derived, or formulated, and stored as key-value logical rows, each with its unique key, constituent data component values, and associated metadata descriptor.
  • the metadata associated with a given logical row is a type of data that describes and gives information about the data to which the logical row pertains.
  • the metadata could be “raw data”, “file type”, “patient ID”, “name”, with the value associated therewith, as extracted from the source data or derived from other data, stored in the same logical row.
  • Each collected, derived, or formulated key-value entry is stored in the key-value data store as a key-value logical row, the rows collectively forming a data asset or a portion thereof.
  • metadata include the name of the file, the type of the file, the time the file was stored, the source (e.g. raw) data itself, and the information regarding who stored the file.
  • the collected information may be parsed and saved in a key-value store as a key-value logical row with its respective key for unique identification, constituent data component value, and metadata descriptors.
  • the source data may be parsed for acquisition of metadata.
  • the acquired metadata are stored in the key-value store with respective key for unique identification, constituent data component value, and metadata descriptors.
  • the metadata preliminarily derived may be saved as key-value logical rows in the key-value store, wherein key-value logical rows may collectively form a data object associated with source data.
  • First name, last name, type of disease, date of financial transaction, and age are non-limiting examples of acquired data.
  • derived metadata may be derived from other logical rows, including either source data, acquired data from the source data, or other derived data.
  • the metadata associated with derived logical rows are stored in the key-value store as part of the key-value logical rows with the logical row unique identifier (such unique identifier being a unique key), a data object identifier, and constituent data component value.
  • system 100 may be further configured such that metadata may be employed to formulate and output a context- and/or requestor-specific dataset.
  • a dataset may be generated from a key-value store by accessing only obfuscated logical rows, as well as other rows of lower sensitivity (or rows meeting other access criteria); accordingly, a derived dataset that is separate from the source data, or even the key-value store data, is specifically produced for a certain context, and that context may be determined or created by generating specific types of logical rows based on predetermined metadata.
  • Another example may include a patient dataset where a derived logical row includes an age range, or first three digits of a postal code, and the resulting derived dataset is generated by accessing all non-identifying information regarding disease types and outcomes for a group of patients along with the aforementioned derived logical row. Without providing access to the source data, an analysis of the dataset can be performed, wherein disease frequency by age or location can be assessed without giving any direct access to sensitive information.
  • dataset creation can be dynamic and compliant irrespective of the type of information stored regarding data objects.
  • the key-value store paradigm may be used to provide granular access control to the data.
  • the use of a key-value store such as Accumulo provides cell-level security with a visibility field in the key.
  • the key-value store paradigm is a data model that stores source data in a key-value pair and metadata values in the same logical row as additional key-value pairs.
  • the column visibility field is used to store data attributes related to governance or compliance rules specified by the user and/or data owner.
  • a constituent data component value may comprise stored digital information directly, or point to a location in storage where the digital information is stored.
  • the metadata descriptors may be formed in response to the data access request.
  • the data access request would comprise pre-determined metadata descriptors and new metadata descriptors specified either by a system administrator or an end-user (i.e. a request for a specific use and/or context).
  • the pre-determined metadata descriptors are the result of processing the source data; these functions are sometimes referred to as data processing functions (DPF). Each data processing function may be associated with a specific timestamp or version for all of the components that result from the processing.
  • This associated timestamp may be included in the key-value store, and may be similar to a version control feature.
  • this version control feature can allow for version roll back to a previous processed state and/or specific application of rules or data management of a processed dataset.
  • Such timestamps can provide a mechanism to assess how a dataset changed over time as the state of the dataset can be assessed as it was at any point in time.
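  • A minimal sketch of this as-of-time behaviour, assuming each component keeps a sorted history of (timestamp, value) pairs (an illustrative layout, not the disclosed storage format):

```python
import bisect

def value_as_of(history, timestamp):
    """Return the value in effect at `timestamp`, enabling roll-back to the
    state of a component after any given data processing function run."""
    times = [t for t, _ in history]
    i = bisect.bisect_right(times, timestamp) - 1
    return history[i][1] if i >= 0 else None

history = [(100, "raw"), (200, "cleaned"), (300, "de-identified")]
assert value_as_of(history, 250) == "cleaned"  # state between DPF runs
assert value_as_of(history, 50) is None        # before first ingestion
```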
  • the data can be accessed directly through an application programming interface (API), which can be a set of routines, protocols, and/or tools for building software applications.
  • These direct access requests may occur through a library call for programmatic access in data science or a call through a representational state transfer (REST) API when accessing the data for an application.
  • a query using these examples of direct data access may trigger a distributed routine to collect the data across various nodes.
  • the data may be accessed through a manufactured dataset, and may use the distributed computing capability of various software tools (Accumulo, Spark, or the like) on the cluster to create batch jobs that use metadata descriptors to assemble the necessary dataset and to generate the dataset into the format requested.
  • this dataset may be exported to a specified location to meet governance, privacy and/or compliance requirements.
  • the process of authorisation regarding data access requests may be simplified for administrators through the use of tags, attributes, and expressions, which provide the ability to specify tags, attributes or expressions on the data at a high level.
  • using the Accumulo software will provide users with a visibility field that allows the use of arbitrary attributes such as PHI, PUBLIC, and DE-IDENTIFIED, which can then be assigned to users/groups for authorisation.
  • directory servers such as Microsoft’s Active Directory (AD) may enable a user to link users/groups to authorisations.
  • a customer may define a rule to a group called “researchers” in a specified AD location, such as “researcher authorisation allows you to see data with attributes PUBLIC and DE-IDENTIFIED”.
  • the Accumulo infrastructure allows user attributes identified for users/groups to be defined and used in the same way; this attribute-based access control would authorise users/groups/AD with particular attributes to access data with particular attributes.
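  • In the same spirit, the following simplified Python evaluator (an assumption for illustration; real Accumulo visibility expressions also support, and require, parentheses when mixing operators) checks a visibility field of OR-ed clauses of AND-ed attributes against a user's authorisations:

```python
def visible(expression: str, authorizations: set) -> bool:
    """Evaluate a simplified column-visibility expression ('|' over '&',
    no parentheses) against the requesting user's authorisation set."""
    if not expression:
        return True  # unlabelled data is visible to everyone
    return any(all(term.strip() in authorizations
                   for term in clause.split("&"))
               for clause in expression.split("|"))

assert visible("PUBLIC|DE-IDENTIFIED", {"DE-IDENTIFIED"})
assert not visible("PHI&RESEARCHER", {"RESEARCHER"})
```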
  • authorisation decisions may be made via a policy engine, such as the open-source Open Policy Agent (OPA) engine, or similar.
  • directories may be used only as sources of information or data, and/or as an authentication server (i.e. for validating username/password combinations).
  • the policy engine when combined with, for example, the Accumulo infrastructure, may be configured to authorise access by assessing a pre-defined policy (e.g. a rule set) based on attributes of the data requestor making the request (i.e. name, group memberships, security clearance, location, etc.), and metadata associated with the data to be accessed (i.e. sensitivity, data owner’s name, time of acquisition, etc.).
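  • As a sketch of such a policy-engine check (the shape of OPA's REST data API is real, but the host, the policy path "governance/allow" and the attribute names are assumptions for illustration):

```python
import json
from urllib.request import Request, urlopen

def authorize(requestor_attrs: dict, resource_meta: dict) -> bool:
    """Ask an OPA-style policy engine for an allow/deny decision based on
    requestor attributes and the metadata of the data to be accessed."""
    payload = json.dumps({"input": {"requestor": requestor_attrs,
                                    "resource": resource_meta}}).encode()
    req = Request("http://localhost:8181/v1/data/governance/allow",
                  data=payload, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp).get("result", False)

# e.g. authorize({"group": "researchers", "clearance": "DE-IDENTIFIED"},
#                {"sensitivity": "DE-IDENTIFIED", "owner": "hospital-a"})
```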
  • the employment of a key-value store permits the storage of, and operation on, at least four types of data, collected or derived, when source data is received or exists in the key-value store: metadata descriptive of the source data (e.g. the source data file itself, file name, file type, file size, etc.); metadata derived from the source data (e.g. patient name data from the corresponding patient name field within the source data file); metadata derived from the preliminarily derived metadata (e.g. a predetermined category, such as age group, where the value for such a derived logical row is determined from another existing logical row whose metadata descriptor is age); and governance metadata (e.g. retention policies, authorisation, owner, etc.).
  • the metadata derived from the source data may be referred to as the tokenisation of the original data; this refers to any operation on data associated with a data object, including other logical data rows, in order to protect, analyze, or generate new data from the existing source data or generated data at a granular level.
  • This tokenisation can include obfuscation, aggregation, computation, and the application of filters.
  • the key-value store therefore allows formulation of datasets and access thereto based on context- and requestor-specific characteristics.
  • Each key-value logical row may be assigned a unique key for identification.
  • all key-value logical rows associated with a given set of source data may be assigned a unique key for identification.
  • all key-value logical rows associated with a data object may be assigned a unique key for identification.
  • the system may assign a unique key identifier, grouping the metadata associated with the source data as a single logical entity, or grouping the metadata associated with a data object associated with at least one source dataset as a single logical entity.
  • examples of the metadata descriptors for each collected or derived datum include an accessibility authorisation and/or sensitivity descriptor, time-sequenced information, and temporal-/locality-based associations.
  • key values can be used for, among other reasons, identifying, locating, and securing access to data objects. Because data can be indexed and accessed based on the existence of certain metadata: (1) data can be quickly accessed and located based on the existence of specified metadata within the key-value store; (2) derived datasets can be generated directly from the key-value store; and (3) regulatory and administrative compliance can be enforced at a data storage layer (as opposed to at an application layer).
  • the plurality of data storage resources 112 exist in network communication and are configured for distributed storage of a plurality of data objects, wherein each said data object comprises a plurality of constituent data object components.
  • An example of the plurality of data objects includes a set of data related to or derived from either unstructured or structured data received by the system as an input.
  • a constituent data object component includes each set of data that forms a part of the data object, and may be generated automatically, derived under system command, or formulated based on unique requests.
  • one or more digital processors 104 have a data object key-value store accessible thereto, wherein the data object key-value store stores a unique key-value logical row for each constituent data object component.
  • each constituent data object component is stored in the data object key-value store as a unique key-value logical row.
  • each key-value logical row comprises: a key for uniquely identifying the key-value logical row; a constituent data object component value for providing component information relating to the constituent data object component associated with the key-value logical row; and a metadata descriptor for describing a data object component characteristic of the constituent data object component value.
  • An example of a key for uniquely identifying the key-value logical row includes a unique identifier for all the data generated, derived, or formulated from an input received.
  • An example of the constituent data object component value may entail actual values for a given constituent data object component; examples of a metadata descriptor include, for instance, name and age.
  • the system may derive at least one of the constituent data object components.
  • the system may further employ at least one of the constituent data object component values and derive at least one constituent data object component.
  • the system may preliminarily derive constituent data object components.
  • the system may further derive other constituent data object components. This operation may be performed by the system upon requests to the processing component, wherein the request triggers access to constituent data object component values comprising metadata descriptors.
  • each key-value logical row embeds additional management information, such as an access authorisation value for restricting access to the constituent data object component values, in response to requests associated with a corresponding authorisation.
  • This access authorisation value can also be a sensitivity tag or other compliance and/or governance information and/or timestamp information.
  • the access authorisation value or sensitivity tag can correspond with a user identity, user role and/or a user group, restricting access to the constituent data object component values.
  • restricted constituent data objects may include patient records, financial data, or proprietary, confidential or sensitive data.
  • user roles, user identity, or user groups may include doctors, researchers, banks, administrators, and underwriters.
  • the restriction of the constituent data object component values will be based on governance and/or compliance rules, such as data retention, storage requirements, and data ownership.
  • rules associated with timestamp information or version control information can be used to restrict access to the constituent data objects. Some examples of using timestamp information may include restricting access to the most recent version of constituent data objects, or limiting access to older versions of constituent data objects.
  • At least one of the constituent data object components for a given key-value logical row is derived from the input source data automatically upon storing the source data associated with the data object in the data storage components.
  • the derived datasets may be associated with a set of pre-determined rules, or data processing functions (DPF), which can be used to produce metadata descriptors to the source data or to add timestamp information or version control.
  • the derivation may take place under pre-determined requests, under data access requests, or by a system administrator, either at run time or at subsequent times.
  • these rules can be created during ingestion of the data or after the data was already ingested.
  • these data processing functions are developed using a general-purpose programming framework, such as Spark and/or MapReduce, which enables curation functions to be run across the constituent data objects.
  • a source dataset 114 may include different formats of documents that may be provided to the data storage system.
  • a context-specific dataset may be generated based on the source dataset 114, in accordance with specific requisitions made of the data storage system.
  • the plurality of network-accessible hardware storage resources is in network communication and configured for distributed storage of data objects.
  • the data objects may include any type of data obtained, derived, formulated, and/or related to the source data itself upon the receipt of the source data by the data storage system.
  • the digital data processor responds to data access requests received over a network relating to the data objects.
  • Data access requests related to data objects stored in the data storage system may come from end-users.
  • the key-value store is stored in the hardware storage and may be composed of a unique key-value logical row for each constituent data component of each of the data objects in the data storage system.
  • a data storage system may contain a number of data objects, which may be composed of constituent data components related to a source dataset.
  • a set of data objects, or a data object may be related to a source dataset provided to the data storage system.
  • the data object may be composed of constituent data components that were received, derived, or formulated at the time of, or subsequent to the receipt of the source data at the data storage system.
  • These constituent data components may include various characteristics and/or information related to the source data itself, the data derived from the source data, or the data formulated from the source data or derived from the source data under given requisitions.
  • Each said unique key-value logical row may comprise a key for uniquely identifying the unique key-value logical row, a constituent data component value, and/or a metadata descriptor.
  • the key for unique identification of the unique key-value logical row may be a value comprising stored digital information.
  • the key may be formulated from the constituent data component associated with the key-value logical row and a metadata descriptor.
  • the key may be a combination or combinations of constituent data component values and metadata descriptors.
  • the constituent data component values may comprise stored digital information relating to the constituent data component associated with the unique key-value logical row. This digital information may be a value directly obtained, derived, or formulated from the source data received.
  • the digital information may store a value indicative of location of where the actual value is stored. Examples of the digital information include actual first name (e.g. John) and a pointer value to a designated location in a data storage.
  • the metadata descriptor may describe metadata of the constituent data component value. Metadata may generally comprise data information that provides information about other data. In some embodiments, metadata describes a resource for purposes such as discovery and identification, including elements such as title, abstract, author, and keywords.
  • metadata describes containers of data and indicates how compound objects are put together, non-limiting examples of which may include types, versions, relationships, and/or other characteristics of digital materials.
  • metadata provides information to help manage a resource, such as when and how it was created, a file type or other technical information, and/or who can access it.
  • At least one key-value logical row for a given data object is directly associated with the dataset, and at least one key-value logical row for the given data object is derived from one or more other key-value logical rows.
  • Examples of directly associated key-value logical rows may include the data obtained at the time of the receipt of the source data, such as file name and file type, and the data derived at the run time or at subsequent times, such as first name and last name.
  • key-value logical rows derived from one or more other key-value logical rows may be derived based on end-user requisitions.
  • the key-value logical row derived from one or more other key-value logical rows may be derived at the direction of a data administrator of the data storage system.
  • one or more digital data processors 104, in response to a given data access request based on a given metadata descriptor, generate an independent dataset via a key-value store by accessing those key-value logical rows having metadata descriptors responsive to the data access request.
  • the given metadata descriptor may be pre-determined by the system administrator or customised by the end-user.
  • the metadata descriptors may include metadata descriptors created at run time, at subsequent times when the key-value logical rows were derived or formulated, or when a requisition based on the given metadata descriptor is made.
  • a key-value logical row comprises an access authorisation value for restricting access to the corresponding key-value logical row.
  • the access authorisation value may be stored digital information.
  • the access authorisation value may be a combination or combinations of constituent data component values and metadata descriptors.
  • the access authorisation value may be employed for generation of the independent dataset, in response to a given data access request, allowing control over the information accessed and the independent dataset generation.
  • factors that may be associated with the access authorisation include a requesting user identity, a requesting user role, a requesting user group, the constituent data component of the corresponding key-value logical row, the source datasets from which the corresponding key-value logical row originated, and/or the metadata descriptor of the corresponding key-value logical row.
  • access authority may be determined at the time of the data access request, or at subsequent times, based on the above-noted factors.
  • At least some of the key-value logical rows are automatically generated from the source dataset upon importing such source dataset 114 into the data storage system.
  • Examples of the key-value logical rows that are automatically generated from the source dataset upon importing the source dataset into the data storage system include file name and file type.
  • some of the derived key-value logical rows are derived upon a request for such derivation by a user of the data storage system.
  • Examples of the key-value logical rows that are derived upon a request for such derivation by a user of the data storage system include first name and last name.
  • an additional key-value logical row is derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row being generated based on such obfuscation.
  • obfuscation may include a deliberate obscuring of a birth date (e.g. a birth date of April 11, 1988 is obfuscated to an age of 30 to 35 years, or to a birth year of 1988, or the like) so as not to disclose the precise age, while other key-value logical rows related to the same source data remain associated with the obfuscated age key-value logical row, and a generated independent dataset remains available to data access requisitions with the appropriate access authority; a minimal sketch of such an obfuscation is given below.
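By way of illustration only, the following is a minimal sketch, in Python, of one way such a birth-date obfuscation could be realised; the function name and the five-year bucketing are assumptions made for the example, not features prescribed by this disclosure.

```python
from datetime import date

def obfuscate_birth_date(birth_date: date, today: date, bucket: int = 5) -> str:
    """Obfuscate a precise birth date into a coarse age range (e.g. "30-35")."""
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket}"

# e.g. obfuscate_birth_date(date(1988, 4, 11), date(2022, 10, 4)) -> "30-35"
```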
  • an additional key-value logical row may be derived by aggregating the constituent data component values of at least two existing key-value logical rows to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row being generated based on such aggregation.
  • Examples of aggregation include aggregating first name and last name to formulate an additional key-value logical row, with the corresponding metadata descriptor "name".
  • examples of aggregating include aggregation of key-value logical rows related to a data object associated with a source dataset.
  • an additional key-value logical row may be derived through a function-based calculation based on the constituent data component values of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row being generated based on said function-based calculation.
  • said function-based calculation may include a decision-making scheme, mathematical function, or other rules used to produce an additional key-value logical row (and the corresponding metadata) based on existing key-value logical rows; a sketch of both derivations is given below.
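The following non-limiting sketch illustrates both derivations described above, aggregation and a function-based calculation; the field names and the returned (metadata descriptor, value) pairing are hypothetical simplifications of a key-value logical row.

```python
def derive_name_row(rows: dict) -> tuple:
    """Aggregation: combine the "first name" and "last name" rows into an
    additional row whose metadata descriptor is "name"."""
    return ("name", f'{rows["first name"]} {rows["last name"]}')

def derive_bmi_row(rows: dict) -> tuple:
    """Function-based calculation: derive a body-mass-index row from existing
    height and weight rows (field names hypothetical)."""
    bmi = rows["weight_kg"] / rows["height_m"] ** 2
    return ("bmi", round(bmi, 1))

# e.g. derive_name_row({"first name": "Ada", "last name": "Lovelace"})
#      -> ("name", "Ada Lovelace")
```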
  • additional key-value logical rows may likewise be derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, with the corresponding metadata descriptor of the additional key-value logical row generated by such obfuscation.
  • the access authorisation value of the additionally derived key-value logical row may be the same as that of the existing key-value logical rows from which the additional key-value logical row was derived, or different.
  • the access authorisation for additional key-value logical rows may be pre-determined in association with one or more of the following: a requesting user identity, a requesting user role or user group, the constituent data component of the corresponding key-value logical row, the source datasets from which the corresponding key-value logical row originated, and/or the metadata descriptor of the corresponding key-value logical row.
  • the access authorisation for an additional key-value logical row may be determined by the system administrator or the data access requestor.
  • each Hadoop-based cluster node in this exemplary system may comprise computing devices that may be classified as either or both Hadoop master nodes and Hadoop data nodes.
  • each manager node may comprise computing devices that may be classified as either or both Hadoop master nodes and Hadoop data nodes.
  • there may be one manager node, or a plurality thereof; in either case, the master node functionalities described below may be carried out by a single master node, or distributed in various manners across the plurality, and that the subject matter hereof is not limited to systems with two manager nodes.
  • a manager node may carry out the following functions: runs any centralised applications that manage the data storage and access functions (including management of the key-value store); provides the web and/or other (e.g. REST) interface for data administration, privacy, security, and governance functions; hosts any web, proxy, or other server functionality (e.g. NGINX); manages and runs the master key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); runs data analysis or data function applications or libraries (e.g. the PHEMI Data Science Toolkit, Spark, and Zeppelin); manages and runs the slave key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); and hosts backup components for any other manager node in case of critical failure thereof.
  • a data requestor or consumer 110 may include any individual or institution requesting access to source dataset 114. This may include, for example and without limitation, a data researcher desiring access to a source dataset 114 for research purposes, or the like.
  • the plurality of data storage resources 112 may be configured to leverage the capability of a key-value store.
  • Some embodiments may utilise one or more hardware storage devices 112, each of which may in turn comprise storage sub-elements (e.g. a server comprising a plurality of storage blades that each in turn comprise multiple storage elements of the same or different types, such as flash or disk).
  • Very large datasets may be distributed amongst many different local or remote storage elements; they may be closely stored (e.g. on the same device or on directly connected devices, such as different blades on the same server), or they may be highly disparately and remotely stored (e.g. on different, but networked, server clusters). Furthermore, the data stored may be duplicated for a number of reasons, including redundancy and failure handling, as well as efficiency (e.g. to store a copy of information that has been recently used "close" to other required data). Systems and methodologies for managing such large and complex datasets have been developed (e.g. HDFS for Hadoop™).
  • each of said storage resources 112 is generally configured for storing, accessing, and using very large datasets using a key-value store to ingest and store data from a source dataset 114 (e.g. a patient or financial record) in a highly granular fashion.
  • the data can be accessed directly through an application programming interface (API), which can be a set of routines, protocols, and tools for building software applications.
  • These direct access requests may occur through a library call for programmatic access in data science or a call through a representational state transfer (REST) API when accessing the data for an application.
  • a query using these examples of direct data access may trigger a distributed routine to collect the data across various nodes.
  • the data may be accessed through a manufactured dataset, using the distributed compute capability of software tools, such as Accumulo and/or Spark, on the cluster to create batch jobs that use metadata descriptors to assemble the necessary dataset and to generate said dataset in the requested format; a sketch of a direct programmatic request is given below.
  • this dataset may be exported to a specified location to meet governance, privacy and/or compliance requirements.
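For illustration only, the following is a minimal sketch of a direct data access request over a REST API; the endpoint path, query parameter, and bearer-token scheme are assumptions made for the example, not an API defined by this disclosure.

```python
import requests  # endpoint and parameter names below are hypothetical

def fetch_dataset(base_url: str, token: str, metadata_descriptor: str) -> dict:
    """Sketch of a direct REST data access request: the metadata descriptor
    indicates which key-value logical rows should be assembled server-side."""
    response = requests.get(
        f"{base_url}/datasets",
        params={"metadata_descriptor": metadata_descriptor},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```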
  • requestor 110 may be given, instead of raw source data, a view of or access to a context-specific dataset 206, wherein the context-specific dataset 206 is generated (or pre-generated) based on a permitted access permission 208, such as a trust factor associated with the requestor and/or a likelihood of re-identification 210 of the constituent data element values of the data objects contained in the context-specific dataset 206.
  • context-specific dataset 206 comprises one or more data objects (e.g. context-specific data objects 220) that are representative, at least in part, of a source dataset 114, but wherein the constituent data element values therein are restricted, removed, replaced, or obfuscated based on a permitted access privilege (e.g. re-identification risk value 208 specific to this particular data usage instance of data requestor 110 characterising the trust level, and/or an estimated likelihood of re-identification 210 that a given accessed data object can be associated with one of the identifiable data subjects 122).
  • each data object therein (e.g. context-specific data object 220) may comprise a combination or interleaving 222 of different types of data component values, including genuine data component values 120 originally included in source dataset 114, but also de-identified data component values 260 and/or synthetic data component values 270, both of which are, at least in part, derived from said genuine data component values 120, as will be discussed below.
  • the relative number (e.g. a designated number, amount, fraction, type, quantity, or the like) of genuine data component values 120, de-identified data component values 260 and/or synthetic data component values 270 in said context-specific dataset 206 may have an influence on its associated estimated likelihood of re-identification 210; a sketch of such interleaving is given below.
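A non-limiting sketch of such interleaving follows; the synthetic_fraction parameter, standing in for a permitted risk profile, and the per-field fallback logic are assumptions made for the example.

```python
import random

def build_context_row(genuine: dict, deidentified: dict, synthetic: dict,
                      synthetic_fraction: float, rng: random.Random) -> dict:
    """Interleave genuine, de-identified, and synthetic values field by field;
    a higher synthetic_fraction lowers the expected re-identification risk of
    the resulting context-specific row."""
    row = {}
    for field in genuine:
        if rng.random() < synthetic_fraction:
            row[field] = synthetic[field]  # fictitious stand-in value
        else:
            # prefer a de-identified value where one exists, else the genuine one
            row[field] = deidentified.get(field, genuine[field])
    return row
```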
  • source data component values may only be accessible to certain requestors 110, and/or to certain roles or re-identification risk factors 208 thereof.
  • medical images may only be accessible to a doctor (rather than, for instance, a hospital administrator), or to a data scientist with a designated re-identification factor and/or who is requesting data in the final stages of a research lifecycle.
  • a data usage instance refers to the circumstances, requirements, restrictions, and characteristics associated with a use or analysis of a context-specific dataset (and/or the underlying source data). While the data analysis task and even the underlying source data and/or context-specific dataset may remain similar from one institution to the next, the data usage instance for each may be very different. For example, different privacy requirements may apply due to different legislative, regulatory, or policy requirements; different data sensitivity may apply for certain groups of individuals in different institutions; different research ethics boards may impose different requirements over the analysis or use of the data; different persons or organizations may be carrying out the data analysis or use in respect of which different data usage controls or restrictions may be in place.
  • a data usage instance may include an actual and a permitted risk of re-identification associated therewith.
  • the foregoing examples are provided so as to illustrate different possible circumstances that would give rise to a particular data usage instance; different factors may apply (and the foregoing may not) so as to give rise to such a particular data usage instance.
  • a data usage instance may refer to a single data request or, to the extent that a similar set of applicable circumstances apply, a plurality of data requests.
  • a data subject may refer to a person, place, thing, or set of conditions and/or characteristics to which a data object applies.
  • it may refer to an individual (e.g. a patient, customer, insured individual, bank customer); a legal person or association (e.g. a business, corporation, individual, joint venture, etc.); a set of circumstances (e.g. weather conditions at a particular time and place); or other tangible, intangible, or ephemeral person, place, or thing, or characteristics associated therewith.
  • de-identified data component values 260 may be values derived from corresponding genuine data component values 120, whereby the information has been obfuscated, at least in part, so as to render it more difficult to identify the person or entity it pertains to.
  • Different levels of de-identification may be applied based, at least in part, on a target value of the estimated likelihood of re-identification 210.
  • stronger de-identification methods may reduce the precision of the original source data component value.
  • such de-identification may relate to differential privacy processes; one simple form of tunable de-identification is sketched below.
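As a non-limiting sketch of tunable de-identification, illustrating simple generalisation by truncation rather than formal differential privacy; the function and its levels are assumptions made for the example.

```python
def deidentify_zip(zip_code: str, level: int) -> str:
    """Progressively truncate a ZIP code; each level masks two more trailing
    digits, reducing precision and hence the estimated re-identification risk."""
    keep = max(0, len(zip_code) - 2 * level)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

# e.g. deidentify_zip("90210", 1) -> "902**"; deidentify_zip("90210", 2) -> "9****"
```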
  • synthetic data component values 270 are fictitious data component values preserving, at least in part, one or more relationships between the corresponding genuine data component values 120 from which they are derived. In some embodiments, this may include fictitious non-numerical values including textual values (e.g. names, addresses, etc.) that are representative in some way of the corresponding genuine data component values in source dataset 114 (e.g. a conventionally male name, an address from a related area or zip code, or the like). In some embodiments, tabulated numerical values (e.g. medical test values, account numbers, etc.) may be generated that are representative of one or more statistical relationships between the corresponding genuine data component values. For example, this may include, in some embodiments, learning the joint probability distribution of at least a portion of the genuine data component values 120, and generating therefrom corresponding synthetic data component values 270 having the same or a comparable and/or related probability distribution.
  • synthetic data component values 270 may be pre-generated based on said genuine data component values 120 before the data access request is received. In some cases, this pre-generation may be performed before the context-specific dataset 206 is generated. In accordance with different embodiments, different types of synthetic data component values may be generated, including text and numerical data. It will be appreciated that various methods or techniques for generating distributions or ensembles of such synthetic data element values may be used, without restriction. Such methods may include, for instance, machine learning (ML) methods, and/or generative models such as Generative Adversarial Networks (GANs); a crude numerical sketch is given below.
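As a deliberately crude, non-limiting sketch of such pre-generation for numeric columns: fit a joint Gaussian to the genuine values and sample synthetic rows preserving their means and covariances, a simple stand-in for the generative models (e.g. GANs) contemplated above.

```python
import numpy as np

def sample_synthetic(genuine: np.ndarray, n_samples: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Learn a joint Gaussian over genuine numeric columns and draw synthetic
    rows with the same means and covariances; no genuine row is released."""
    mean = genuine.mean(axis=0)
    cov = np.cov(genuine, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# e.g. sample_synthetic(real_matrix, 1000, np.random.default_rng(0))
```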
  • the likelihood of re-identification 210 for the dataset may be estimated only from the de-identified component values 260 as, by themselves, synthetic data component values 270 may have no re-identification risk associated therewith.
  • the determination or estimation of the likelihood of re-identification 210 of the context-specific dataset 206, or elements thereof, may also take into account, at least in part, the presence or number of synthetic data component values therein.
  • the number of synthetic data component values 270 in the context-specific dataset 206 may be based on at least one of the following: the permitted re-identification risk value 208 (e.g. trust level between data owners and requestors), and the estimated likelihood of re-identification 210.
  • access to said context-specific dataset 206 may employ one or more of various authorisation approaches, a non-limiting example of which may include Attribute-based Access Control (ABAC), which has been shown to be highly flexible and scalable.
  • the context-specific dataset 206 based on or derived from source dataset 114 may be generated upon receipt of data access requests by end-users (e.g. data requestor 110).
  • Examples of the data access requests may include specific requests for age range data for all of the data objects in the data storage.
  • Examples of network-accessible hardware storage resources may include spinning disks connected for distributed data storage.
  • a method for generating context-specific datasets 206 may comprise storing a key-value store in one or more said hardware storage resources, directly generating at least one of the key-value logical rows for a given data object from source data (e.g. file name and file type).
  • the key-value store may comprise a unique key-value logical row for each constituent data component of each data object.
  • Constituent data components of each data object, with each data object related to at least one source data may include information about the source data, such as a file name and file type, information derived from the source data, such as first name and last name, and information formulated through aggregation, employing function-based calculations, or responding to data access requests.
  • Each key-value logical row may comprise a key for uniquely identifying the key-value logical row, a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row, and a metadata descriptor describing metadata of a data component value.
  • the key for unique identification may be stored digital information, which may be a combination or combinations of constituent data component values and metadata descriptors describing metadata of a data component value.
  • the constituent data component may be an actual value or a pointer to the location of the storage where the actual value is stored.
  • the key, the constituent data component, and/or the metadata descriptor may be created, derived, or formulated at run time, at subsequent times, as pre-determined, upon data access requests, and/or by a system administrator, in accordance with various embodiments; a sketch of such a logical row is given below.
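For illustration only, the logical-row structure described above may be sketched as follows; the field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class KeyValueLogicalRow:
    key: str                    # uniquely identifies this logical row
    value: Any                  # constituent data component value, or a pointer
                                # to where the actual value is stored
    metadata_descriptor: str    # describes metadata of the value
    access_authorisation: Optional[str] = None  # restricts access to this row
```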
  • At least one of the key-value logical rows for a given data object may be derived from other key-value logical rows.
  • the derivation may take place under pre-determined requests, upon data access requests, or at the direction of a system administrator, and at run time or at subsequent times.
  • a data access request is a request for data, which may be automatic, pre-determined, or user-specific.
  • the data access request may be made by an end user or system administrator.
  • the data access request may be received at the run time, or at subsequent times when the data object exists in the system.
  • a context-specific dataset 206 may comprise multiple key-value logical rows, each having values chosen or selected from the genuine data component values 120, de-identified data component values 260 and/or synthetic data component values 270.
  • the risk or likelihood of re-identification 210 of a dataset or collection may be assessed by determining the likelihood or probability that a given set, row, or value can be correlated to an identifiable individual or subject (e.g. identifiable data subject 122).
  • a given derived dataset can be associated with a risk or likelihood of re-identification 210, wherein such a risk provides an indication of a probability that any given data object within the key-value store that is made part of a derived dataset can be associated with an identifiable individual or subject to which the data object pertains. The higher such probability, the greater the re-identification risk indication. This risk indication may also be increased depending on the nature of the data object.
  • where the data object comprises sensitive personal information, such as, but not limited to, personal health or personal financial information, a factor associated with the risk of re-identification may be increased.
  • the risk of re-identification will decrease if information that is specific to an individual can be withheld from a dataset or obfuscated within a dataset. To the extent that this does not impact, or only minimally impacts, the informational value of a dataset, the re-identification risk can be used to optimally provide informational value while protecting the identity of the subjects of the information within the dataset, in accordance with some embodiments.
  • the re-identification likelihood or risk 210 is a measurement of the likelihood that any data object, data component values, or a collection thereof, can be linked or associated with the subject or subjects to which it pertains (e.g. identifiable data subject 122).
  • the number of same or similar data components within a dataset or other collection can be used to provide such an assessment of re-identification risk.
  • the assessment can provide the k-anonymity property of a given dataset, although other methods of assessing re-identification risk that may be known to persons skilled in the art can be used, including t-closeness, l-diversity, and differential privacy. k-anonymity is a property of a given datum, or set of data (including one or more rows), indicating that such datum or set of data cannot be distinguished from k-1 corresponding data or sets of data; an assessment of k-anonymity may be applied in respect of a particular field or type of metadata in a dataset. A sketch of a k-anonymity computation is given below.
  • the k-anonymity property of data is described in Samarati, Pierangela; Sweeney, Latanya (1998).
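For illustration, a minimal sketch of computing the k-anonymity property of a dataset over a chosen set of quasi-identifier fields; the row and field handling are assumptions made for the example.

```python
from collections import Counter

def k_anonymity(rows: list, quasi_identifiers: tuple) -> int:
    """Return the k-anonymity of the dataset over the given quasi-identifiers:
    the size of the smallest equivalence class, i.e. the minimum number of rows
    sharing any one combination of quasi-identifier values."""
    classes = Counter(tuple(row[f] for f in quasi_identifiers) for row in rows)
    return min(classes.values()) if classes else 0

# e.g. k_anonymity(records, ("age_range", "zip_prefix")) -> 4 means every
# (age_range, zip_prefix) combination is shared by at least 4 records.
```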
  • a risk or likelihood of re-identification is assessed for a given dataset, wherein an acceptable threshold may be applied for a given dataset and/or in respect of a particular field or type of metadata within such dataset.
  • for a dataset comprising both personal health information ("PHI") and non-PHI data, a re-identification risk in respect of the PHI data may be provided for the dataset, as well as another re-identification risk in respect of the non-PHI data.
  • for datasets that include PHI, different acceptable threshold risks of re-identification may be applicable than for datasets that do not include PHI.
  • a risk or likelihood of re-identification 210 may be determined for the derived dataset.
  • the re-identification risk 210 may be determined thereafter.
  • the dataset may be made available to particular users. This availability may be a function of the sensitivity of values in the dataset (e.g. whether it contains PHI or personal financial information ("PFI")), the risk of re-identification (e.g. likelihood of re-identification 210), or the role or trust-level (e.g. permitted re-identification risk factor 208) of the person/entity to whom the dataset is being made available (e.g. data requestor 110).
  • the re-identification risk or estimated likelihood of re-identification 210 may be associated with the concept of zones of trust, or location-based de-identification controls.
  • the dataset is sent to (or made available to) approved targets, without reference to the location of the target or the security features/risks associated with the target’s location. This may expose a potential risk of re-identification.
  • a Risk Acceptability Threshold (RAT) may be used based on a determination of the specific risks associated with the circumstances of a data usage instance.
  • a data usage instance may relate to circumstances including a risk or sensitivity associated with the dataset, which may relate to one or both of a re-identification risk and/or the sensitivity of such data, an indication of user trust (e.g. the permitted re-identification risk value 208 relating to a level of authorisation or trust associated with a given user or entity in association with, in accordance with some embodiments, a sensitivity or sensitivities of the dataset), and/or a location-based and/or security-based risk assessment of the computing devices to which the dataset is to be provided, which may include associated or intermediary computing devices (e.g. if a computing device is highly secure, but data must be transmitted or conveyed thereto via less secure intermediary devices, this may be taken into consideration, in accordance with some embodiments).
  • RAT may be determined as Max(Dataset risk, User trust, Location controls), as sketched below.
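Read literally, and purely as a sketch under the assumption that all three factors are expressed as risk scores on a common scale, the most restrictive (highest-risk) factor governs:

```python
def risk_acceptability_threshold(dataset_risk: float, user_trust_risk: float,
                                 location_risk: float) -> float:
    """Max(Dataset risk, User trust, Location controls): the highest risk score
    among the three factors sets the governing acceptability threshold."""
    return max(dataset_risk, user_trust_risk, location_risk)
```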
  • An exemplary process in accordance with embodiments hereof may include: (1) optionally first determining a RAT associated with a particular collection of data; (2) applying de-identification or obfuscation to specific fields in accordance with methods disclosed hereunder to generate a de-identified dataset; (3) calculating the risk for each record (e.g. data component) in the dataset using a re-identification risk calculation process (e.g. a k-anonymity determination algorithm); (4) applying a filter to the data to meet a designated Risk Acceptability Threshold, as sketched below; and/or (5) restricting the dataset destination to only those targets that meet the Risk Acceptability Threshold.
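Step (4) of the process above might be sketched as follows, expressing the Risk Acceptability Threshold as a minimum k-anonymity over designated quasi-identifiers (an assumption made for this example):

```python
from collections import Counter

def filter_to_threshold(rows: list, quasi_identifiers: tuple, min_k: int) -> list:
    """Keep only records whose quasi-identifier equivalence class contains at
    least min_k rows, so that the released dataset meets the designated
    Risk Acceptability Threshold."""
    classes = Counter(tuple(r[f] for f in quasi_identifiers) for r in rows)
    return [r for r in rows
            if classes[tuple(r[f] for f in quasi_identifiers)] >= min_k]
```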
  • the location-control indication may be a pre-determined value associated with specific types of locations, or it may be determined in an ad hoc manner based on access or security characteristics associated with a specific location. For example, if a given dataset is associated with a 10% RAT, the dataset could be restricted to locations that meet the necessary location-control indication.
  • the location-control indication may be associated with a "zone of trust", which, possibly based on its security and/or the ability of third parties to access it, may allow for the provision of more sensitive or risk-prone datasets.
  • zones of trust may be determined in advance or dynamically depending on criteria relating to security or to indications of such security. Either such case (i.e. pre-determined or dynamically determined based on criteria and/or circumstances), may, in accordance with various embodiments, constitute a designated zone of trust.
  • where a given dataset includes data components that present a given k-anonymity property (or other re-identification risk determination) indicating a risk that is too high for release to, or use by, a given user (or at a user location), additional data components may be derived for a different dataset that, while relating to the same data objects, increase the k-anonymity score. This might include replacing all data components appearing within the dataset that include an age with a data component that indicates an age range. While this may minimally reduce the informational effectiveness for a researcher, for example, it may nevertheless significantly reduce the re-identification risk.
  • the possible users, locations, and/or user-location combinations that can access or have the dataset delivered thereto may be accordingly increased. Since there is a metric (e.g. RAT) applied to dataset risk, user trust, and location-risk, the system can automatically derive further obfuscated data components for generating new datasets. In some embodiments, the user can indicate which fields should be preferentially obfuscated (or further obfuscated) so as to minimally impact informational effectiveness.
  • selectively fulfilling a data request means that a request may or may not be fulfilled.
  • the request may be fulfilled in some embodiments, for example, when a risk of re-identification, as indicated by the re-identification risk value associated with a data request, is lower than the maximum permitted under the circumstances.
  • Such circumstances may include, but are not limited to: the types of sensitivity (which may be referred to in some cases as an authorisation level) associated with the data being returned in response to a data request; whether or not the request has originated from, or the data is being provided to or accessed from, a designated zone of trust; and/or the identity, role, or other characteristic of the individual or entity making the data request.
  • selectively fulfilling a data request includes circumstances where the context-specific dataset may not be provided.
  • some, but not all embodiments may result in further actions, including, but not limited to, dynamically creating new datasets based on other key-value logical rows that have been further obfuscated, dynamically creating new but further obfuscated key-value logical rows, or limiting distribution to (or access from) certain types of designated zones of trust.
  • system 100 in accordance with different embodiments, may be further configured to provide improved data management features of real, de-identified and/or synthetic data.
  • this may include, without limitation:
  • Metadata management: managing data that describes and augments datasets, such as the time and location at which it was acquired. Metadata can describe data at varying levels of granularity (aggregate dataset, file, column, row, cell, etc.).
  • A first example of an evolving research program using system 100, in accordance with one embodiment, is schematically illustrated in Figure 3.
  • This example shows an exemplary dataset related to cholesterol (e.g. a source dataset 114 comprising, for instance, real cholesterol values for various subjects) being used in a governed research study.
  • access to this dataset is given to a researcher (e.g. data requestor 110) and is managed via system 100.
  • a researcher may at first have access only to low-risk synthetic data 302 (e.g. a context-specific dataset 206 comprising only synthetic data component values 270).
  • the researcher may then transition, in accordance with the permission granted by the data owner of the exemplary dataset, through various stages of access to different datasets having associated therewith gradually higher privacy risks.
  • the researcher may gradually gain access to a context-specific dataset 304 (e.g. a context-specific dataset 206 comprising interleaved de-identified data component values 260 and synthetic data component values 270), and later to a context-specific dataset 306 comprising only de-identified data component values 260.
  • finally, the researcher may be granted access to real data 308 (e.g. the context-specific dataset 206 comprising only genuine data component values 120).
  • the research lifecycle of data may be characterised by increasing levels of trust invested in a research project by a data owner or administrator.
  • the goal may be to promote research, while managing the risk to patient privacy.
  • a data requestor or consumer (e.g. a researcher) in a new research program may only be given access to synthetic data. They may not yet have sufficient clearance to access highly sensitive PHI, and they may not yet have a mature process in place for protecting sensitive data.
  • a data owner may thereby minimise the risk that data is exposed and a privacy violation triggered.
  • This phase may, in accordance with some embodiments, be dominated by experimentation and iterative algorithm development.
  • Synthetic data is very well suited to a researcher in this phase. As their work progresses, the researcher may want to progress to working with a hybrid of synthetic data and real, but de-identified, data. This is a logical step as methods mature, progressing the researcher towards real data while minimising the risk to patient confidentiality.
  • a further research step may comprise allowing a researcher to access and/or use a full dataset of real, but de-identified data, thereby allowing the researcher to, for instance, measure and account for any artificial effects arising from the use of solely synthetic data and that are not mirrored in a real dataset.
  • the risk to patient privacy in such a step is low, as data is still de-identified; however, it will be appreciated that the risk to patient privacy is not zero, as de-identified data may still be re-identified (i.e. a unique person may still be associated with a previously anonymised record) via statistical techniques, as well as via correlation with publicly available datasets.
  • de-identification may be tuned, in accordance with various embodiments, using a risk score that quantifies the potential for re-identification (e.g. which may be part of the likelihood of re-identification 210 of the resulting context-specific dataset 206).
  • a final step for a researcher may comprise validation using a real dataset.
  • a researcher may be assumed, for instance, to have developed a well-articulated hypothesis that is proven against de-identified data.
  • the last step may therefore comprise establishing that no artefacts have been introduced to a result through the use of de-identified data, or the process of de-identifying data.
  • the relationship between the researcher and the data owner is mature, and the data owner may trust that the researcher’s processes for working with data are rigorous and will not lead to privacy violations.
  • the researcher may accordingly be certified for accessing genuine and/or sensitive data.
  • system 100 empowers the data administrator to control the movement of a researcher through each step of research.
  • System 100 has the feature that it automatically provides a view of data tailored to the entitlements (e.g. the permitted re-identification risk value 208) of a researcher at that moment or that research stage, in accordance with some embodiments.
  • Figure 4 schematically illustrates another example of a researcher interacting with data in a single virtual location.
  • a privacy-preserving data management system (e.g. system 100) automatically provides a view (e.g. different context-specific datasets 206) in accordance with a current user entitlement.
  • the platform manages multiple pre-generated columns (dataset 402) with different risk profiles.
  • Upon receiving a data request, the system automatically provides access to the correct concrete column instance for a researcher’s entitlement (e.g. the context-specific dataset 206 for this particular source data usage instance).
  • in a first data usage instance, dataset 404 is provided, which comprises only synthetic data component values.
  • a second data usage instance may have access to dataset 406 enabled, wherein the second dataset 406 comprises both synthetic data component values and de-identified component values.
  • a third data usage instance may have access to dataset 408, which comprises only de-identified data component values (e.g. no longer comprises synthetic values).
  • in a further data usage instance, dataset 410 may be provided, wherein the dataset 410 corresponds to genuine data component values of the source dataset. This removes from the researcher (e.g. data requestor 110) any burden of data management, putting it instead into the hands of a qualified data administrator (e.g. data owner 116).
  • the data requestor 110 interacts with data from a single source.
  • the data requestor 110 does not have to manage multiple different files representing the different phases of the data request process (which adds a massive administration burden, as well as likelihood for errors and privacy violations). Instead, the data requestor 110 always interacts with data from one location, and receives a “live” view of the data that is tailored to their current entitlement(s) (e.g. the contextspecific dataset 206).
  • system 100 may be configured to adjust views (e.g. access to different instances of context-specific datasets 206) based on various factors, such as the identity of the requestor 110, their relationship with the data owner 116, the phase of research, and/or the characteristics of the data (e.g. data sensitivity, when data was acquired, the data owner’s name, or the like); a sketch of such entitlement-based view selection is given below.
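As a non-limiting sketch of such entitlement-based view selection; the entitlement labels and dataset references mirror the Figure 4 example but are otherwise assumptions made for the example.

```python
# Hypothetical mapping from entitlement stage to pre-generated column instance;
# the dataset labels follow the reference numerals of Figure 4.
ENTITLEMENT_VIEWS = {
    "synthetic_only": "dataset_404",
    "synthetic_and_deidentified": "dataset_406",
    "deidentified_only": "dataset_408",
    "genuine": "dataset_410",
}

def resolve_view(entitlement: str) -> str:
    """Return the concrete column instance for the requestor's current
    entitlement; unknown entitlements fall back to the lowest-risk view."""
    return ENTITLEMENT_VIEWS.get(entitlement, "dataset_404")
```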
  • through such views of data (e.g. access to context-specific dataset 206), various other elements of data governance may be accommodated.
  • data requestor 110 can locate datasets, request access, use earlier versions, etc., in accordance with various embodiments.

Abstract

Described are various embodiments of a privacy-preserving data governance system and method.

Description

DATA GOVERNANCE SYSTEM AND METHOD
RELATED APPLICATION
[0001] The instant application claims the benefit of priority to U.S. Provisional Patent Application serial number 63/251,936, entitled: “DATA GOVERNANCE SYSTEM AND METHOD”, and filed October 4, 2021, which is herein fully incorporated by reference.
FIELD OF THE DISCLOSURE
[0002] The present disclosure relates to scalable, secure and policy-compliant distributed data storage systems, and, in particular, to a privacy-preserving data governance system and method.
BACKGROUND
[0003] Medical data has enormous value to researchers in healthcare, but this value is often not realised because of the need to protect patient confidentiality. Access to Protected Health Information (PHI) is highly controlled in most jurisdictions because of its sensitivity. In the US, the HIPAA Privacy Rule outlines expectations on organisations for protecting from disclosure PHI that they hold. In Europe, GDPR is a more generalised privacy legislation that includes data concerning health and genetic information. GDPR is highly prescriptive and includes extremely punitive fines for violation of an individual’s right to privacy. While these legislative frameworks have the patient’s best interest in mind, they have been observed to stifle research. This can be either due to adding administrative complexity for patients who would otherwise be willing to contribute their data to a research study, or creating too much difficulty for researchers needing access to large volumes of data to pursue their investigations.
[0004] Data de-identification is one technique that helps overcome legislative hurdles. Algorithms exist to anonymise a dataset so that it can be distributed to researchers without triggering a privacy violation. This might include techniques such as removing patient names, which are considered primary identifiers. However, quasi-identifiers also commonly exist in data records. These can often be used alone or in combination with other identifiers to re-identify an individual. For example, a simple record containing a birthdate and US zip code can be tied to an individual person over 50% of the time by correlating with public databases. Therefore, quasi-identifiers should be transformed to make them more ambiguous by lowering their fidelity. Substituting age for a birthdate is a common example that in most cases does not overtly degrade other underlying statistical relationships in the data.
[0005] Another solution that has shown recent promise to open medical data for researchers is the use of synthetic data. The generated dataset contains no real measured data, though under casual inspection and statistical or other analysis, it appears to be a real dataset. In the case of a medical dataset containing PHI (such as name, age, results of a cholesterol test, etc.), the synthetic data would include the same columns, and it would mirror the underlying statistical relationships in the real dataset (such as age as a loosely correlating factor). However, it would not actually describe any real, existing individuals.
[0006] Research conclusions should always be validated, however, by ultimately testing any new hypothesis on real data. Despite the promise of synthetic datasets to mirror the statistics of a real dataset, one should always exercise healthy skepticism that a genuine underlying cause is being measured in the generated data. Prudent researchers, therefore, use synthetic data for model development and testing, but real data for the actual study. Therefore, both real data and synthetic datasets must be made available, but carefully managed. If a real dataset was mistaken for a synthetic one and released to general researchers who do not have clearance to access PHI, it could trigger a serious compliance violation around maintaining patient confidentiality.
[0007] Some organisations might generate synthetic data using capabilities in a generalised health data platform. For example, certain products include features to generate synthetic data from real data so researchers can export and share a safe (though synthetic) dataset with collaborators. However, these products are designed as a generalised data analysis platform; they do not govern distribution of synthetic data files or manage an orderly transition of a researcher through all steps of the research lifecycle from development through to validation.
[0008] This background information is provided to reveal information believed by the applicant to be of possible relevance. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art or forms part of the general common knowledge in the relevant art.
SUMMARY
[0009] The following presents a simplified summary of the general inventive concept(s) described herein to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is not intended to restrict key or critical elements of embodiments of the disclosure or to delineate their scope beyond that which is explicitly or implicitly described by the following description and claims.
[0010] A need exists for a privacy-preserving data governance system and method that overcome some of the drawbacks of known techniques, or at least, provides a useful alternative thereto. Some aspects of this disclosure provide examples of such systems and methods.
[0011] In accordance with one aspect, there is provided a privacy-preserving data management system for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the system comprising a plurality of network-accessible hardware storage resources, each of the hardware storage resources being in network communication and configured for distributed storage of a source dataset, the source dataset comprising a plurality of source data objects each comprising constituent genuine data component values that are associated with a corresponding data subject. The system further comprises a digital data processor for receiving and responding to the data request, the digital data processor being communicatively linked to a network via a communication bus, the digital data processor configured to generate a plurality of synthetic data component values preserving, at least in part, one or more relationships between the genuine data component values amongst at least some of the plurality of source data objects, store the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generate a context-specific dataset, wherein the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some synthetic data component values depending on, at least in part, the permitted access privilege.
[0012] In one embodiment, the synthetic data component values are generated using a generative model.
[0013] In one embodiment, the digital processor is further configured to generate de-identified data component values corresponding to at least some of the genuine data component values, and replace in the context-specific dataset at least some of the genuine data component values with the corresponding de-identified data component values depending on, at least in part, the permitted access privilege.
[0014] In one embodiment, replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
[0015] In one embodiment, the context-specific dataset comprises at least some of the genuine data component values of the source dataset.
[0016] In one embodiment, the context-specific dataset is generated before the data request is received.
[0017] In one embodiment, the permitted access privilege is based on an estimated likelihood a given data object in the context-specific data set can be associated with one of the identifiable data subjects.
[0018] In one embodiment, the permitted access privilege is based on one or more access permissions associated with the data requestor.
[0019] In one embodiment, the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on the permitted access privilege.
[0020] In one embodiment, each of the network-accessible hardware storage resources further comprises a key -value store configured to store a unique key -value logical row for each of the data objects.
[0021] In one embodiment, each key-value logical row comprises a key, a metadata descriptor, and a data object identifier.
[0022] In one embodiment, the key-value logical row comprises at least one of authorization information, data sensitivity information, or timestamp information.
[0023] In one embodiment, the key-value logical row comprises a key-value logical row access authorisation value for restricting access to the corresponding key -value logical row, the authorisation value based at least in part on the permitted access privilege.
[0024] In accordance with another aspect, there is provided a computer-implemented privacy-preserving data management method, automatically implemented by one or more digital processors, for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the method implemented on a data management system comprising a digital processor for receiving the data request and a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of the plurality of source data objects comprising constituent genuine data component values that are associated with a corresponding data subject. The method comprises generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects, storing the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset. In one embodiment, the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some of the synthetic data component values depending on, at least in part, the permitted access privilege.
[0025] In one embodiment, the synthetic data component values are generated using a generative model.
[0026] In one embodiment, the method further comprises generating de-identified data component values corresponding to at least some of the genuine data component values, and replacing in the context-specific dataset at least some of the genuine data component values with the corresponding de-identified data component values depending on, at least in part, the permitted access privilege.
[0027] In one embodiment, replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
[0028] In one embodiment, the context-specific dataset is generated to comprise at least some of the genuine data component values from the source dataset.
[0029] In one embodiment, generating the context-specific dataset is done before the data request is received.
[0030] In one embodiment, the access privilege is based on an estimated likelihood a given data object in the context-specific data set can be associated with one of the identifiable data subjects.
[0031] In one embodiment, the permitted access privilege is based on one or more access permissions associated with the data requestor.
[0032] In one embodiment, the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on the permitted access privilege.
[0033] In one embodiment, the access privilege comprises a risk acceptability threshold (RAT).
[0034] In accordance with another aspect, there is provided a computer-readable medium having stored thereon instructions for execution by a computing device for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the computing device being in network communication with a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of the source data objects comprising constituent genuine data component values that are associated with a corresponding data subject. The instructions are executable to automatically implement the steps of generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects, storing the plurality of synthetic data component values, and, in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset. In one embodiment, the context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some of the synthetic data component values based at least in part on the permitted access privilege.
[0035] In one embodiment, the synthetic data component values are generated using a generative model.
[0036] In one embodiment, the steps further comprise generating de-identified data component values corresponding to at least some of the genuine data component values, and storing the de-identified data component values, wherein the context-specific dataset is generated to further comprise at least some of the de-identified data component values depending on, at least in part, the permitted access privilege.
[0037] In one embodiment, replacing some of the genuine data component values with the corresponding de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on the permitted access privilege.
[0038] In one embodiment, the context-specific dataset is further generated to comprise at least some of the genuine data component values from the source dataset.
[0039] In one embodiment, generating the context-specific dataset is done before the data request is received.
[0040] In one embodiment, the permitted access privilege is based on an estimated likelihood a given data object in the context-specific data set can be associated with one of the identifiable data subjects.
[0041] In one embodiment, the permitted access privilege is based on one or more access permissions associated with the data requestor.
[0042] In one embodiment, the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on the permitted access privilege.
[0043] In one embodiment, the permitted access privilege is a risk acceptability threshold (RAT).
[0044] Other aspects, features and/or advantages will become more apparent upon reading of the following non-restrictive description of specific embodiments thereof, given by way of example only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE FIGURES
[0045] Several embodiments of the present disclosure will be provided, by way of examples only, with reference to the appended drawings, wherein:
[0046] Figure 1 is a schematic diagram illustrating a privacy-preserving data access system, in accordance with one embodiment;
[0047] Figures 2A and 2B are schematic diagrams illustrating a privacy-preserving data access method using the system of Figure 1, in accordance with one embodiment;
[0048] Figures 3 and 4 are schematic diagrams illustrating exemplary use cases of the method of Figures 2A and 2B, in accordance with different embodiments; and
[0049] Figure 5 is a schematic diagram illustrating exemplary machine learning approaches, in accordance with various embodiments.
[0050] Elements in the several figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be emphasised relative to other elements for facilitating understanding of the various presently disclosed embodiments. Also, common, but well-understood elements that are useful or necessary in commercially feasible embodiments are often not depicted in order to facilitate a less obstructed view of these various embodiments of the present disclosure.
DETAILED DESCRIPTION
[0051] Various implementations and aspects of the specification will be described with reference to details discussed below. The following description and drawings are illustrative of the specification and are not to be construed as limiting the specification. Numerous specific details are described to provide a thorough understanding of various implementations of the present specification. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of implementations of the present specification.
[0052] Various apparatuses and processes will be described below to provide examples of implementations of the system disclosed herein. No implementation described below limits any claimed implementation and any claimed implementations may cover processes or apparatuses that differ from those described below. The claimed implementations are not limited to apparatuses or processes having all of the features of any one apparatus or process described below or to features common to multiple or all of the apparatuses or processes described below. It is possible that an apparatus or process described below is not an implementation of any claimed subject matter.
[0053] Furthermore, numerous specific details are set forth in order to provide a thorough understanding of the implementations described herein. However, it will be understood by those skilled in the relevant arts that the implementations described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the implementations described herein.
[0054] In this specification, elements may be described as “configured to” perform one or more functions or “configured for” such functions. In general, an element that is configured to perform or configured for performing a function is enabled to perform the function, or is suitable for performing the function, or is adapted to perform the function, or is operable to perform the function, or is otherwise capable of performing the function.
[0055] It is understood that for the purpose of this specification, language of “at least one of X, Y, and Z” and “one or more of X, Y and Z” may be construed as X only, Y only, Z only, or any combination of two or more items X, Y, and Z (e.g., XYZ, XY, YZ, ZZ, and the like). Similar logic may be applied for two or more items in any occurrence of “at least one ...” and “one or more...” language.
[0056] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
[0057] Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one of the embodiments” or “in at least one of the various embodiments” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” or “in some embodiments” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments may be readily combined, without departing from the scope or spirit of the innovations disclosed herein.
[0058] In addition, as used herein, the term “or” is an inclusive “or” operator, and is equivalent to the term “and/or,” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of "a," "an," and "the" include plural references. The meaning of "in" includes "in" and "on."
[0059] As used in the specification and claims, the singular forms “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.
[0060] The term “comprising” as used herein will be understood to mean that the list following is non-exhaustive and may or may not include any other additional suitable items, for example one or more further feature(s), component(s) and/or element(s) as appropriate.
[0061] The systems and methods described herein provide, in accordance with different embodiments, different examples of a data governance platform that provides the ability to create different views of data comprising interleaved synthetic data and de-identified data that are derived from the actual or live data, as a function of the contextual requirements relating to the data and/or data consumer. Such requirements often include privacy compliance but may also include other administrative, analytics, management, or use-related requirements.
[0062] As noted above, there are multiple cases where both real and synthetic datasets must be made available to an external party. However, these datasets must be carefully managed. If a real dataset were mistaken for a synthetic one and released to general researchers who do not have clearance to access, for example, PHI, it could trigger a serious compliance violation around maintaining patient confidentiality.
[0063] In addition, synthetic data is a good solution for accelerating research programs in their early phases; however, the addition of a new dataset with different privacy obligations does create management challenges. Large research organisations have difficulty scaling and managing data. Modern research protocols often produce very high data volumes and complex datasets spread across multiple source files. Organisations must serve a large population of researchers with differing needs from data and at different stages of their research. A significant problem exists around governing how data is managed during the research life cycle, from development, to test, to final validation. Adding synthetic datasets, which have a relationship to real data and are customised to each researcher’s needs and entitlements to work with sensitive data, exacerbates this already significant problem.
[0064] Similarly, de-identification of data is a well-established technology. Nevertheless, it is easy to do wrong, with often catastrophic results for patient privacy. Typically, de-identified datasets are generated as separate, distinct files that are managed separately from the original real data from which they were derived.
[0065] Synthetic data generation is a maturing technology that has made its way from academic journals to trade engineering publications. There are also dramatic examples of the use of synthetic data on the Internet showing its application to various datasets, including, for instance, human faces. Generally, and in accordance with various embodiments, synthetic data may be broadly understood as relating to ‘artificial’ data that is generated from real data, while maintaining or replicating underlying statistics of the real data.
[0066] Different techniques exist to generate a synthetic dataset from an existing dataset. For example, Figure 5 shows an exemplary taxonomy of generative models 500 that may be employed to this end, in accordance with different embodiments. In this example, generative models 500 may relate to explicit density models 502, which may in turn relate to tractable density models 504 or approximate density models 506. While tractable density models 504 often relate to fully visible belief nets 508 (e.g. NADE, MADE, PixelRNN models, or the like), approximate density models 506 may comprise variational models 510 (e.g. variational autoencoders, or VAEs, 512) or Markov chain models 514 (e.g. Boltzmann machines 516). Implicit density models 518, on the other hand, may comprise direct generative models 520, such as generative adversarial networks (GANs) 522, or implicit density Markov chain models 524 (e.g. a GSN 526). While various embodiments herein described relate to the generation of synthetic data using GANs 522 or VAEs 512, it will be appreciated that various other embodiments relate to the generation of synthetic data using alternative machine learning (ML) or generative models 500, non-limiting examples of which are schematically shown in Figure 5. In accordance with yet other embodiments, synthetic data may be generated in accordance with or by a deep learning platform.
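By way of non-limiting illustration only, the following minimal sketch shows the general shape of a GAN used to produce a synthetic tabular dataset. The column semantics, network sizes, and training schedule are assumptions of this illustration and do not form part of the disclosed embodiments.

    # Minimal GAN sketch for tabular synthetic data (illustrative only; the
    # network sizes and training loop are assumptions, not the specific
    # generative models disclosed herein).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy "real" data: two standardised numeric columns (e.g. age, cholesterol).
    real = torch.randn(1024, 2) * torch.tensor([1.0, 0.5]) + torch.tensor([0.0, 2.0])

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # generator
    D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        # Discriminator step: real rows labelled 1, generated rows labelled 0.
        z = torch.randn(128, 8)
        fake = G(z).detach()
        batch = real[torch.randint(0, len(real), (128,))]
        loss_d = bce(D(batch), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator step: try to make D label generated rows as real.
        z = torch.randn(128, 8)
        loss_g = bce(D(G(z)), torch.ones(128, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # A synthetic dataset mirroring the statistics (not the records) of `real`.
    synthetic = G(torch.randn(1024, 8)).detach()
    print(real.mean(0), synthetic.mean(0))  # the column means should roughly agree

In this sketch, the generator never sees individual real records at output time; it learns only to reproduce the distribution of the training data, which is the sense in which the resulting file "mirrors the structure of the original" without containing records from real people.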
[0067] While a generative model such as a GAN may replicate the underlying statistics of real data in an artificial dataset, such techniques generate data as a file that may be distributed using conventional means, such as FTP, email, a shared file system, or the like. There is generally no governance (e.g. the control of data throughout its lifecycle by an administrative authority) put in place to manage the synthetic data output. That is, there is generally no enforcement of access control to ensure that only the authorised party can access data, control copies of data, audit data usage, delete data after it is no longer needed, show data provenance or file versioning, or the like.
[0068] For example, a data administrator can easily apply a GAN on a real dataset to create a new, synthetic dataset. The result would likely be a simple file of data that mirrors the structure of the original, but contains no actual records from real people. But once this is given to a researcher, there is a loss of context and control. The synthetic data passes from control of the administrator to control of the researcher. This handoff is the source of a number of problems, despite the fact that the data contains no actual sensitive personal information.
[0069] The first set of problems relates to the governance of the synthetic data. Synthetic data is typically disconnected from an original dataset, despite being derived from the latter (or from a closely related dataset). The loss of this relationship creates a number of problems. Orphaned synthetic datasets have no provenance. Provenance is very important, as researchers need to be confident that there is a well-documented path of transformations, queries and filters taken by the dataset they are working on. This could include multiple steps, from original acquisition, to cleanup (aka data wrangling), filtering, joining with other datasets, de-identification, synthetic data generation, or the like. Provenance matters because it documents a chain of steps that should be reproducible should any questions arise around data integrity in the course of research or after publication of results. It is also common for publications to require publication of datasets (subject to privacy issues), and it is crucial that a researcher have confidence in their ability to reproduce resulting data whenever this is mandated. A synthetic data file may also have nothing to identify it as synthetic data, which could inadvertently trigger a HIPAA or GDPR investigation if it is stolen. There may be no way to version the files and roll back if an error is found. The researcher’s use of a casually distributed synthetic dataset is not subject to any kind of audit. They could make infinite copies and distribute these however they like. There is no way to cut off a researcher’s access if the relationship breaks down, as the data may reside on backups outside of the control of the data administrator. There is no way to delete or pull back a dataset once it has been distributed.
[0070] Furthermore, there is no management of the use of data through the research life cycle. All research goes through phases. For example, and in accordance with some embodiments, a researcher may begin with initial development, phasing into increasingly detailed testing, and finally to validation on real data and determination of conclusions. This life cycle may gradually reveal sensitive source data according to the increasing trust in a researcher. Under such a research model, a researcher may commence work using fully synthetic datasets (which have no privacy risks). They then transition to semi-synthetic data (a file that interleaves real but de-identified fields with fully synthetic fields to minimise the risk of an individual being identified and their sensitive personal data being revealed). The next step moves to fully real data that has been sufficiently de-identified to preserve patient privacy (increased risk, but quantifiable; this dataset should demonstrate true underlying relationships). Finally, hypotheses should be validated on real data (where there is a high risk to patient confidentiality but the researcher is now trusted). This final step is the gold standard in validation of the research. Unfortunately, the above represents an idealised sequencing that, using conventional research methods, would be extremely labour- and process-intensive to implement.
[0071] Good governance also attempts to minimise proliferation of file copies, or near copies (in the case of de-identified data, or synthetic data), and may also work to delete expired datasets so they are not lost or orphaned after their initial purpose is complete. Both de-identified and synthetic data are useful tools for researchers, but these must be carefully managed along with real data. There is a strong relationship between these data, and managing them at scale, with complex, large datasets and large populations of researchers, without violating patient confidentiality, remains a very difficult challenge.
[0072] Thus, in accordance with different embodiments, the systems, devices and methods described herein implement the abovementioned data governance of real, de-identified and synthetic datasets, and promote the relationship therebetween. They enable a binding relationship between real, de-identified, and synthetic data, as well as other related and derived works. This relationship can be used to better govern these datasets in a way that promotes use of data without violating security and privacy requirements. For example, a synthetic data file of cholesterol tests, and a real dataset from which it is derived (which would demand special handling because it contains PHI), have a direct relationship, even though they contain different data. There is a morphologic similarity (e.g. types, such as birthdates, appear in each), as well as a descendant relationship (the synthetic data is derived from the real data using, for instance, a GAN or other generative model). There are similar relationships for de-identified datasets derived from real datasets. By using this close relationship, a system or method, in accordance with various embodiments, may provide views of either a synthetic dataset, a real dataset, or a hybrid combination of the two, to an authorised researcher. These views can be managed by a data owner or administrator, who is responsible for guiding a researcher to use data that are appropriate for where they are in their research timeline. These views may reflect a researcher’s immediate relationship with the data owners.
[0073] With reference to Figure 1, and in accordance with one exemplary embodiment, a privacy-preserving data governance system, generally referred to using the numeral 100, will now be described. As will be made clear below, in this exemplary embodiment, system 100 is directed to providing data governance or management capabilities over the lifecycle of real, de-identified and/or synthetic data together as related elements. This provides for the management of sensitive data, balanced with accessibility for researchers to promote valid use of the data.
[0074] As illustrated schematically in Figure 1, system or platform 100 generally comprises a computing device 101, the device comprising at least one digital data processor 104 communicatively linked to accessible memory 106 and a communication bus 108, the communication bus 108 itself configured to be in network communication with a data requestor 110 and a plurality of remote independent network-accessible hardware storage resources 112.
[0075] In some embodiments, each of the hardware storage resources 112 has stored thereon at least one dataset. Generally, these datasets will include a plurality of data objects, each having corresponding constituent data element values. In the example of Figure 1, this may include a source dataset 114 comprising source data objects 118, each source data object 118 comprising genuine data component values 120 which correspond to or are associated with a corresponding identifiable data subject 122 (i.e. name, address, postal code, etc.). Generally, these genuine data component values 120 may also include privacy-sensitive information (e.g. protected health information (PHI) of various individuals, such as birth dates, medical test results, medical images, or the like, which may comprise “raw” or processed values), but could also include other types of privacy-sensitive information or data (e.g. financial information, or the like). In general, access to a source dataset 114 will be under the supervision or administration of a data owner or administrator 116, which has full control of the parameters under which the source dataset 114 may be accessed by the data requestor 110 via system 100. Thus, system 100 allows data owners 116 to find their own balance between protecting sensitive data and promoting research interests via increased access to data.
[0076] In some embodiments, the digital data processor 104 may be configured to respond to data storage requests received over a network and relating to the data objects 118. In some embodiments, the network communications interface communicatively interfaces one or more requesting users (e.g. data requestor 110) with a key-value store stored on one or more of the hardware storage resources 112. A key-value store configured to store a unique key-value logical row for each constituent data object component of each data object may comprise, in accordance with some embodiments, for each such row, a key, a constituent data component value, and a metadata descriptor. At least one of the key-value logical rows for a given data object may be directly associated with source data, and at least one of the key-value logical rows of the given data object is derived from one or more other key-value logical rows. In response to a data access request from data requestor 110 based on a given metadata descriptor, the digital data processor 104 may generate an independent dataset from the key-value store by accessing those key-value logical rows having metadata descriptors responsive to the data access request.
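By way of non-limiting illustration only, the following minimal sketch models key-value logical rows and the generation of an independent, context-specific dataset from them. The field names, descriptor values, and visibility labels are assumptions of this illustration rather than a disclosed schema.

    # Minimal sketch of key-value logical rows and context-specific dataset
    # generation (field names and labels are illustrative assumptions).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LogicalRow:
        key: str            # unique identifier for the logical row
        object_id: str      # data object (e.g. patient) the row belongs to
        descriptor: str     # metadata descriptor, e.g. "age_range", "postal3"
        value: object       # constituent data component value (or a reference)
        visibility: str     # e.g. "PHI", "DEIDENTIFIED", "PUBLIC"

    store = [
        LogicalRow("k1", "patient-17", "name",      "John Smith", "PHI"),
        LogicalRow("k2", "patient-17", "age_range", "30-35",      "DEIDENTIFIED"),
        LogicalRow("k3", "patient-17", "postal3",   "V6B",        "DEIDENTIFIED"),
    ]

    def generate_dataset(store, descriptors, authorisations):
        """Build an independent dataset from rows whose metadata descriptors
        match the request and whose visibility the requestor is authorised for."""
        return [r for r in store
                if r.descriptor in descriptors and r.visibility in authorisations]

    # A researcher authorised only for de-identified data:
    view = generate_dataset(store, {"age_range", "postal3"}, {"DEIDENTIFIED", "PUBLIC"})
    for row in view:
        print(row.object_id, row.descriptor, row.value)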
[0077] In some embodiments, data owner 116 may take the form of any individual and/or private or public organisation (companies, administrative bodies, governmental agencies, etc.) which has ownership of source dataset 114. Generally, data owner 116, via system 100, has full control of the access permissions or entitlements given to data requestor 110. In some embodiments, those access permissions or entitlements may be determined or allocated for a given source data usage instance (e.g. a data request from a data requestor 110 for a data object related to an identifiable data subject 122). For example, a particular data requestor 110 may be assigned an access privilege based on, for instance, the level of trust that the data owner 116 has in the requestor 110. Such an access privilege may be based on, for instance, a position or role of the requestor (e.g. doctor, hospital administrator, financial auditor, or the like), a level of trust that has been otherwise established between the data owner 116 and the requestor 110, and/or a perceived or quantifiable likelihood that data access for a particular requestor may lead to re-identification of any data with an identifiable data subject 122. Additionally, or alternatively, an access permission may be assigned to a source data usage instance based on, for instance, the stage of a research lifecycle associated with a particular data usage instance.
[0078] It will be appreciated that a source data usage instance may comprise a query or other form of data request, or a plurality thereof. For example, a source data usage instance may comprise multiple simultaneous data requests under the same source data usage instance. Additionally, or alternatively, a data usage instance may comprise multiple subsequent data requests associated with, for instance, a particular stage of a research lifecycle. For example, a data requestor 110 may, in accordance with some embodiments, request access to data to evaluate and/or train a health science model. Upon determination of a result from a first iteration of their model, the requestor 110 may then adjust model parameters, and again request access to data to test their updated model. This process may, in accordance with some embodiments, be repeated within the same source data usage instance.
[0079] Such requests, whether simultaneous or sequential, may be constrained by the same access permissions. In some embodiments, data requestor 110 and data owner 116 may negotiate for different access privileges or permissions, upon which a subsequent source data usage instance may be associated with these new privileges or permissions. In some embodiments, these access privileges, permissions, or entitlements may accordingly be based at least in part on a trust threshold associated with the requestor 110, or a stage of research associated with the source data usage instance.
[0080] In accordance with some embodiments, an access privilege associated with a source data usage instance may, additionally or alternatively, relate to an estimated likelihood that a given data object accessed in response to a request may be associated with one of the identifiable data subjects, as further described below. Accordingly, an access privilege may, in accordance with some embodiments, relate to a degree of obfuscation (i.e. de-identification) or generation of synthetic data corresponding to genuine data accessed in response to a data request. Such a likelihood of re-identification may be based on any one or more privacy-preserving processes, a non-limiting example of which may include a differential privacy process.
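By way of non-limiting illustration only, one privacy-preserving process of the kind referenced above is the Laplace mechanism from differential privacy. The following sketch perturbs a counting query; the query and the epsilon value are assumptions of this illustration.

    # Sketch of the Laplace mechanism applied to a count query. A counting
    # query has sensitivity 1, so noise drawn from Laplace(0, 1/epsilon)
    # yields epsilon-differential privacy; smaller epsilon means stronger
    # privacy and noisier answers.
    import numpy as np

    def dp_count(true_count, epsilon):
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    print(dp_count(42, epsilon=0.5))  # e.g. 44.7; the exact noise is random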
[0081] For a given source data object 118 of a source dataset 114, such as, for example, a patient record, or indeed a patient, at least some if not all of the available data may be ingested as individual discrete portions of data, along with a metadata descriptor of each portion. In accordance with some embodiments, a key may be associated with the entry for, in part, future identification. Accordingly, a key-value store may comprise logical rows, wherein each logical row comprises an individual portion of the source data as a constituent data component value (the “value”), an identifier (the “key”), a metadata descriptor, a data object identifier, and optionally, in accordance with different embodiments, additional management information, such as authorisation, sensitivity, or other compliance and/or timestamp information. The key-value store that comprises logical rows, wherein each logical row comprises a constituent data component value and a key identifier, may also be referred to as the key-value pair. The collection of all logical rows for a given data object may comprise the digital asset, which may also include the source data. However, in many embodiments, there may be a logical row associated with the source data; e.g. a patient record in a text file or PDF format.
[0082] The concept of a data object may, in some embodiments, be considered to be broader than the data asset, and may refer to all information, whether existing or potential, regarding any entity, such as a patient, hospital, doctor, bank, transaction, etc. In one exemplary embodiment, considering an existing patient as a data object and a patient record as the source data, a first logical row may consist of an object ID relating to the patient, a unique identifier (the key), a metadata descriptor of “source data”, and a value corresponding to the patient record data file itself. From the source data file, additional logical rows may be created for every discrete portion of source data.
[0083] Additional logical rows can then be derived from the existing logical rows, as well as other applicable information. For example, derived logical rows corresponding to existing logical rows can be generated that aggregate or obfuscate existing logical rows. When combined with specific other logical rows, any of the existing logical rows, either imported (i.e. ingested) or derived (i.e. curated), can be provided along with - or excluded from - access requests associated with the derived logical row. Because there are very few limits on how such derived logical rows can be generated, and all of the data of the data asset are highly granularised to individual discrete pieces of data, provision and use of data associated with any given data object (or class or group of data objects) can be managed at the level of each such piece of data. That is, data may be managed far below the level of the data object or table, which is a limitation in state-of-the-art systems. In some embodiments, the value-portion of a given logical row may be the actual value (“raw” data, images, or the like), or it may be a reference, direct or indirect, to the value and/or storage location of the value.
[0084] In embodiments, a key-value store may be employed for granular governance and flexible curation of digital assets. Embodiments hereof can receive unstructured or structured data as an input. In some cases, the input data could be acquired from a patient record, a financial record or other type of record and can come in several formats such as PDF, CSV or other types of electronic or non-electronic inputs. In accordance with one aspect, a key-value store is a data storage structure designed for storing, retrieving, and managing associative arrays, which contain collections of objects or records, which in turn have different fields within them, each containing data. In some embodiments, the data included in a data collection will have related attributes so that the data can be stored, retrieved and managed in an efficient manner, and this data collection can be derived, generated or calculated after or during curation. These records are stored and retrieved using a key identifier that uniquely identifies the record, and is used to quickly find data within a database. In addition to storing, retrieving, and managing associative arrays using the key identifier, disclosed implementations of the key-value store allow, as will be discussed below, the generation of context-specific datasets that are generated from the key-value store itself (keeping in mind that, in some embodiments, the “value” portion of a logical row can be the associated piece of data, or a reference thereto). Such generated datasets may be based on further utilisation of additional descriptors and indicators, depending on the data access request.
[0085] In some embodiments, the source dataset 114 may comprise any type of source data in various formats, including PDF files, text files, CSV, database information, images, or spreadsheet documents, that is extracted and stored as a data object comprising a key-value logical row, which comprises at least constituent data component values and the associated metadata descriptors. The data object is associated with source dataset 114, as well as all other logical rows that have been or may be created. Multiple and separate records relating to a data object, e.g. a patient, may constitute an example where a data object may be associated with more than one source dataset. In some embodiments, at run time and/or subsequent to the ingestion or receipt of the source data, metadata of the source data may be collected, derived, or formulated and stored as key-value logical rows, each with its unique key, constituent data component values, and associated metadata descriptor. In embodiments, the metadata associated with a given logical row is a type of data that describes and gives information about the data to which the logical row pertains. For example, the metadata could be “raw data”, “file type”, “patient ID”, “name”, with the value associated therewith, as extracted from the source data or derived from other data, stored in the same logical row. Each collected, derived, or formulated key-value entry is stored in the key-value data store as a key-value logical row, the rows collectively forming a data asset or a portion thereof. Examples of metadata include the name of the file, the type of the file, the time the file was stored, the source (e.g. raw) data itself, and the information regarding who stored the file. The collected information may be parsed and saved in a key-value store as a key-value logical row with its respective key for unique identification, constituent data component value, and metadata descriptors.
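By way of non-limiting illustration only, the following sketch shows one possible ingestion of a single source record into key-value logical rows; the CSV columns, descriptor names, and identifier scheme are assumptions of this illustration.

    # Sketch of ingesting one source record into key-value logical rows
    # (the CSV columns and descriptor names are illustrative assumptions).
    import csv, io, uuid

    source = "first_name,last_name,birth_year\nJohn,Smith,1988\n"

    rows = []
    record = next(csv.DictReader(io.StringIO(source)))
    object_id = f"patient-{uuid.uuid4()}"

    # One logical row keeps the raw source data itself ...
    rows.append({"key": str(uuid.uuid4()), "object_id": object_id,
                 "descriptor": "source_data", "value": source})
    # ... and one logical row is acquired per discrete portion of the record.
    for field, value in record.items():
        rows.append({"key": str(uuid.uuid4()), "object_id": object_id,
                     "descriptor": field, "value": value})

    for r in rows:
        print(r["descriptor"], "->", r["value"][:20])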
[0086] Concurrently with the collection of the information, at run time or at subsequent times when the source data exists in the key-value store, the source data may be parsed for acquisition of metadata. The acquired metadata are stored in the key-value store with a respective key for unique identification, a constituent data component value, and metadata descriptors. The metadata preliminarily derived may be saved as key-value logical rows in the key-value store, wherein the key-value logical rows may collectively form a data object associated with a source data. First name, last name, type of disease, date of financial transaction, and age are non-limiting examples of acquired data. Furthermore, derived metadata may be derived from other logical rows, including either source data, acquired data from the source data, or other derived data. In some embodiments, it may be derived from other information associated with a data object, rather than directly from the existing data asset. The metadata associated with derived logical rows are stored in the key-value store as part of the key-value logical rows with the logical row unique identifier (such unique identifier being a unique key), a data object identifier, and constituent data component value.
[0087] As will be discussed below, in some embodiments, system 100 may be further configured such that metadata may be employed to formulate and output a context- and/or requestor-specific dataset. For example, a dataset may be generated from a key-value store by accessing only obfuscated logical rows, as well as other rows satisfying lower-sensitivity (or other access) criteria; accordingly, a derived dataset that is separate from the source data, or even from the key-value store data, is specifically produced for a certain context - and that context may be determined or created by generating specific types of logical rows based on predetermined metadata. Another example may include a patient dataset where a derived logical row includes an age range, or the first three digits of a postal code, and the resulting derived dataset is generated by accessing all non-identifying information regarding disease types and outcomes for a group of patients along with the aforementioned derived logical row. Without providing access to the source data, an analysis of the dataset can be performed, wherein disease frequency by age or location can be assessed without giving any direct access to sensitive information. As the logical rows can be generated before ingestion for automatic curation or after for more customised curation, dataset creation can be dynamic and compliant irrespective of the type of information stored regarding data objects.
[0088] In some embodiments, a key-value store paradigm, such as Apache Accumulo, may be used to provide granular access control to the data. The use of a key-value store, such as Accumulo, provides cell-level security with a visibility field in the key. The key-value store paradigm is a data model that stores source data in a key-value pair and metadata values in the same logical row as additional key-value pairs. The column visibility field is used to store data attributes related to governance or compliance rules specified by the user and/or data owner.
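By way of non-limiting illustration only, the following sketch emulates the evaluation of a cell-level visibility expression in the style of such a visibility field. This is not the Accumulo API; real Accumulo visibility expressions also support parentheses and nesting, whereas this simplified emulation handles only flat any-of and all-of expressions.

    # Simplified emulation of cell-level visibility evaluation (not the
    # Accumulo API; flat expressions only, for illustration).
    def visible(expression, authorisations):
        """Evaluate "PHI|DEIDENTIFIED" (any-of) or "PHI&AUDITOR" (all-of)
        against a requestor's set of authorisations."""
        if "&" in expression:
            return all(tok in authorisations for tok in expression.split("&"))
        return any(tok in authorisations for tok in expression.split("|"))

    print(visible("PUBLIC|DEIDENTIFIED", {"DEIDENTIFIED"}))  # True
    print(visible("PHI&AUDITOR", {"PHI"}))                   # False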
[0089] In some embodiments, a constituent data component value (e.g. genuine data component values 120) may comprise stored digital information directly, or point to a location in storage where the digital information is stored. In some embodiments, the metadata descriptors may be formed in response to the data access request. In some embodiments, the data access request may comprise pre-determined metadata descriptors and new metadata descriptors specified either by a system administrator or an end-user (i.e. a request for a specific use and/or context). In some embodiments, the pre-determined metadata descriptors are the result of processing the source data; these functions are sometimes referred to as data processing functions (DPF). Each data processing function may be associated with a specific timestamp or version for all of the components that result from the processing. This associated timestamp may be included in the key-value store, and may be similar to a version control feature. In some embodiments, this version control feature can allow for version roll-back to a previous processed state and/or specific application of rules or data management of a processed dataset. Such timestamps can provide a mechanism to assess how a dataset changed over time, as the state of the dataset can be assessed as it was at any point in time.
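By way of non-limiting illustration only, the following sketch shows how timestamped logical rows can support the version roll-back described above: the state of a component at a given moment is the newest row at or before the requested time. The timestamps and DPF labels are assumptions of this illustration.

    # Sketch of timestamped logical rows enabling version roll-back
    # (timestamps and DPF version labels are illustrative).
    rows = [
        {"descriptor": "age_range", "value": "30-34", "ts": 100, "dpf": "v1"},
        {"descriptor": "age_range", "value": "30-35", "ts": 200, "dpf": "v2"},
    ]

    def as_of(rows, descriptor, ts):
        """Return the newest matching row at or before time `ts`."""
        candidates = [r for r in rows if r["descriptor"] == descriptor and r["ts"] <= ts]
        return max(candidates, key=lambda r: r["ts"]) if candidates else None

    print(as_of(rows, "age_range", 150)["value"])  # '30-34' (state before v2 ran)
    print(as_of(rows, "age_range", 250)["value"])  # '30-35'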
[0090] In some embodiments, the data can be accessed directly through an application programming interface (API), which can be a set of routines, protocols, and/or tools for building software applications. These direct access requests may occur through a library call for programmatic access in data science or a call through a representational state transfer (REST) API when accessing the data for an application. A query using these examples of direct data access may trigger a distributed routine to collect the data across various nodes. In another embodiment, the data may be accessed through a manufactured dataset, and may use the distributed computing capability of various software tools (Accumulo, Spark, or the like) on the cluster to create batch jobs that use metadata descriptors to assemble the necessary dataset and to generate the dataset into the format requested. In some embodiments, this dataset may be exported to a specified location to meet governance, privacy and/or compliance requirements.
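By way of non-limiting illustration only, the following sketch shows the shape of such a direct REST access. The endpoint path, query parameters, and token placeholder are hypothetical; they illustrate the pattern rather than any specific product API.

    # Sketch of direct access through a REST API (the URL, parameters, and
    # token below are hypothetical placeholders, not a documented API).
    import requests

    resp = requests.get(
        "https://governance.example.org/api/v1/datasets",   # hypothetical URL
        params={"descriptors": "age_range,postal3", "format": "csv"},
        headers={"Authorization": "Bearer <access-token>"},
    )
    resp.raise_for_status()
    print(resp.text[:200])  # first rows of the assembled, context-specific dataset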
[0091] The process of authorisation regarding data access requests may be simplified for administration through the use of tags, attributes, and expressions, which may provide administrators with the ability to specify tags, attributes or expressions on the data at a high level. For example, using the Accumulo software will provide users with a visibility field that allows the use of arbitrary attributes such as PHI, PUBLIC, and DEIDENTIFIED, which can then be assigned to users/groups for authorisation. In addition, the use of directory servers, such as Microsoft’s Active Directory (AD), may enable a user to link users/groups to authorisations. In one exemplary embodiment, a customer may define a rule for a group called “researchers” in a specified AD location, such as “researcher authorisation allows you to see data with attributes PUBLIC and DE-IDENTIFIED”. The Accumulo infrastructure allows user attributes identified for users/groups to be defined and used in the same way; this attribute-based access control would authorise users/groups/AD with particular attributes to access data with particular attributes. In addition, there is a priority order of evaluation for rules in the case where the administrator specifies several rules that overlap.
[0092] In some embodiments, authorisation decisions may be made via a policy engine, such as the open-source Open Policy Agent (OPA) engine, or similar. In such cases, directories may be used only as sources of information or data, and/or as an authentication server (i.e. for validating username/password combinations). Thus, the policy engine, when combined with, for example, the Accumulo infrastructure, may be configured to authorise access by assessing a pre-defined policy (e.g. a rule set) based on attributes of the data requestor making the request (i.e. name, group memberships, security clearance, location, etc.), and metadata associated with the data to be accessed (i.e. sensitivity, data owner’s name, time of acquisition, etc.).
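By way of non-limiting illustration only, the following sketch shows a policy-engine style decision combining requestor attributes and data metadata. It is a simplified stand-in for such an engine expressed in plain Python, not OPA's Rego policy language or its API; the attribute names are assumptions of this illustration.

    # Sketch of an attribute-based authorisation decision (a simplified
    # stand-in for a policy engine; attribute names are illustrative).
    def authorise(requestor, data_meta, rules):
        return any(rule(requestor, data_meta) for rule in rules)

    rules = [
        # "researcher authorisation allows you to see data with attributes
        # PUBLIC and DE-IDENTIFIED" expressed as a predicate:
        lambda req, meta: "researchers" in req["groups"]
                          and meta["sensitivity"] in {"PUBLIC", "DE-IDENTIFIED"},
    ]

    requestor = {"name": "jdoe", "groups": ["researchers"], "clearance": "low"}
    print(authorise(requestor, {"sensitivity": "DE-IDENTIFIED"}, rules))  # True
    print(authorise(requestor, {"sensitivity": "PHI"}, rules))            # False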
[0093] In accordance with one aspect, the employment of a key-value store permits the storage and operation on at least four types of data, collected or derived, when a source data is received or exists in the key-value store: metadata descriptive of the source data (e.g. the source data file itself, file name, file type, file size, etc.); metadata derived from the source data (e.g. patient name data from the corresponding patient name field within the source data file); metadata derived from the preliminarily derived metadata (e.g. a predetermined category, such as age group, where the value for such derived logical row is determined from another existing logical row where the metadata descriptor is age); and governance metadata (e.g. retention policies, authorisation, owner, etc.). In some examples, the metadata derived from the source data may be referred to as the tokenisation of the original data; this refers to any operation on data associated with a data object, including other logical data rows, in order to protect, analyse, or generate new data from the existing source data or generated data at a granular level. This tokenisation can include obfuscation, aggregation, computation, and the application of filters. Employing the metadata, the key-value store therefore allows formulation of datasets and access thereto based on context- and requestor-specific characteristics.
[0094] Each key-value logical row may be assigned a unique key for identification. In some embodiments, all key-value logical rows associated with a given set of source data may be assigned a unique key for identification. In some embodiments, all key-value logical rows associated with a data object may be assigned a unique key for identification. In other words, in some embodiments, when an example of a system as herein disclosed stores source data, it may assign a unique key identifier, grouping the metadata associated with the source data as a single logical entity, or grouping the metadata associated with a data object associated with at least one source data as a single logical entity. Each collected or derived datum, with its unique key, associated metadata descriptors, and corresponding constituent data component value, is stored as a key-value logical row in the key-value store. In some embodiments, examples of the metadata descriptors for each collected or derived datum include an accessibility authorisation and/or sensitivity descriptor, time-sequenced information, and temporal-/locality-based associations.
[0095] Since key values can be used for, among other reasons, identifying, locating, and securing access to data objects, data can be indexed and accessed based on the existence of certain metadata: (1) data can be quickly accessed and located based on the existence of specified metadata within the key-value store; (2) derived datasets can be generated directly from the key-value store; and (3) regulatory and administrative compliance can be enforced at a data storage layer (as opposed to at an application layer).
[0096] In some embodiments, the plurality of data storage resources 112 are in network communication and are configured for distributed storage of a plurality of data objects, wherein each said data object comprises a plurality of constituent data object components. An example of the plurality of data objects includes a set of data related to or derived from either unstructured or structured data received by the system as an input. A constituent data object component includes each set of data that forms a part of the data object, and may be generated automatically, derived under system command, or formulated based on unique requests.
[0097] In some embodiments, one or more digital processors 104 have a data object key-value store accessible thereto, wherein the data object key-value store stores a unique key-value logical row for each constituent data object component. In other words, each constituent data object component is stored in the data object key-value store as a unique key-value logical row.
[0098] Furthermore, each key-value logical row comprises: a key for uniquely identifying the key-value logical row; a constituent data object component value for providing component information relating to the constituent data object component associated with the key-value logical row; and a metadata descriptor for describing a data object component characteristic of the constituent data object component value. An example of a key for uniquely identifying the key-value logical row includes a unique identifier for all the data generated, derived, or formulated from an input received. An example of the constituent data object component value may entail actual values for a given constituent data object component, while an example of a metadata descriptor may include, for instance, name or age.
[0099] The system may derive at least one of the constituent data object components. The system may further employ at least one of the constituent data object component values and derive at least one constituent data object component. In other words, the system may preliminarily derive constituent data object components. Then, using the values of the preliminarily derived constituent data object components, the system may further derive other constituent data object components. This operation may be performed by the system upon requests to the processing component, wherein the request triggers access to constituent data object component values comprising metadata descriptors.
[00100] In some embodiments, each key-value logical row embeds additional management information, such as an access authorisation value for restricting access to the constituent data object component values, in response to requests associated with a corresponding authorisation. This access authorisation value can also be a sensitivity tag or other compliance and/or governance information and/or timestamp information. The access authorisation value or sensitivity tag can correspond with a user identity, user role and/or a user group, restricting access to the constituent data object component values. Some examples of restricted constituent data objects may include patient records, financial data, or proprietary, confidential or sensitive data. Some examples of user roles, user identities, or user groups may include doctors, researchers, banks, administrators, and underwriters. In some embodiments, the restriction of the constituent data object component values will be based on governance and/or compliance rules, such as data retention, storage requirements, and data ownership. In other embodiments, rules associated with timestamp information or version control information can be used to restrict access to the constituent data objects. Some examples of using timestamp information may include restricting access to the most recent version of constituent data objects, or limiting access to older versions of constituent data objects.
[00101] In another exemplary embodiment, at least one of the constituent data object components for a given key-value logical row is derived from the input source data automatically upon storing the source data associated with the data object in the data storage components. In one embodiment, the derived datasets may be associated with a set of pre-determined rules, or data processing functions (DPF), which can be used to produce metadata descriptors for the source data or to add timestamp information or version control. The derivation may take place under pre-determined requests, under data access requests, or by a system administrator, either at run time or at subsequent times.
[00102] In another embodiment, these rules can be created during ingestion of the data or after the data has already been ingested. In some embodiments, these data processing functions (DPF) are developed using a general-purpose programming framework, such as Spark and/or MapReduce, which enables curation functions to be run across the constituent data objects.
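By way of non-limiting illustration only, the following sketch shows one possible shape of such a DPF as a Spark job that derives a curated component and stamps version information. The column names, reference values, and version labels are assumptions of this illustration.

    # Sketch of a data processing function (DPF) as a Spark job deriving a
    # curated component (column names and labels are illustrative assumptions).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dpf-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("patient-1", 1988), ("patient-2", 1971)], ["object_id", "birth_year"])

    # Derive a coarse, obfuscated "birth_decade" component and stamp the DPF
    # version and timestamp so the resulting rows can be audited or rolled back.
    curated = (df
        .withColumn("birth_decade", F.floor(F.col("birth_year") / 10) * 10)
        .withColumn("dpf_version", F.lit("v1"))
        .withColumn("dpf_timestamp", F.current_timestamp()))
    curated.show()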
[00103] In accordance with one aspect, there is disclosed a data storage system for generating a context-specific dataset based on a source dataset. A source dataset 114 may include different formats of documents that may be provided to the data storage system. As will be discussed further below, a context-specific dataset may be generated based on the source dataset 114, in accordance with specific requisitions made of the data storage system.
[00104] In some embodiments, the plurality of network-accessible hardware storage resources is in network communication and configured for distributed storage of data objects. The data objects may include any type of data obtained, derived, formulated, and/or related to the source data itself upon the receipt of the source data by the data storage system. The digital data processor responds to data access requests received over a network relating to the data objects. Data access requests related to data objects stored in the data storage system may come from end-users. The key-value store is stored in the hardware storage and may be composed of a unique key-value logical row for each constituent data component of each of the data objects in the data storage system. In accordance with one aspect, a data storage system may contain a number of data objects, which may be composed of constituent data components related to a source dataset. In some embodiments, a set of data objects, or a data object, may be related to a source dataset provided to the data storage system. The data object may be composed of constituent data components that were received, derived, or formulated at the time of, or subsequent to, the receipt of the source data at the data storage system. These constituent data components may include various characteristics and/or information related to the source data itself, the data derived from the source data, or the data formulated from the source data or derived from the source data under given requisitions.
[00105] Each said unique key-value logical row may comprise a key for uniquely identifying the unique key-value logical row, a constituent data component value, and/or a metadata descriptor. In some embodiments, the key for unique identification of the unique key-value logical row may be a value comprising stored digital information. In some embodiments, the key may be formulated from the constituent data component associated with the key-value logical row and a metadata descriptor. In some embodiments, the key may be a combination or combinations of constituent data component values and metadata descriptors. The constituent data component values may comprise stored digital information relating to the constituent data component associated with the unique key-value logical row. This digital information may be a value directly obtained, derived, or formulated from the source data received. In some embodiments, the digital information may store a value indicative of the location where the actual value is stored. Examples of the digital information include an actual first name (e.g. John) and a pointer value to a designated location in a data storage. The metadata descriptor may describe metadata of the constituent data component value. Metadata generally comprises data that provides information about other data. In some embodiments, metadata describes a resource for purposes such as discovery and identification, including elements such as title, abstract, author, and keywords.
[00106] In accordance with one aspect, metadata describes containers of data and indicates how compound objects are put together, non-limiting examples of which may include types, versions, relationships, and/or other characteristics of digital materials. In some embodiments, metadata provides information to help manage a resource, such as when and how it was created, a file type or other technical information, and/or who can access it.
[00107] In accordance with one aspect, at least one key-value logical row for a given data object is directly associated with the dataset, and at least one key-value logical row for the given data object is derived from one or more other key-value logical rows. Examples of directly associated key-value logical rows may include the data obtained at the time of the receipt of the source data, such as file name and file type, and the data derived at run time or at subsequent times, such as first name and last name. In some embodiments, key-value logical rows derived from one or more other key-value logical rows may be derived based on end-user requisitions. In some embodiments, the key-value logical row derived from one or more other key-value logical rows may be derived based on input from a data administrator of the data storage system.
[00108] In accordance with one aspect, in response to a given data access request based on a given metadata descriptor, one or more digital data processors 104 generates an independent dataset via a key-value store by accessing those key-value logical rows having metadata descriptors responsive to the data access request. In some embodiments, the given metadata descriptor may be pre-determined by the system administrator or customised metadata provided by the end-user. In some embodiments, the metadata descriptors may include metadata descriptors created at run time, at subsequent times when the key-value logical rows were derived or formulated, or when a requisition based on the given metadata descriptor is made.
[00109] In some embodiments, a key-value logical row comprises an access authorisation value for restricting access to the corresponding key-value logical row. In accordance with one aspect, the access authorisation value may be stored digital information. In accordance with one aspect, the access authorisation value may be a combination or combinations of constituent data component values and metadata descriptors. In some embodiments, the access authorisation value may be employed for generation of the independent dataset, in response to a given data access request, allowing control over the information accessed and the independent dataset generation.
[00110] Examples of factors that may be associated with the access authorisation include a requesting user identity, a requesting user role, a requesting user group, the constituent data component of the corresponding key-value logical row, the source datasets from which the corresponding key-value logical row originated, and/or the metadata descriptor of the corresponding key-value logical row. In some embodiments, access authority may be determined at the time of the data access request, or at subsequent times, based on the above-noted factors.
[00111] In accordance with one aspect, at least some of the key-value logical rows are automatically generated from the source dataset upon importing such source dataset 114 into the data storage system. Examples of the key-value logical rows that are automatically generated from the source dataset upon importing the source dataset into the data storage system include file name and file type. In some embodiments, some of the derived key-value logical rows are derived upon a request for such derivation by a user of the data storage system. Examples of the key-value logical rows that are derived upon a request for such derivation by a user of the data storage system include first name and last name. In some embodiments, an additional key-value logical row is derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, the corresponding metadata descriptor of the additional key-value logical row being generated based on such obfuscation. Examples of obfuscation may include a deliberate obscuring of a birth date (e.g. a birth date of April 11, 1988 is obfuscated to an age of 30 to 35 years old, or a birth year of 1988, or the like) so as not to disclose the precise age, while still allowing other key-value logical rows related to the same source data to be associated with the age key-value logical row and made available, subject to access authority, within a generated independent dataset.
[00112] In accordance with one aspect, an additional key-value logical row may be derived by aggregating the constituent data component values of at least two existing key-value logical rows to generate the constituent data component value of the additional key-value logical row, the corresponding metadata descriptor of the additional key-value logical row being generated based on such aggregation. Examples of aggregation include aggregating first name and last name to formulate an additional key-value logical row, with the corresponding metadata descriptor “name”. In some embodiments, examples of aggregating include aggregation of key-value logical rows related to a data object associated with a source data.
[00113] In accordance with one aspect, an additional key-value logical row may be derived through a function-based calculation based on the constituent data component values of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, the corresponding metadata descriptor of the additional key-value logical row being generated based on said function-based calculation. Examples of said function-based calculation may include a decision-making scheme, mathematical function, or other rules used to produce an additional key-value logical row (and the corresponding metadata) based on existing key-value logical rows.
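By way of non-limiting illustration only, the following sketch shows the three derivation patterns described above - obfuscation, aggregation, and function-based calculation - applied over existing logical row values. The descriptor names, the reference year, and the age-banding scheme are assumptions of this illustration.

    # Sketch of the three derivation patterns: obfuscation, aggregation, and
    # function-based calculation (descriptor names are illustrative assumptions).
    existing = {
        "birth_date": "1988-04-11",
        "first_name": "John",
        "last_name":  "Smith",
        "weight_kg":  82.0,
        "height_m":   1.80,
    }

    derived = {}
    # Obfuscation: deliberately coarsen a birth date into an age range.
    age = 2022 - int(existing["birth_date"][:4])
    derived["age_range"] = f"{(age // 5) * 5}-{(age // 5) * 5 + 4}"
    # Aggregation: combine two existing values into one derived value.
    derived["name"] = f'{existing["first_name"]} {existing["last_name"]}'
    # Function-based calculation: compute BMI from two existing values.
    derived["bmi"] = round(existing["weight_kg"] / existing["height_m"] ** 2, 1)

    print(derived)  # {'age_range': '30-34', 'name': 'John Smith', 'bmi': 25.3}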
[00114] As will be discussed further below, in some embodiments, additional key-value logical rows may be derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row, by obfuscation. The access authorisation value of the additionally derived key-value logical row may be the same as that of the existing key-value logical rows from which the additional key-value logical row was derived, or different. In some embodiments, the access authorisation for additional key-value logical rows may be pre-determined in association with one or more of the following: a requesting user identity, a requesting user role, a requesting user group, the constituent data component of the corresponding key-value logical row, the source datasets from which the corresponding key-value logical row originated, and/or the metadata descriptor of the corresponding key-value logical row. In some embodiments, the access authorisation for an additional key-value logical row may be determined by the system administrator or the data access requestor.
[00115] In one exemplary embodiment, there is provided a system that consists of two manager nodes plus Hadoop-based cluster nodes, wherein each Hadoop-based cluster node in this exemplary system may comprise computing devices that may be classified as either or both Hadoop master nodes and Hadoop data nodes. It should be noted that, in other embodiments, there may be one manager node, or a plurality thereof; in either case, the manager node functionalities described below may be carried out by a single manager node, or distributed in various manners across the plurality, and the subject matter hereof is not limited to systems with two manager nodes. A manager node may carry out the following functions: runs any centralised applications that manage the data storage and access functions (including management of the key-value store); provides the web and/or other (e.g. REST) interface for data administration, privacy, security, and governance functions; hosts any web, proxy, or other server functionality (e.g. NGINX); manages and runs the master key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); runs data analysis or data function applications or libraries (e.g. the PHEMI Data Science Toolkit, Spark, and Zeppelin); manages and runs the slave key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); and hosts backup components for any other manager node in case of critical failure thereof.
[00116] Returning to Figure 1, and in accordance with some embodiments, a data requestor or consumer 110 may include any individual or institution requesting access to source dataset 114. This may include, for example and without limitation, a data researcher desiring access to a source dataset 114 for research purposes, or the like.
[00117] In some embodiments, the plurality of data storage resources 112 may be configured to leverage the capability of a key-value store. Some embodiments may utilise one or more hardware storage devices 112, each of which may in turn comprise storage sub-elements (e.g. a server comprising a plurality of storage blades that each in turn comprise multiple storage elements of the same or different types, such as flash or disk). Very large datasets may be distributed amongst many different local or remote storage elements; they may be closely stored (e.g. on the same device or on directly connected devices, such as different blades on the same server), or they may be highly disparately and remotely stored (e.g. on different, but networked, server clusters). Furthermore, the data stored may be duplicated for a number of reasons, including redundancy and failure handling, as well as efficiency (e.g. to store a copy of information that has been recently used "close" to other required data). Systems and methodologies for managing such large and complex datasets have been developed (e.g. HDFS for Hadoop™). Thus, in some embodiments, each of said storage resources 112 is generally configured for storing, accessing, and using very large datasets using a key-value store to ingest and store data from a source dataset 114 (e.g. a patient or financial record) in a highly granular fashion.
[00118] In some embodiments, the data can be accessed directly through an application programming interface (API), which can be a set of routines, protocols, and tools for building software applications. These direct access requests may occur through a library call for programmatic access in data science, or through a call to a representational state transfer (REST) API when accessing the data for an application. A query using these examples of direct data access may trigger a distributed routine to collect the data across various nodes. In another embodiment, the data may be accessed through a manufactured dataset, using the distributed compute capability of software tools, such as Accumulo and/or Spark, on the cluster to create batch jobs that use metadata descriptors to assemble the necessary dataset and to generate said dataset in the format requested. In some embodiments, this dataset may be exported to a specified location to meet governance, privacy and/or compliance requirements.
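Purely as a hypothetical illustration of such direct access, the following sketch issues a REST-style request from a client; the base URL, route, query parameters, and bearer-token scheme shown are invented for this example and do not reflect any particular deployment's API.

```python
import requests

# Hypothetical REST endpoint for a direct data access request; a real
# deployment would define its own routes and authentication scheme.
BASE_URL = "https://data-platform.example.org/api/v1"

resp = requests.get(
    f"{BASE_URL}/datasets/cholesterol/records",
    params={"fields": "age,ldl,hdl", "format": "csv"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
records = resp.text  # the assembled, governance-filtered dataset
```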
[00119] As shown in Figure 2A, once a data access request is received from data requestor 110 for any data of said source dataset 114, requestor 110 may be given, instead of raw source data, a view of or access to a context-specific dataset 206, wherein the context-specific dataset 206 is generated (or pre-generated) based on a permitted access privilege 208, such as a trust factor associated with the requestor, and/or a likelihood of re-identification 210 of the constituent data element values of the data objects contained in the context-specific dataset 206.
[00120] Generally, context-specific dataset 206 comprises one or more data objects (e.g. context-specific data objects 220) that are representative, at least in part, of a source dataset 114, but wherein the constituent data element values therein are restricted, removed, replaced or obfuscated based on a permitted access privilege (e.g. a permitted re-identification risk value 208 specific to this particular data usage instance of data requestor 110 characterising the trust level, and/or an estimated likelihood of re-identification 210 that a given accessed data object can be associated with one of the identifiable data subjects 122). As shown in Figure 2B, each data object therein (e.g. context-specific data object 220) may comprise a combination or interleaving 222 of different types of data component values, including genuine data component values 120 originally included in source dataset 114, but also de-identified data component values 260 and/or synthetic data component values 270, both of which are, at least in part, derived from said genuine data component values 120, as will be discussed below. In some embodiments, the relative number (e.g. a designated number, amount, fraction, type, quantity, or the like) of genuine data component values 120, de-identified data component values 260 and/or synthetic data component values 270 in said context-specific dataset 206 may have an influence on its associated estimated likelihood of re-identification 210. Further, it will be appreciated that some source data component values may only be accessible to certain requestors 110, and/or roles or re-identification risk factors 208 thereof. For example, medical images may only be accessible to a doctor (rather than, for instance, a hospital administrator), or to a data scientist with a designated re-identification factor and/or who is requesting data in the final stages of a research lifecycle.
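As a minimal sketch of such an interleaving 222, the following Python fragment assembles a column by drawing each entry from a genuine, de-identified, or synthetic pool according to a designated mixture; the pool contents, field values, and fractions are illustrative assumptions only.

```python
import random

def interleave_values(genuine, deidentified, synthetic, mix):
    """Assemble a context-specific column by drawing each entry from the
    genuine, de-identified, or synthetic pool according to `mix`, a dict of
    fractions summing to 1 (illustrative sketch only)."""
    pools = {"genuine": genuine, "deidentified": deidentified, "synthetic": synthetic}
    kinds = list(mix.keys())
    weights = list(mix.values())
    column = []
    for i in range(len(genuine)):
        kind = random.choices(kinds, weights=weights)[0]
        column.append(pools[kind][i])
    return column

# A low-trust requestor might receive mostly synthetic values:
col = interleave_values(
    genuine=[5.2, 6.1, 4.8],
    deidentified=[5.0, 6.0, 5.0],   # rounded / generalised values
    synthetic=[5.4, 5.9, 4.6],      # drawn from a learned distribution
    mix={"genuine": 0.0, "deidentified": 0.3, "synthetic": 0.7},
)
```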
[00121] In some embodiments, a data usage instance refers to the circumstances, requirements, restrictions, and characteristics associated with a use or analysis of a context-specific dataset (and/or the underlying source data). While the data analysis task and even the underlying source data and/or context-specific dataset may remain similar from one institution to the next, the data usage instance for each may be very different. For example, different privacy requirements may apply due to different legislative, regulatory, or policy requirements; different data sensitivity may apply for certain groups of individuals in different institutions; different research ethics boards may impose different requirements over the analysis or use of the data; and different persons or organizations may be carrying out the data analysis or use, in respect of which different data usage controls or restrictions may be in place. The particular combination of such circumstances, including, for example, requirements, restrictions, and characteristics associated with a given data analysis, at a given institution (or group or class of institutions), in association with particular source data at a particular time, as well as other such factors, together form a data usage instance, including an actual and permitted risk of re-identification associated therewith. The foregoing examples are provided so as to illustrate different possible circumstances that would give rise to a particular data usage instance; different factors may apply (and the foregoing may not) so as to give rise to such a particular data usage instance. In addition, a data usage instance may refer to a single data request or, to the extent that a similar set of applicable circumstances apply, a plurality of data requests.
[00122] In some embodiments, a data subject may refer to a person, place, thing, or set of conditions and/or characteristics to which a data object applies. For example, it may refer to an individual (e.g. a patient, customer, insured individual, bank customer); a legal person or association (e.g. a business, corporation, individual, joint venture, etc.); a set of circumstances (e.g. weather conditions at a particular time and place); or another tangible, intangible, or ephemeral person, place, or thing, or characteristics associated therewith.
[00123] In some embodiments, de-identified data component values 260 may be values derived from corresponding genuine data component values 120, whereby the information has been obfuscated, at least in part, so as to render it more difficult to identify the person or entity it pertains to. Different levels of de-identification may be applied based, at least in part, on a target value of the estimated likelihood of re-identification 210. In some embodiments, stronger de-identification methods may reduce the precision of the original source data component value. In accordance with some embodiments, such de-identification may relate to differential privacy processes. [00124] In some embodiments, synthetic data component values 270 are fictitious data component values preserving, at least in part, one or more relationships between the corresponding genuine data component values 120 from which they are derived. In some embodiments, this may include fictitious non-numerical values, including textual values (e.g. names, addresses, etc.), that are representative in some way of the corresponding genuine data component values in source dataset 114 (e.g. a conventionally male name, an address from a related area or zip code, or the like). In some embodiments, tabulated numerical values (e.g. medical test values, account numbers, etc.) may be generated that are representative of one or more statistical relationships between the corresponding genuine data component values. For example, this may include, in some embodiments, learning the joint probability distribution of at least a portion of the genuine data component values 120, and generating therefrom corresponding synthetic data component values 270 having the same or a comparable and/or related probability distribution.
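A minimal sketch of this approach follows, assuming purely numeric component values and deliberately substituting a simple multivariate-normal model for the richer generative models an embodiment might employ; the example data are invented.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Genuine numeric component values; columns might be, e.g., age and a test value.
genuine = np.array([[54, 5.2], [61, 6.1], [47, 4.8], [58, 5.6], [50, 5.0]])

# Fit a simple parametric joint model (multivariate normal) to the genuine data.
mean = genuine.mean(axis=0)
cov = np.cov(genuine, rowvar=False)

# Sample synthetic component values with a comparable joint distribution;
# correlations between the columns are approximately preserved.
synthetic = rng.multivariate_normal(mean, cov, size=5)
```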
[00125] In some embodiments, synthetic data component values 270 may be pre-generated based on said genuine data component values 120 before the data access request is received. In some cases, this pre-generation may be performed before the context-specific dataset 206 is generated. In accordance with different embodiments, different types of synthetic data component values may be generated, including text and numerical data. It will be appreciated that various methods or techniques for generating distributions or ensembles of such synthetic data element values may be used, without restriction. Such methods may include, for instance, machine learning (ML) methods, and/or generative models such as Generative Adversarial Networks (GANs).
[00126] In some embodiments, when interleaved de-identified data component values 260 and synthetic data component values 270 are present in a context-specific dataset 206, the likelihood of re-identification 210 for the dataset may be estimated only from the de-identified component values 260 as, by themselves, synthetic data component values 270 may have no re-identification risk associated therewith. However, in some embodiments, the determination or estimation of the likelihood of re-identification 210 of the context-specific dataset 206, or elements thereof, may also take into account, at least in part, the presence or number of synthetic data component values therein. Thus, in some embodiments, the number of synthetic data component values 270 in the context-specific dataset 206 may be based on at least one of the following: the permitted re-identification risk value 208 (e.g. trust level between data owners and requestors), and the estimated likelihood of re-identification 210.
[00127] In some embodiments, access to said context-specific dataset 206 may employ one or more of various authorisation approaches, a non-limiting example of which may include Attribute-based Access Control (ABAC), which has been shown to be highly flexible and scalable.
[00128] In some embodiments, the context-specific dataset 206 based on or derived from source dataset 114 may be generated upon receipt of data access requests by end-users (e.g. data requestor 110). Examples of the data access requests may include specific requests for age range data for all of the data objects in the data storage. Examples of network-accessible hardware storage resources may include spinning disks connected for distributed data storage. A method for generating context-specific datasets 206 may comprise storing a key-value store in one or more said hardware storage resources, directly generating at least one of the key-value logical rows for a given data object from source data (e.g. raw personal data), deriving at least one key-value logical row for the given data object from other key-value logical rows, and generating, in response to a data access request and based on one or more metadata descriptors, an independent dataset via the key-value store by accessing those key-value logical rows having said metadata descriptors. In some embodiments, the key-value store may comprise a unique key-value logical row for each constituent data component of each data object. Constituent data components of each data object, with each data object related to at least one source data, may include information about the source data, such as a file name and file type, information derived from the source data, such as first name and last name, and information formulated through aggregation, employing function-based calculations, or responding to data access requests. Each key-value logical row may comprise a key for uniquely identifying the key-value logical row, a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row, and a metadata descriptor describing metadata of a data component value. The key for unique identification may be stored digital information, which may be a combination or combinations of constituent data component values and metadata descriptors describing metadata of a data component value. The constituent data component may be an actual value or a pointer to the location of the storage where the actual value is stored. The key, the constituent data component, and/or the metadata descriptor may be created, derived, or formulated at run time, at subsequent times, pre-determined, upon data access requests, and/or by a system administrator, in accordance with various embodiments.
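For illustration only, a key-value logical row of the kind described above might be modelled as in the following Python sketch; the field names and example values are hypothetical and merely mirror the description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class KeyValueLogicalRow:
    """Illustrative shape of a key-value logical row (hypothetical)."""
    key: str                  # stored digital information uniquely identifying the row
    value: Any                # constituent data component value, or a pointer to it
    metadata: Dict[str, str] = field(default_factory=dict)   # metadata descriptor
    authorisation: str = ""   # access authorisation value restricting access

row = KeyValueLogicalRow(
    key="record17:first_name",
    value="Alice",
    metadata={"source_file": "intake.csv", "field": "first_name"},
    authorisation="role:clinician",
)
```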
[00129] In accordance with one aspect, at least one of the key-value logical rows for a given data object may be derived from other key-value logical rows. The derivation may take place under pre-determined requests, upon data access requests, or by a system administrator, whether at run time or at subsequent times. In some embodiments, a data access request is a request for data, which may be automatic, pre-determined, or user-specific. For example, the data access request may be made by an end user or system administrator. In another example, the data access request may be received at run time, or at subsequent times when the data object exists in the system.
[00130] In some embodiments, a same key-value logical row (e.g. names, addresses, test values, account values, etc.) for all of the context-specific data objects 220 may comprise only one of genuine data component values 120, de-identified data component values 260, or synthetic data component values 270. In some embodiments, a context-specific dataset 206 may comprise multiple key-value logical rows, each having values chosen or selected from the genuine data component values 120, de-identified data component values 260 and/or synthetic data component values 270.
[00131] In some embodiments, the risk or likelihood of re-identification 210 of a dataset or collection (e.g. context-specific dataset 206) may be assessed by determining the likelihood or probability that a given set, row, or value can be correlated to an identifiable individual or subject (e.g. identifiable data subject 122). In some embodiments, a given derived dataset can be associated with a risk or likelihood of re-identification 210, wherein such a risk provides an indication of a probability that any given data object within the key-value store that is made part of a derived dataset can be associated with an identifiable individual or subject to which the data object pertains. The higher such probability, the greater the re-identification risk indication. This risk indication may also be increased depending on the nature of the data object. For example, if the data object comprises sensitive personal information, such as, but not limited to, personal health or personal financial information, a factor associated with the risk of re-identification may be increased. In general, the risk of re-identification will decrease if information that is specific to an individual can be withheld from a dataset or obfuscated within a dataset. To the extent that this does not impact the informational value of a dataset, or minimally impacts the informational value of a dataset, the re-identification risk can be used to optimally provide informational value while protecting the identity of the subjects of the information within the dataset, in accordance with some embodiments.
[00132] In some such embodiments, the re-identification likelihood or risk 210 is a measurement of the likelihood that any data object, data component values, or a collection thereof, can be linked or associated with the subject or subjects to which it pertains (e.g. identifiable data subject 122). In accordance with some embodiments, the number of same or similar data components within a dataset or other collection (that may or may not refer to other subjects) can be used to provide such an assessment of re-identification risk. In some embodiments, the assessment can provide the k-anonymity property of a given dataset, although other methods of assessing re-identification risk that may be known to persons skilled in the art can be used, including t-closeness, l-diversity, and differential privacy. k-anonymity is a property of a given datum, or set of data (including one or more rows), indicating that such datum or set of data cannot be distinguished from k−1 corresponding data or sets of data; an assessment of k-anonymity may be applied in respect of a particular field or type of metadata in a dataset. The k-anonymity property of data is described in Samarati, Pierangela; Sweeney, Latanya (1998), "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalisation and suppression", Harvard Data Privacy Lab, which is incorporated herein by reference. t-closeness, l-diversity, and differential privacy utilise statistical models to provide an indication of similarity between given data components within a dataset that is used to calculate a risk of re-identification. See Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian (2007), "t-Closeness: Privacy beyond k-anonymity and l-diversity", ICDE, Purdue University; and Dwork, Cynthia (2006), "Differential Privacy", ICALP'06 Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II, pages 1-12, each of which is incorporated herein by reference.
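As a concrete, non-limiting illustration, the k-anonymity of a small collection of records with respect to a chosen set of quasi-identifier fields can be computed as the size of the smallest equivalence class; the field names and values below are invented for the example.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k-anonymity of `rows` with respect to the given
    quasi-identifier fields: the size of the smallest equivalence class."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(classes.values())

records = [
    {"age_range": "40-49", "zip3": "V5K", "ldl": 4.8},
    {"age_range": "40-49", "zip3": "V5K", "ldl": 5.1},
    {"age_range": "50-59", "zip3": "V6B", "ldl": 6.0},
    {"age_range": "50-59", "zip3": "V6B", "ldl": 5.6},
]
k = k_anonymity(records, quasi_identifiers=["age_range", "zip3"])  # k == 2
```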
[00133] In some embodiments, a risk or likelihood of re-identification is assessed for a given dataset, wherein an acceptable threshold may be applied for a given dataset and/or in respect of a particular field or type of metadata within such dataset. For example, for a dataset comprising personal health information ("PHI") and non-PHI, a re-identification risk in respect of the PHI data may be provided for the dataset, as well as another re-identification risk in respect of the non-PHI data. In another example, for any value that is or includes PHI or other sensitive information (e.g. personal financial, insurance, or other sensitive information), different acceptable threshold risks of re-identification may be applicable than for datasets that do not include PHI.
[00134] In some embodiments, upon generating a derived dataset (e.g. context-specific dataset 206), a risk or likelihood of re-identification 210 may be determined for the derived dataset. In other embodiments, the re-identification risk 210 may be determined thereafter. Depending on the determined risk, as well as other factors, the dataset may be made available to particular users. This availability may be a function of the sensitivity of values in the dataset (e.g. whether it contains PHI or personal financial information ("PFI")), or the risk of re-identification (e.g. likelihood of re-identification 210), or the role or trust-level (e.g. permitted re-identification risk factor 208) of the person/entity to whom the dataset is being made available (e.g. physician, researcher, bank teller, etc.), or the nature of data availability (e.g. transmission of a new dataset, or access to a centralised repository), or the location of the user (e.g. remote laptop, remote server, server room, etc.), or a combination thereof.
[00135] In some embodiments, the re-identification risk or estimated likelihood of re-identification 210 may be associated with the concept of zones of trust, or location-based de-identification controls. In general, when datasets are de-identified, the dataset is sent to (or made available to) approved targets, without reference to the location of the target or the security features/risks associated with the target's location. This may expose a potential risk of re-identification. In some embodiments, a Risk Acceptability Threshold (RAT) may be used based on a determination of the specific risks associated with the circumstances of a data usage instance. For example, a data usage instance may relate to circumstances including a risk or sensitivity associated with the dataset, which may relate to one or both of a re-identification risk and/or the sensitivity of such data; an indication of user trust (e.g. the permitted re-identification risk value 208 relating to a level of authorisation or trust associated with a given user or entity in association with, in accordance with some embodiments, a sensitivity or sensitivities of the dataset); and/or a location-based and/or security-based risk assessment of the computing devices to which the dataset is to be provided, which may include associated or intermediary computing devices (e.g. if a computing device is highly secure, but data must be transmitted or conveyed thereto via less secure intermediary devices, this may be taken into consideration, in accordance with some embodiments).
[00136] For example, the RAT may be determined as Max(Dataset risk, User trust, Location controls). An exemplary process, in accordance with embodiments hereof, may include: (1) optionally first determining a RAT associated with a particular collection of data; (2) applying de-identification or obfuscation to specific fields in accordance with methods disclosed hereunder to generate a de-identified dataset; (3) calculating the risk for each record (e.g. data component) in the dataset using a re-identification risk calculation process (e.g. a k-anonymity determination algorithm); (4) applying a filter to the data to meet a designated Risk Acceptability Threshold; and/or (5) restricting the dataset destination to only those targets that meet the Risk Acceptability Threshold.
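A minimal sketch of steps (3) to (5) of this exemplary process might read as follows; the numeric risk values, the per-record risk inputs (which might, for instance, be 1/k from a k-anonymity calculation), and the Max-based RAT are illustrative assumptions only.

```python
def acceptable(record_risk: float, rat: float) -> bool:
    """A record is releasable when its risk does not exceed the RAT."""
    return record_risk <= rat

def apply_rat(records, risks, rat):
    """Filter records so the released dataset meets the designated Risk
    Acceptability Threshold (illustrative; `risks` holds a per-record
    re-identification risk for each entry in `records`)."""
    return [r for r, risk in zip(records, risks) if acceptable(risk, rat)]

# The RAT itself may be taken as the maximum of the contributing risk factors:
dataset_risk, user_trust_risk, location_risk = 0.01, 0.05, 0.10
rat = max(dataset_risk, user_trust_risk, location_risk)

released = apply_rat(records=["rec1", "rec2", "rec3"],
                     risks=[0.02, 0.20, 0.01],
                     rat=rat)
# Only rec1 and rec3 fall at or below the 0.10 threshold and are released.
```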
[00137] In accordance with some embodiments, the location-control indication may be a pre-determined value associated with specific types of locations, or it may be determined in an ad hoc manner based on access or security characteristics associated with a specific location. For example, if a given dataset is associated with a 10% RAT, the dataset could be restricted to locations that meet the necessary location-control indication. In such an example, a managing entity (e.g. PHEMI Central) may restrict target-locations such that data characterised with a 10% RAT can only be sent to a secure research environment, and not, for example, downloaded to a user's laptop. In contrast, another dataset that may be de-identified to a 1% RAT may, in accordance with some embodiments, be downloaded to a user's laptop. In some embodiments, the location-control indication may be associated with a "zone of trust" which, possibly based on its security and/or the ability of third parties to access it, may allow for the provision of more sensitive or risk-prone datasets. Such zones of trust may be determined in advance or dynamically depending on criteria relating to security or to indications of such security. Either such case (i.e. pre-determined or dynamically determined based on criteria and/or circumstances) may, in accordance with various embodiments, constitute a designated zone of trust.
[00138] In some embodiments, there are provided systems and methods for dynamically deriving additional data components associated with an existing dataset that modify a re-identification risk. For example, if a given dataset includes data components whose k-anonymity property (or other re-identification risk determination) indicates a risk that is too high for release to, or use by, a given user (or at a user location), additional data components may be derived for a different dataset that, while relating to the same data objects, increase the k-anonymity score. This might include replacing all data components appearing within the dataset that include an age with a data component that indicates a date range, as sketched below. While this may minimally reduce the informational effectiveness for a researcher, for example, it may nevertheless significantly reduce the re-identification risk. In some embodiments, the possible users, locations, and/or user-location combinations that can access or have the dataset delivered thereto may be accordingly increased. Since there is a metric (e.g. RAT) applied to dataset risk, user trust, and location-risk, the system can automatically derive further obfuscated data components for generating new datasets. In some embodiments, the user can indicate which fields should be preferentially obfuscated (or further obfuscated) so as to minimally impact informational effectiveness.
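By way of example only, the following sketch derives coarser data components by generalising exact ages into ranges, a transformation that would typically increase the k-anonymity of the resulting dataset; the bucket width and field names are arbitrary choices for illustration.

```python
def generalise_age(rows, bucket=10):
    """Replace exact ages with coarser ranges, deriving new data components
    that typically raise the k-anonymity of the resulting dataset."""
    out = []
    for r in rows:
        lo = (r["age"] // bucket) * bucket
        derived = {k: v for k, v in r.items() if k != "age"}
        derived["age_range"] = f"{lo}-{lo + bucket - 1}"
        out.append(derived)
    return out

rows = [{"age": 47, "zip3": "V5K"}, {"age": 43, "zip3": "V5K"}]
coarse = generalise_age(rows)   # both rows now share age_range "40-49"
```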
[00139] In some embodiments, selectively fulfilling a data request means that a request may or may not be fulfilled. The request may be fulfilled in some embodiments, for example, when a risk of re-identification, as indicated by the re-identification risk value associated with a data request, is lower than would be required under the circumstances. Such circumstances may include, but are not limited to: the types of sensitivity (which may be referred to in some cases as an authorisation level) associated with the data being returned in response to a data request; whether or not the request has originated from, or the data is being provided to or accessed from, a designated zone of trust; and/or the identity, role, or other characteristic of the individual or entity making the data request. Notably, selectively fulfilling a data request includes circumstances where the context-specific dataset may not be provided. In such cases, some, but not all, embodiments may result in further actions, including, but not limited to, dynamically creating new datasets based on other key-value logical rows that have been further obfuscated, dynamically creating new but further obfuscated key-value logical rows, or limiting distribution to (or access from) certain types of designated zones of trust.
[00140] In addition to the above-described features, system 100, in accordance with different embodiments, may be further configured to provide improved data management features for real, de-identified, and/or synthetic data. For example, this may include, without limitation:
• Data security - Making sure that data is protected from unauthorised access or modification;
• Access control - Ensuring that only authorised parties are able to work with a dataset;
• Audit - Recording all steps that transform data, as well as every attempt to access data;
• Provenance - Recording exactly how a dataset came to be (e.g. tracking ancestor datasets and the transformations made thereto to derive the current dataset);
• Versioning - Persisting earlier versions of files so that a user can choose a particular instance from a specific time;
• Copy control - Controlling the ability of an individual to make and distribute copies of the data;
• Expiry management - Automatic deletion of data after a designated time limit;
• Indexing - Creating indexes in data to speed up queries of large datasets;
• Cataloguing - Recording what data exists so that data consumers can find the right dataset;
• Data discovery - Helping data consumers find what data is available; and
• Metadata management - Managing data that describes and augments datasets, such as the time and location it was acquired. Metadata can describe data at varying levels of granularity (aggregate dataset, file, column, row, cell, etc.).
Examples
[00141] A first example of an evolving research program using system 100, in accordance with one embodiment, is schematically illustrated in Figure 3. This example shows an exemplary dataset related to cholesterol (e.g. a source dataset 114 comprising, for instance, real cholesterol values for various subjects) being used in a governed research study. In this embodiment, access to this dataset is given to a researcher (e.g. data requestor 110) and is managed via system 100. Beginning at the bottom of Figure 3, a researcher may at first only have access to low-risk synthetic data 302 (e.g. a context-specific dataset 206 comprising only synthetic data component values 270). The researcher may then transition, in accordance with the permission granted by the data owner of the exemplary dataset, through various stages of access to different datasets having associated therewith gradually higher privacy risks. Thus, for subsequent data access requests, the researcher may gradually have access to a context-specific dataset 304 (e.g. a context-specific dataset 206 comprising interleaved de-identified data component values 260 and synthetic data component values 270), and later to a context-specific dataset 306 comprising only de-identified data component values 260. Eventually, and in accordance with embodiments, the researcher may be granted or otherwise have access to real data 308 (e.g. the context-specific dataset 206 comprising only genuine data component values 120) with the highest privacy risk (top of Figure 3).
[00142] In accordance with some embodiments, the research lifecycle of data may be characterised by increasing levels of trust invested in a research project by a data owner or administrator. In such embodiments, the goal may be to promote research while managing the risk to patient privacy. For example, in early phases of research, a data requestor or consumer (e.g. a researcher) in a new research program may only be given access to synthetic data. They may not yet have sufficient clearance to access highly sensitive PHI, and they may not yet have a mature process in place for protecting sensitive data. By restricting access to synthetic data, a data owner may minimise the risk that data is exposed, thereby triggering a privacy violation. This phase may, in accordance with some embodiments, be dominated by experimentation and iterative algorithm development. Synthetic data is very well suited to a researcher in this phase. As their work progresses, the researcher may want to progress to working with a hybrid of synthetic data and real, but de-identified, data. This is a logical step as methods mature, progressing the researcher towards real data while minimising the risk to patient confidentiality.
[00143] In accordance with some embodiments, a further research step may comprise allowing a researcher to access and/or use a full dataset of real, but de-identified, data, thereby allowing the researcher to, for instance, measure and account for any artificial effects arising from the use of solely synthetic data that are not mirrored in a real dataset. The risk to patient privacy in such a step is low, as data is still de-identified; however, it will be appreciated that the risk to patient privacy is not zero, as de-identified data may still be re-identified (i.e. a unique person may still be associated with a previously anonymised record) via statistical techniques, as well as via correlation with publicly available datasets. As discussed above, de-identification may be tuned, in accordance with various embodiments, using a risk score that quantifies the potential for re-identification (e.g. which may be part of the likelihood of re-identification 210 of the resulting context-specific dataset 206).
[00144] In accordance with various embodiments, a final step for a researcher may comprise validation using a real dataset. At this point, a researcher may be assumed, for instance, to have developed a well-articulated hypothesis that is proven against de-identified data. The last step may therefore comprise establishing that no artefacts have been introduced to a result through the use of de-identified data, or through the process of de-identifying data. Additionally, or alternatively, the relationship between the researcher and the data owner is mature, and the data owner may trust that the researcher's processes for working with data are rigorous and will not lead to privacy violations. The researcher may accordingly be certified for accessing genuine and/or sensitive data.
[00145] Thus, as illustrated in this example, system 100 empowers the data administrator to control the movement of a researcher through each step of research. System 100 has the feature that it automatically provides a view of data tailored to the entitlements (e.g. the permitted re-identification risk value 208) of a researcher at that moment or that research stage, in accordance with some embodiments.
[00146] Similarly, Figure 4 schematically illustrates another example of a researcher interacting with data in a single virtual location. A privacy-preserving data management system (e.g. system 100) automatically provides a view (e.g. different context-specific datasets 206) in accordance with a current user entitlement. In this example, the platform manages multiple pre-generated columns (dataset 402) with different risk profiles. Upon receiving a data request, the system automatically provides access to the correct concrete column instance for a researcher's entitlement (e.g. the context-specific dataset 206 for this particular source data usage instance). Thus, initially, dataset 404 is provided, which comprises only synthetic data component values. A second data usage instance may have access to dataset 406 enabled, wherein the second dataset 406 comprises both synthetic data component values and de-identified component values. A third data usage instance may have access to dataset 408, which comprises only de-identified data component values (e.g. no longer comprising synthetic values). Finally, upon the researcher being fully trusted, dataset 410 may be provided, wherein the dataset 410 corresponds to genuine data component values of the source dataset. This removes from the researcher (e.g. data requestor 110) any burden of data management, putting it instead into the hands of a qualified data administrator (e.g. data owner 116).
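A non-limiting sketch of such entitlement-driven view resolution follows; the entitlement labels and their mapping onto the pre-generated column instances (datasets 404 to 410) are hypothetical and chosen only to mirror the stages described above.

```python
# Hypothetical mapping from a requestor's current entitlement to the
# pre-generated column instance served for the same logical dataset.
VIEW_BY_ENTITLEMENT = {
    "new_project": "synthetic",               # dataset 404
    "established": "synthetic+deidentified",  # dataset 406
    "verified_process": "deidentified",       # dataset 408
    "fully_trusted": "genuine",               # dataset 410
}

def resolve_view(entitlement: str) -> str:
    """Return the concrete column instance for a requestor's entitlement;
    the requestor always queries one logical location."""
    return VIEW_BY_ENTITLEMENT[entitlement]

assert resolve_view("new_project") == "synthetic"
```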
[00147] Thus, at all times in the examples given above, the data requestor 110 interacts with data from a single source. The data requestor 110 does not have to manage multiple different files representing the different phases of the data request process (which would add a massive administration burden, as well as the likelihood of errors and privacy violations). Instead, the data requestor 110 always interacts with data from one location, and receives a "live" view of the data that is tailored to their current entitlement(s) (e.g. the context-specific dataset 206).
[00148] In some embodiments, system 100 may be configured to adjust views (e.g. access to different instances of context-specific datasets 206) based on various factors, such as the identity of the requestor 110, their relationship with the data owner 116, the phase of research, and/or the characteristics of the data (e.g. data sensitivity, when the data was acquired, the data owner's name, or the like). In addition to views of data (e.g. access to context-specific dataset 206), various other elements of data governance may be accommodated. For example, data requestor 110 can locate datasets, request access, use earlier versions, etc., in accordance with various embodiments.
[00149] While the present disclosure describes various embodiments for illustrative purposes, such description is not intended to be limited to such embodiments. On the contrary, the applicant's teachings described and illustrated herein encompass various alternatives, modifications, and equivalents, without departing from the embodiments, the general scope of which is defined in the appended claims. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure is intended or implied. In many cases the order of process steps may be varied without changing the purpose, effect, or import of the methods described.
[00150] Information as herein shown and described in detail is fully capable of attaining the above-described object of the present disclosure and the presently preferred embodiment thereof, and is, thus, representative of the subject matter which is broadly contemplated by the present disclosure. The scope of the present disclosure fully encompasses other embodiments which may become apparent to those skilled in the art, and is to be limited, accordingly, by nothing other than the appended claims, wherein any reference to an element being made in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above-described preferred embodiment and additional embodiments as regarded by those of ordinary skill in the art are hereby expressly incorporated by reference and are intended to be encompassed by the present claims. Moreover, no requirement exists for a system or method to address each and every problem sought to be resolved by the present disclosure, for such to be encompassed by the present claims. Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. Various changes and modifications in form, material, work-piece, and fabrication detail that may be made without departing from the spirit and scope of the present disclosure, as set forth in the appended claims, and as may be apparent to those of ordinary skill in the art, are also encompassed by the disclosure.

Claims

What is claimed is:
1. A privacy-preserving data management system for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the system comprising: a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of a source dataset, said source dataset comprising a plurality of source data objects each comprising constituent genuine data component values that are associated with a corresponding data subject; a digital data processor for receiving and responding to the data request, the digital data processor being communicatively linked to a network via a communication bus, said digital data processor configured to: generate a plurality of synthetic data component values preserving, at least in part, one or more relationships between said genuine data component values amongst at least some of said plurality of source data objects; store said plurality of synthetic data component values; in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generate a context-specific dataset, wherein said context-specific dataset corresponds at least in part to said plurality of source data objects and comprises at least some synthetic data component values depending on, at least in part, said permitted access privilege.
2. The system of claim 1, wherein said synthetic data component values are generated using a generative model.
3. The system of either one of claim 1 or claim 2, wherein said digital processor is further configured to:
generate de-identified data component values corresponding to at least some of said genuine data component values; and replace in said context-specific dataset at least some of said genuine data component values with the corresponding said de-identified data component values depending on, at least in part, said permitted access privilege.
4. The system of claim 3, wherein replacing some of said genuine data component values with the corresponding said de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on said permitted access privilege.
5. The system of any one of claims 1 to 4, wherein said context-specific dataset comprises at least some of said genuine data component values of said source dataset.
6. The system of any one of claims 1 to 5, wherein said context-specific dataset is generated before the data request is received.
7. The system of any one of claims 1 to 6, wherein said permitted access privilege is based on an estimated likelihood a given data object in said context-specific data set can be associated with one of the identifiable data subjects.
8. The system of claim 7, wherein said permitted access privilege is based on one or more access permissions associated with said data requestor.
9. The system of any one of claims 1 to 8, wherein the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on said permitted access privilege.
10. The system of any one of claims 1 to 9, wherein each of said network-accessible hardware storage resources further comprises a key-value store configured to store a unique key-value logical row for each of the data objects.
11. The system of claim 10, wherein each said key-value logical row comprises a key, a metadata descriptor, and a data object identifier.
12. The system of either one of claim 10 or claim 11, wherein said key-value logical row comprises at least one of authorization information, data sensitivity information, or timestamp information.
13. The system of any one of claims 10 to 12, wherein said key-value logical row comprises a key-value logical row access authorisation value for restricting access to the corresponding key-value logical row, said authorisation value based at least in part on said permitted access privilege.
14. A computer-implemented privacy-preserving data management method, automatically implemented by one or more digital processors, for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, the method implemented on a data management system comprising a digital processor for receiving the data request and a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of said plurality of source data objects comprising constituent genuine data component values that are associated with a corresponding data subject, the method comprising: generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects; storing said plurality of synthetic data component values; in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset;
wherein said context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some said synthetic data component values depending on, at least in part, said permitted access privilege.
15. The method of claim 14 wherein said synthetic data component values are generated using a generative model.
16. The method of either one of claim 14 or claim 15, further comprising: generating de-identified data component values corresponding to at least some of said genuine data component values; and replacing in said context-specific dataset at least some of said genuine data component values with the corresponding said de-identified data component values depending on, at least in part, said permitted access privilege.
17. The method of claim 16, wherein replacing some of said genuine data component values with the corresponding said de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on said permitted access privilege.
18. The method of any one of claims 14 to 17, wherein said context-specific dataset is generated to comprise at least some of said genuine data component values from said source dataset.
19. The method of any one of claims 14 to 18, wherein said generating said context-specific dataset is done before the data request is received.
20. The method of any one of claims 14 to 19, wherein the access privilege is based on an estimated likelihood a given data object in said context-specific data set can be associated with one of the identifiable data subjects.
21. The method of claim 20, wherein said permitted access privilege is based on one or more access permissions associated with said data requestor.
22. The method of any one of claims 14 to 21, wherein the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on said permitted access privilege.
23. The method of any one of claims 14 to 22, wherein said access privilege comprises a risk acceptability threshold (RAT).
24. A computer-readable medium having stored thereon instructions for execution by a computing device for fulfilling a data request from a data requestor for a plurality of data objects relating to one or more identifiable data subjects, said computing device being in network communication with a plurality of network-accessible hardware storage resources each being in network communication and configured for distributed storage of a source dataset comprising a plurality of source data objects, each of said source data objects comprising constituent genuine data component values that are associated with a corresponding data subject, the instructions executable to automatically implement the steps of: generating a plurality of synthetic data component values at least in part preserving one or more relationships between the genuine data component values amongst at least some of the source data objects; storing said plurality of synthetic data component values; in response to the data request, the data request having a permitted access privilege specific to a particular data usage instance associated therewith, generating a context-specific dataset; wherein said context-specific dataset corresponds at least in part to the plurality of source data objects and comprises at least some of said synthetic data component values based at least in part on said permitted access privilege.
25. The computer-readable medium of claim 24, wherein said synthetic data component values are generated using a generative model.
26. The computer-readable medium of either one of claim 24 or claim 25, the steps further comprising: generating de-identified data component values corresponding to at least some of said genuine data component values; and storing said de-identified data component values; wherein said context-specific data is generated to further comprise at least some said de-identified data component values depending on, at least in part, said permitted access privilege.
27. The computer-readable medium of claim 26, wherein replacing some of said genuine data component values with the corresponding said de-identified data component values comprises replacing to a designated threshold of de-identified data component values depending on said permitted access privilege.
28. The computer-readable medium of any one of claims 24 to 27, wherein said context-specific dataset is further generated to comprise at least some of said genuine data component values from said source dataset.
29. The computer-readable medium of any one of claims 24 to 28, wherein said generating said context-specific dataset is done before said data request being received.
30. The computer-readable medium of any one of claims 24 to 29, wherein said permitted access privilege is based on an estimated likelihood a given data object in said context-specific data set can be associated with one of the identifiable data subjects.
31. The computer-readable medium of claim 30, wherein said permitted access privilege is based on one or more access permissions associated with said data requestor.
32. The computer-readable medium of any one of claims 24 to 31, wherein the at least some synthetic data component values means at least a designated threshold of synthetic data component values depending on said permitted access privilege.
33. The computer-readable medium of any one of claims 24 to 32, wherein said permitted access privilege is a risk acceptability threshold (RAT).
PCT/CA2022/051436 2021-10-04 2022-09-28 Data governance system and method WO2023056547A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163251936P 2021-10-04 2021-10-04
US63/251,936 2021-10-04

Publications (1)

Publication Number Publication Date
WO2023056547A1 true WO2023056547A1 (en) 2023-04-13

Family

ID=85803125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2022/051436 WO2023056547A1 (en) 2021-10-04 2022-09-28 Data governance system and method

Country Status (1)

Country Link
WO (1) WO2023056547A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190278943A1 (en) * 2016-05-11 2019-09-12 MDClone Ltd. Computer system of computer servers and dedicated computer clients specially programmed to generate synthetic non-reversible electronic data records based on real-time electronic querying and methods of use thereof
US20180165475A1 (en) * 2016-12-09 2018-06-14 Massachusetts Institute Of Technology Methods and apparatus for transforming and statistically modeling relational databases to synthesize privacy-protected anonymized data
WO2019144214A1 (en) * 2017-10-10 2019-08-01 Phemi Systems Corporation Methods and systems for context-specific data set derivation from unstructured data in data storage devices
US20210232705A1 (en) * 2018-07-13 2021-07-29 Imagia Cybernetics Inc. Method and system for generating synthetically anonymized data for a given task
WO2020148573A1 (en) * 2019-01-18 2020-07-23 Telefonaktiebolaget Lm Ericsson (Publ) Using generative adversarial networks (gans) to enable sharing of sensitive data
US20210165913A1 (en) * 2019-12-03 2021-06-03 Accenture Global Solutions Limited Controlling access to de-identified data sets based on a risk of re- identification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEL GROSSO GANESH; PICHLER GEORG; PIANTANIDA PABLO: "Privacy-Preserving Synthetic Smart Meters Data", 2021 IEEE POWER & ENERGY SOCIETY INNOVATIVE SMART GRID TECHNOLOGIES CONFERENCE (ISGT), IEEE, 16 February 2021 (2021-02-16), pages 1 - 5, XP033887454, DOI: 10.1109/ISGT49243.2021.9372157 *
IMTIAZ SANA; ARSALAN MUHAMMAD; VLASSOV VLADIMIR; SADRE RAMIN: "Synthetic and Private Smart Health Care Data Generation using GANs", 2021 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS (ICCCN), IEEE, 19 July 2021 (2021-07-19), pages 1 - 7, XP033966204, DOI: 10.1109/ICCCN52240.2021.9522203 *
TORKZADEHMAHANI, R. ET AL.: "DP-CGAN: Differentially Private Synthetic Data and Label Generation", PROCEEDINGS OF THE IEEE /CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW, 16 June 2019 (2019-06-16), Long Beach, CA, USA, pages 98 - 104, XP033746975, DOI: 10.1109/CVPRW.2019.00018 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22877724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE