CA2986320A1 - Methods and systems for context-specific data set derivation from unstructured data in data storage devices - Google Patents


Info

Publication number
CA2986320A1
Authority
CA
Canada
Prior art keywords
data
value
key
context
derived
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA2986320A
Other languages
French (fr)
Inventor
Russ Weeks
Tristen Georgiou
Tim To
Josef Roehrl
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuseforward Technology Solutions Ltd
Original Assignee
Phemi Systems Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Phemi Systems Corp
Priority to CA2986320A
Priority to PCT/CA2018/051268 (WO2019144214A1)
Publication of CA2986320A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

Described are various embodiments of systems, methods, and devices relating to the generation of independent context-specific datasets based on existing raw data sets, some embodiments comprising a plurality of data storage components for storage of a plurality of data objects; and a processing component having a data object key value store accessible thereto, said data object key value store configured to store a unique key-value logical row for constituent data object components of a data object, each such key-value logical row comprising a key for uniquely identifying the key-value logical row; a constituent data object component value for providing value information relating to the constituent data object component; and a metadata descriptor for describing a data object component characteristic of the constituent data object component value; wherein at least one of the constituent data object components are derived from raw data and at least one of the constituent data object components are derived from one or more other constituent data object components; and wherein, in response to a data access request based on one or more metadata descriptors.

Description

METHODS AND SYSTEMS FOR CONTEXT-SPECIFIC DATA SET DERIVATION
FROM UNSTRUCTURED DATA IN DATA STORAGE DEVICES
FIELD OF THE DISCLOSURE
[0001] The present disclosure relates to scalable, secure, and policy-compliant distributed data storage systems, and, in particular, to methods and systems for context-specific data set derivation from unstructured data in data storage devices.
BACKGROUND
[0002] Organizations are storing increasingly large amounts of diverse, unstructured data. This data is not typically in a format that can be easily analyzed and stored using traditional storage and analytics systems. Moreover, there is often sensitive information within the data and there is a need to ensure it is properly governed. This includes not only granular access control but also the ability to create and supply datasets that conform to existing privacy and compliance standards and follow strict retention policies on the data.
[0003] Two approaches that have been adopted across various industries are (i) relational databases and (ii) Hadoop file systems, the latter used for distributed storage and processing of big data sets. Relational databases apply either row/column level security or stricter security at the application level, and therefore have difficulty in scaling and applying operational requirements to a more granular set of data, including when the granularization requirements may be dynamic or unknown at data upload. Hadoop file systems use heterogeneous data stores with an external governance framework that can be used to tag and govern the data. The level of control in Hadoop systems varies according to the capabilities of the underlying system. For instance, a document store can protect at the document level but not within the document (which may also be a problem for relational databases). This makes it difficult to apply a uniform and granular governance model to the data and to create datasets flexible enough to conform to modern privacy rules.
[0004] Further, relational databases are unable to scale to the sizes required to store big data today (petabytes). They are often expensive to scale and scale vertically, whereas a key/value store scales horizontally. They do not provide flexible curation of the data since they have fixed schemas and cannot run any type of processing function across all the data stored. Databases are also limited to row/column level security and, although finer grained security can be achieved at the application layer, this is a complex approach that requires a heavy-weight update to both the database and the application whenever the security model changes.
[0005] A need exists for methods and systems that provide for context-specific data set derivation from unstructured data in data storage devices that overcome some of the drawbacks of known techniques, or at least, provide a useful alternative thereto.
This background information is provided to reveal information believed by the applicant to be of possible relevance. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art or forms part of the general common knowledge in the relevant art.
SUMMARY
[0006] The following presents a simplified summary of the general inventive concept(s) described herein to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to restrict key or critical elements of the invention or to delineate the scope of the invention beyond that which is explicitly or implicitly described by the following description and claims.
[0007] The disclosure is a system and a method for granular governance and flexible curation of digital assets. Further, the inventive subject matter disclosed herein provides, in some embodiments, a flexible framework to govern and curate unstructured and structured data and create datasets that can be used for, as an example, operational use and analytics within the proper context; and further, in some embodiments, in a manner that can comply with changing legal and regulatory requirements. Accordingly, granular management of data sets that can scale with the rapid growth of data storage and processing requirements, with a customizable approach to policy compliance is required.
[0008] In accordance with one aspect, there is provided a data storage system for fulfilling a data request for a context-specific data set, said context-specific data set based on a raw data set, the system comprising: a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects; a digital data processor responding to data access requests received over a network and relating to the data objects; a key-value store comprising a unique key-value logical row for each constituent data component of each of said data objects, each said unique key-value logical row comprising: a key for identifying said unique key-value logical row; a constituent data component value comprising stored digital information relating to said constituent data component associated with said unique key-value logical row; and a metadata descriptor describing metadata of said constituent data component value;
wherein at least one key-value logical row for a given data object is a direct key-value logical row directly associated with the raw data set and wherein at least one key-value logical row for the given data object is a derived key-value logical row derived from one or more other key-value logical rows; wherein, upon said digital data processor generating the context-specific data set responsive to a given data request to the data storage system, said digital data processor further generates a re-identification risk value for the context-specific data set to be associated therewith, said re-identification risk value representative of a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains; and wherein said given data request is selectively fulfilled by the data storage system as a function of said re-identification risk value.
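The row structure recited above can be pictured with a minimal Python sketch. It is illustrative only: the class names `LogicalRow` and `KeyValueStore`, the `derived_from` field, and the sample keys and values are assumptions made for this example and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogicalRow:
    """One key-value logical row (hypothetical field names)."""
    key: str                             # uniquely identifies the row
    value: str                           # constituent data component value
    descriptor: str                      # metadata descriptor for the value
    derived_from: Tuple[str, ...] = ()   # source row keys; empty for a direct row

    @property
    def is_derived(self) -> bool:
        return bool(self.derived_from)


class KeyValueStore:
    """In-memory stand-in for the key-value store (assumption)."""

    def __init__(self) -> None:
        self._rows: Dict[str, LogicalRow] = {}

    def put(self, row: LogicalRow) -> None:
        if row.key in self._rows:
            raise KeyError(f"duplicate key: {row.key}")
        for source in row.derived_from:
            if source not in self._rows:
                raise KeyError(f"unknown source row: {source}")
        self._rows[row.key] = row

    def by_descriptor(self, descriptor: str) -> List[LogicalRow]:
        return [r for r in self._rows.values() if r.descriptor == descriptor]


store = KeyValueStore()
# A direct row taken from the raw data set.
store.put(LogicalRow("patient-17/birth_date", "1984-03-17", "birth_date"))
# A derived row computed from the direct row above.
store.put(LogicalRow("patient-17/birth_year", "1984", "birth_year",
                     derived_from=("patient-17/birth_date",)))
```

Recording the source keys on each derived row is one possible way to keep the direct/derived distinction queryable; the disclosure does not prescribe this representation.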
[0009] In accordance with some aspects, there are provided data storage systems wherein said re-identification risk value may optionally be generated based on similarities between an aspect of the given constituent data component, and a corresponding aspect of at least one other constituent data component used in the context-specific data set.

[0010] In accordance with some aspects, there are provided data storage systems wherein the re-identification risk may be generated based on at least one of the following calculated properties of the aspect of the context-specific data set: k-anonymity, t-closeness, l-diversity, and privacy differential.
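As an illustration of one of the listed properties, the sketch below computes k-anonymity over a small data set and maps it to a risk value. The mapping `risk = 1/k`, the function names, and the sample records are assumptions for this example; the disclosure does not specify how the risk value is calculated from k.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity(records: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def reidentification_risk(records: List[Dict[str, str]],
                          quasi_identifiers: Sequence[str]) -> float:
    """Assumed mapping: the chance of singling out a record in the smallest group."""
    k = k_anonymity(records, quasi_identifiers)
    return 1.0 / k if k else 0.0


# Hypothetical sample data.
records = [
    {"birth_year": "1984", "postal_prefix": "V6B", "diagnosis": "asthma"},
    {"birth_year": "1984", "postal_prefix": "V6B", "diagnosis": "diabetes"},
    {"birth_year": "1991", "postal_prefix": "V5K", "diagnosis": "asthma"},
]
quasi = ["birth_year", "postal_prefix"]
```

Here the third record is unique on its quasi-identifiers, so k is 1 and the assumed risk is at its maximum; dropping that record leaves a group of two.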
[0011] In accordance with some aspects, there are provided data storage systems wherein each key-value logical row may further comprise a sensitivity value indicating a sensitivity associated with a corresponding key-value logical row.
[0012] In accordance with some aspects, there are provided data storage systems wherein the sensitivity value may be associated with one or more of the following: a permissible requesting user identifier for the corresponding key-value logical row or an aspect thereof, a predetermined sensitivity tag associated with one or more aspects of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, or the metadata descriptor of the corresponding key-value logical row.
[0013] In accordance with some aspects, there are provided data storage systems wherein said given data request for the context-specific data set may be selectively fulfilled solely upon said re-identification risk value associated with the context-specific data set being lower than a designated re-identification risk threshold.
[0014] In accordance with some aspects, there are provided data storage systems wherein said re-identification risk threshold may be automatically determined by the data storage system based on whether a requesting computing device is within a designated zone of trust.
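A threshold that depends on a zone of trust could be sketched as follows. The network range, the two threshold values, and the function names are hypothetical choices for this example; treating the zone of trust as an IP network is also an assumption.

```python
import ipaddress

# Hypothetical zone of trust and thresholds (assumptions, not from the disclosure).
TRUSTED_ZONE = ipaddress.ip_network("10.0.0.0/8")
THRESHOLD_INSIDE_ZONE = 0.5
THRESHOLD_OUTSIDE_ZONE = 0.1


def risk_threshold(requester_ip: str) -> float:
    """Pick a more permissive threshold for devices inside the zone of trust."""
    inside = ipaddress.ip_address(requester_ip) in TRUSTED_ZONE
    return THRESHOLD_INSIDE_ZONE if inside else THRESHOLD_OUTSIDE_ZONE


def may_fulfil(risk_value: float, requester_ip: str) -> bool:
    """Fulfil only when the risk value is lower than the applicable threshold."""
    return risk_value < risk_threshold(requester_ip)
```

With these values, a data set whose risk value is 0.25 would be released to a device at 10.1.2.3 but withheld from one at 203.0.113.7.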
[0015] In accordance with some aspects, there are provided data storage systems wherein said re-identification risk threshold may be automatically determined by the data storage system based on one or more of: an identity of a requesting user, a role of the requesting user, sensitivity of data components in the context-specific data set, a location of a requesting computing device, a security indication of the requesting computing device, or a combination thereof.
[0016] In accordance with some aspects, there are provided data storage systems wherein at least one said derived key-value logical row may be automatically generated from the raw data set upon importing such raw data set into the data storage system.
[0017] In accordance with some aspects, there are provided data storage systems wherein at least one said derived key-value logical row may be derived upon request for such derivation by a user of the system. In accordance with some aspects, there are provided data storage systems wherein at least one said derived key-value logical row may be automatically derived from one or more pre-existing direct or derived key-value logical rows that are associated with said given data request so to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more existing key-value logical rows and thus reduce a given re-identification risk value associated with a derived context-specific data set relying on said at least one derived key-value logical row given said similarity increase.
[0018] In accordance with some aspects, there are provided data storage systems wherein said derived key-value logical row may be derived by obfuscating the constituent data component value of said pre-existing key-value logical rows to generate the constituent data component value of the derived key-value logical row, and the corresponding metadata descriptor of the derived key-value logical row being generated based on said obfuscating.
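One simple form of obfuscation is generalization, sketched below for a date value. The `Row` type, the key and descriptor suffixes, and the choice of truncating a date to its year are assumptions for illustration; the disclosure does not fix a particular obfuscation technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical minimal logical row."""
    key: str
    value: str
    descriptor: str


def generalize_date_to_year(source: Row) -> Row:
    """Derive a new row whose value keeps only the year of an ISO date.

    The descriptor of the derived row records which obfuscation was applied.
    """
    return Row(
        key=f"{source.key}#year",
        value=source.value[:4],
        descriptor=f"{source.descriptor}|generalized:year",
    )


a = generalize_date_to_year(Row("patient-17/birth_date", "1984-03-17", "birth_date"))
b = generalize_date_to_year(Row("patient-42/birth_date", "1984-11-02", "birth_date"))
```

The two source values differ, but the derived values are identical, which is the increase in similarity that lowers a risk measure such as k-anonymity.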
[0019] In accordance with some aspects, there are provided data storage systems wherein said derived context-specific data set may be automatically generated by the data storage system upon said re-identification risk value associated with a first context-specific data set being too high to permit selective fulfilment of said given data request.
In accordance with some aspects, there are provided data storage systems wherein said derived context-specific data set may be generated automatically upon said re-identification risk value associated with a first context-specific data set being higher than a first designated threshold.

[0020] In accordance with one aspect, there is provided a data storage method for fulfilling a data request for a context-specific dataset based on one or more raw data sets, the method implemented on a data storage system comprising a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects, and a digital data processor for responding to data storage requests received over a network and relating to said data objects, the method comprising: storing a key-value store comprising a unique key-value logical row for each constituent data component of each data object, each key-value logical row comprising: a key for identifying the key-value logical row; a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row; and a metadata descriptor describing metadata of a data component value; directly generating at least one of the key-value logical rows for a given data object from raw data; deriving at least one of the key-value logical rows for the given data object from other key-value logical rows;
generating the context-specific data set responsive to the data request;
generating a re-identification risk value for the context-specific data set, the re-identification risk value indicating a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains; and selectively fulfilling the context-specific data request as a function of said re-identification risk value.
[0021] In accordance with some aspects, there are provided data storage methods wherein said re-identification risk value may be generated based on similarities between an aspect of the given constituent data component, and a corresponding aspect of at least one other constituent data component used in the context-specific data set.
[0022] In accordance with some aspects, there are provided data storage methods wherein said re-identification risk value may be generated based on at least one of the following calculated properties of the aspect of the context-specific data set: k-anonymity, t-closeness, l-diversity, or privacy differential.

[0023] In accordance with some aspects, there are provided data storage methods wherein each key-value logical row may further comprise a sensitivity value indicating a sensitivity associated with the corresponding key-value logical row. In accordance with some aspects, there are provided data storage methods wherein the sensitivity value may be associated with one or more of the following: a permissible requesting user identifier for the corresponding key-value logical row or an aspect thereof, a predetermined sensitivity tag associated with one or more aspects of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, or the metadata descriptor of the corresponding key-value logical row.
[0024] In accordance with some aspects, there are provided data storage methods wherein said step of selectively fulfilling may comprise fulfilling the data request solely upon said re-identification risk value associated with the context-specific data set being lower than a designated risk threshold. In accordance with some aspects, there are provided data storage methods wherein said risk threshold may be determined based on whether a requesting computing device is within a designated zone of trust. In accordance with some aspects, there are provided data storage methods wherein said risk threshold may be determined based on one or more of: an identity of a requesting user, a role of the requesting user, sensitivity of data components in the context-specific data set, a location of a requesting computing device, a security indication of the requesting computing device, or a combination thereof.
[0025] In accordance with some aspects, there are provided data storage methods further comprising: automatically generating a derived context-specific data set to fulfil the data request, wherein the derived context-specific data set is based on at least one derived key-value logical row that is automatically derived from one or more pre-existing direct or derived key-value logical rows associated with the data request so as to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more pre-existing key-value logical rows and thus reduce a given re-identification risk value associated with said derived context-specific data set given said similarity increase.

[0026] In accordance with some aspects, there are provided data storage methods wherein the at least one derived key-value logical row may be derived by obfuscating the constituent data component value of the corresponding one or more pre-existing key-value logical rows to generate the constituent data component value of the derived key-value logical row, and the corresponding metadata descriptor of the derived key-value logical row being generated based on said obfuscating.
[0027] In accordance with some aspects, there are provided data storage methods wherein the derived context-specific data set may be generated upon the re-identification risk associated with a first context-specific data set being too high to permit selective fulfilment of the data request. In accordance with some aspects, there are provided data storage methods wherein the derived context-specific data set is generated automatically upon the re-identification risk value associated with a first context-specific data set being higher than a first designated risk threshold.
[0028] In accordance with one aspect, there is provided a device for fulfilling a data request for a context-specific dataset based on an existing raw data set, the device being in network communication with a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication, and configured for distributed storage of data objects, the device comprising: a digital data processor for responding to data storage requests received over a network and relating to said data objects; and a network communications interface for communicatively interfacing one or more requesting users and a key-value store configured to store a unique key-value logical row for each constituent data object component of each data object, each such key-value logical row comprising: a key for identifying the key-value logical row; a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row; and a metadata descriptor describing metadata of the constituent data component value;
wherein at least one key-value logical row for a given data object is a direct key-value logical row directly associated with raw data and wherein at least one key-value logical row for the given data object is a derived key-value logical row derived from one or more other key-value logical rows; and wherein, upon said digital data processor generating the context-specific data set responsive to a given data request to the data storage system, said digital data processor further generates a re-identification risk value for the context-specific data set to be associated therewith, said re-identification risk value representative of a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains; and wherein said given data request is selectively fulfilled by the data storage system as a function of said re-identification risk value.
[0029] In accordance with one aspect, there is provided a computer-readable medium having stored thereon instructions for execution by a computing device for fulfilling a data request for a context-specific dataset based on an existing raw data set, said computing device being in network communication with a data storage system comprising a plurality of data storage components, each of said data storage components being in network communication, and configured for distributed storage of a plurality of data objects, each said data object comprising a plurality of constituent data object components, the instructions executable to automatically implement the steps of any one of the methods disclosed herein.
In accordance with one aspect, there is provided a data storage system for fulfilling a data request for a context-specific data set, said context-specific data set based on a raw data set, the system comprising: a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects; a digital data processor responding to data access requests received over a network and relating to the data objects; a key-value store comprising a unique key-value logical row for each constituent data component of each of said data objects, each said unique key-value logical row comprising: a key for identifying said unique key-value logical row; a constituent data component value comprising stored digital information relating to said constituent data component associated with said unique key-value logical row; and a metadata descriptor describing metadata of said constituent data component value; wherein, in response to a given data request, said digital data processor: generates a first context-specific data set based on existing key-value logical rows; associates a re-identification risk value with said first context-specific data set representative of a likelihood that a given constituent data component used to generate said first context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains;
selectively fulfils said given data request based on said re-identification risk value by:
providing access to said first context-specific data set upon said re-identification risk satisfying a designated risk criterion; otherwise automatically generating and providing access to a derived context-specific data set so as to fulfil the data request, wherein the derived context-specific data set is based on at least one derived key-value logical row that is automatically derived from one or more pre-existing direct or derived key-value logical rows associated with the data request so as to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more pre-existing key-value logical rows and thus reduce a given re-identification risk value associated with said derived context-specific data set given said similarity increase.
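The fulfil-or-derive flow described in this aspect might look like the following sketch. The risk measure (one over the size of the smallest group of equal values), the "lower than threshold" criterion, and all names are assumptions chosen to keep the example short; they are not taken from the disclosure.

```python
from collections import Counter
from typing import Callable, List, Tuple


def risk_value(values: List[str]) -> float:
    """Assumed risk measure: 1 / (size of the smallest group of equal values)."""
    counts = Counter(values)
    return 1.0 / min(counts.values()) if counts else 0.0


def fulfil(values: List[str], threshold: float,
           generalize: Callable[[str], str]) -> Tuple[str, List[str], float]:
    """Return the first data set if its risk is acceptable, else a derived one."""
    first_risk = risk_value(values)
    if first_risk < threshold:
        return "first", list(values), first_risk
    # Derive less specific values so that more of them coincide.
    derived = [generalize(v) for v in values]
    return "derived", derived, risk_value(derived)


# Hypothetical birth dates; generalization keeps only the year.
dates = ["1984-03-17", "1984-11-02", "1984-07-30"]
kind, data, risk = fulfil(dates, threshold=0.5, generalize=lambda d: d[:4])
```

In this run every date is unique, so the first data set fails the criterion and the derived data set of three identical years is returned with a lower risk value. A fuller implementation would presumably re-check the derived risk against the criterion before release.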
[0030] The system can receive unstructured or structured data as an input. In some cases, the input data could be acquired from a patient record, a financial record or other type of record and can come in several formats such as PDF, CSV or other types of electronic or non-electronic inputs. The input data will go through an initial process, sometimes referred to as an ingestion process, or alternatively referred to as a data input process, which consists of obtaining data from its original form in the raw data and storing it in a logical row in a key-value store. This original form of the data may be referred to as "raw data" and can include text files, PDF, CSV, spreadsheets, etc. Upon or even during ingestion or data input, the system automatically associates metadata information with the data stored in key/value pairs within a logical row in the key value store. This metadata may provide information that describes the data or a characteristic thereof, including contextual, descriptive and governance information such as the origin, ownership, integrity information (such integrity information, in turn, including but not limited to size, encoding, checksum information, and retention requirements information) and other governance-related information.
Embodiments of the subject matter disclosed herein may be employed by a user to add additional logical rows, and thus additional metadata information regarding data, a data object, or one or more other logical rows at any time after data input to further describe context or attributes of the data (or data object).
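A minimal sketch of the ingestion step follows, assuming an in-memory dictionary as the key-value store and SHA-256 as the checksum. The key layout, the field names, and the sample file name are hypothetical.

```python
import hashlib
from typing import Any, Dict


def ingest(store: Dict[str, Dict[str, Any]], object_id: str, raw: bytes,
           origin: str, encoding: str = "utf-8") -> str:
    """Store raw data in a logical row and attach metadata automatically."""
    key = f"{object_id}/raw"          # hypothetical key layout
    store[key] = {
        "value": raw,
        "origin": origin,             # where the data came from
        "size": len(raw),             # integrity information
        "encoding": encoding,
        "checksum": hashlib.sha256(raw).hexdigest(),
    }
    return key


store: Dict[str, Dict[str, Any]] = {}
payload = b"name,birth_date\nA. Patient,1984-03-17\n"
key = ingest(store, "record-001", payload, origin="clinic-upload.csv")
```

Further metadata, such as ownership or retention requirements, could be added to the same row or to additional rows at a later time.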
[0031] In some embodiments, there may also be provided a framework for executing distributed processing functions to curate the raw data into different forms.
Once ingested, the collection of all logical rows in the key value store make up all the data relating to a given data object, and may be referred to as a digital asset.
This curation can occur at the time of ingest or after ingest. In some embodiments, curation may consist of the following functions, inter alia: (1) The extraction or computation of derived data from the original data; and (2) The addition of context to the data. These functions can be considered to be generating additional metadata information associated with the original raw data, data object, or data relating to the data object, and which is stored with a value relating to said additional metadata information alongside the raw data in key value pairs.
In this way, additional information, descriptors, context, and governance information can be associated with a data object and/or data asset, either at the time of ingestion or later.
Datasets relating to a set of data objects can be generated based on the existing metadata information in the applicable logical rows, which can then be presented to different users at different times depending on context; no access to the original data is required and the nature and level of access may be governable in a highly customized, dynamic, and granular fashion.
[0032] Embodiments may also provide a distributed execution framework, which simplifies the process of writing distributed jobs to curate data thereby enabling developers who are not familiar with a distributed system to write data processing functions that extract, generate and store additional derived data.
[0033] Embodiments may also provide the ability to use metadata to generate on-demand context-specific datasets consisting of metadata and/or raw data. These can be exported and protected with privacy or other access and compliance rules.
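Generating a data set on demand from metadata could be sketched as a selection over rows by descriptor. The row layout (`object_id`, `descriptor`, `value`) and the sample rows are assumptions for this example.

```python
from typing import Dict, Iterable, List, Set


def build_dataset(rows: Iterable[Dict[str, str]],
                  wanted_descriptors: Set[str]) -> Dict[str, Dict[str, str]]:
    """Collect, per data object, the values whose descriptor was requested."""
    dataset: Dict[str, Dict[str, str]] = {}
    for row in rows:
        if row["descriptor"] in wanted_descriptors:
            dataset.setdefault(row["object_id"], {})[row["descriptor"]] = row["value"]
    return dataset


# Hypothetical rows: one direct and two derived per the descriptors used.
rows: List[Dict[str, str]] = [
    {"object_id": "patient-17", "descriptor": "birth_date", "value": "1984-03-17"},
    {"object_id": "patient-17", "descriptor": "birth_year", "value": "1984"},
    {"object_id": "patient-17", "descriptor": "diagnosis", "value": "asthma"},
    {"object_id": "patient-42", "descriptor": "birth_year", "value": "1984"},
]
research_view = build_dataset(rows, {"birth_year", "diagnosis"})
```

Because the request names only `birth_year` and `diagnosis`, the exact birth date never enters the resulting data set.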
[0034] Other aspects, features and/or advantages will become more apparent upon reading of the following non-restrictive description of specific embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES
[0035] Several embodiments of the present disclosure will be provided, by way of examples only, with reference to the appended drawings, wherein:
[0036] Figure 1 shows an exemplary architecture in accordance with one embodiment of the instant disclosure.
[0037] Figure 2 shows a schematic of a system in accordance with another aspect of the instant disclosure.
[0038] Figure 3 shows an exemplary schema of a key-value store in accordance with an aspect of the instant disclosure.
[0039] Figure 4 shows a conceptual schema and workflow in accordance with an aspect of the instant disclosure.
[0040] Figure 5 shows another conceptual schema and workflow in accordance with an aspect of the instant disclosure.
[0041] Figure 6 shows a conceptual workflow for deriving datasets in accordance with an aspect of the instant disclosure.
[0042] Elements in the several figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be emphasized relative to other elements for facilitating understanding of the various presently disclosed embodiments. Also, common, but well-understood elements that are useful or necessary in commercially feasible embodiments are often not depicted in order to facilitate a less obstructed view of these various embodiments of the present disclosure.

DETAILED DESCRIPTION
[0043] Various implementations and aspects of the specification will be described with reference to details discussed below. The following description and drawings are illustrative of the specification and are not to be construed as limiting the specification.
Numerous specific details are described to provide a thorough understanding of various implementations of the present specification. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of implementations of the present specification.
[0044] Various apparatuses and processes will be described below to provide examples of implementations of the system disclosed herein. No implementation described below limits any claimed implementation and any claimed implementations may cover processes or apparatuses that differ from those described below. The claimed implementations are not limited to apparatuses or processes having all of the features of any one apparatus or process described below or to features common to multiple or all of the apparatuses or processes described below. It is possible that an apparatus or process described below is not an implementation of any claimed subject matter.
[0045] Furthermore, numerous specific details are set forth in order to provide a thorough understanding of the implementations described herein. However, it will be understood by those skilled in the relevant arts that the implementations described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the implementations described herein.
[0046] In this specification, elements may be described as "configured to" perform one or more functions or "configured for" such functions. In general, an element that is configured to perform or configured for performing a function is enabled to perform the function, or is suitable for performing the function, or is adapted to perform the function, or is operable to perform the function, or is otherwise capable of performing the function.

[0047] It is understood that for the purpose of this specification, language of "at least one of X, Y, and Z" and "one or more of X, Y and Z" may be construed as X only, Y only, Z only, or any combination of two or more items X, Y, and Z (e.g., XYZ, XY, YZ, ZZ, and the like). Similar logic may be applied for two or more items in any occurrence of "at least one ..." and "one or more..." language.
[0048] The systems and methods described herein provide, in accordance with different embodiments, different examples of the ability to create different views of data, using data sets derived from the actual or live data, depending on the contextual requirements relating to the data and/or data consumer. Such requirements often include privacy compliance but may also include other administrative, analytics, management, or use-related requirements.
[0049] In some embodiments, there are provided methods and systems relating to data storage that leverage the capability of a key-value store. Some embodiments may utilize one or more storage devices, each of which may further comprise storage sub-elements (for example, a server comprising a plurality of storage blades that each in turn comprise multiple storage elements of the same or different types, e.g. flash or disk).
Very large data sets may be distributed amongst many different local or remote storage elements; they may be closely stored (e.g. on the same device or on directly connected devices, such as different blades on the same server) or they may be highly disparately and remotely stored (e.g. on different, but networked, server clusters).
Furthermore, the data stored may be duplicated for a number of reasons, including redundancy and failure handling, as well as efficiency (e.g. to store a copy of information that has been recently used "close" to other required data). Systems and methodologies for managing such large and complex data sets have been developed (e.g. HDFS for Hadoop™). Overlaying such complexity, however, there are disclosed methodologies, devices, and systems for storing, accessing, and using very large data sets using a key-value store to ingest and store data from raw data sources (e.g. a patient or financial record) in a highly granular fashion.

[0050] For a given data object, such as, for example, a patient record or indeed a patient, at least some if not all of the available data are ingested as individual discrete portions of data, along with a metadata descriptor of each portion; a key is associated with the entry for, in part, future identification. Accordingly, the key-value store comprises logical rows, wherein each logical row comprises an individual portion of the raw data or constituent data component value (the "value"), an identifier (the "key"), a metadata descriptor, a data object identifier, and, optionally in different embodiments, additional management information, such as authorization, sensitivity or other compliance information and/or timestamp information. A logical row that comprises a constituent data component value and a key identifier may also be referred to as a key-value pair. The collection of all logical rows for a given data object comprises the digital asset (typically, the data asset will also include the raw data; however, in many embodiments, there will be a logical row associated with the raw data, e.g. a patient record in a text file or PDF format). The concept of a data object may, in some embodiments, be considered to be broader than the data asset, and refer to all information, whether existing or potential, regarding any entity, such as a patient, hospital, doctor, bank, transaction, etc. In one exemplary embodiment, considering an existing patient as a data object and a patient record as the raw data, a first logical row may consist of an object id relating to the patient, a unique identifier (the key), a metadata descriptor of "raw data", and a value being the patient record data file itself; from the raw data file, additional logical rows are created for every discrete portion of raw data. Additional logical rows can then be derived from the existing logical rows as well as other applicable information; for example, derived logical rows corresponding to existing logical rows can be generated that aggregate or obfuscate existing logical rows. When combined with specific other logical rows, any of the existing logical rows, either imported (i.e. ingested) or derived (i.e. curated), can be provided along with, or excluded from, access requests associated with the derived logical row. Because there are very few limits on how such derived logical rows can be generated, and all of the data of the data asset are highly granularized to individual discrete pieces of data, provision and use of data associated with any given data object (or class or group of data objects) can be managed at the level of each such piece of data.

That is, data can be managed far below the level of the data object or table, which would be the limitation in state-of-the-art systems. In some embodiments, the value portion of a given logical row may be the actual value (or data), or it may be a reference, direct or indirect, to the value and/or storage location of the value.
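The logical row described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class name `LogicalRow`, the helpers `new_row` and `ingest`, the "field: value" input layout, and the sample patient values are all hypothetical and do not come from the disclosure, which does not prescribe a concrete format.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalRow:
    key: str                           # unique identifier of this row (the "key")
    object_id: str                     # data object the row belongs to, e.g. a patient
    descriptor: str                    # metadata descriptor, e.g. "raw data" or "age"
    value: object                      # the datum itself, or a reference to it
    sensitivity: Optional[str] = None  # optional management information
    timestamp: Optional[int] = None    # optional timestamp information


def new_row(object_id, descriptor, value, sensitivity=None, timestamp=None):
    """Create a row with a freshly generated unique key (assumed key scheme)."""
    return LogicalRow(uuid.uuid4().hex, object_id, descriptor, value,
                      sensitivity, timestamp)


def ingest(object_id, raw_text):
    """One row for the raw record, plus one row per 'field: value' line."""
    rows = [new_row(object_id, "raw data", raw_text, sensitivity="PHI")]
    for line in raw_text.splitlines():
        if ":" in line:
            name, _, val = line.partition(":")
            rows.append(new_row(object_id, name.strip(), val.strip(),
                                sensitivity="PHI"))
    return rows


# Hypothetical patient record used purely for illustration.
rows = ingest("patient-001", "name: John Doe\nage: 47\npostal code: V6B 1A1")
```

Each discrete portion of the record ends up in its own row, so governance decisions can later be taken per row rather than per file.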
[0051] In embodiments, a key-value store may be employed for granular governance and flexible curation of digital assets. Embodiments hereof can receive unstructured or structured data as an input. In some cases, the input data could be acquired from a patient record, a financial record or other type of record and can come in several formats such as PDF, CSV or other types of electronic or non-electronic inputs. In accordance with one aspect, a key-value store is a data storage structure designed for storing, retrieving, and managing associative arrays, which contain a collection of objects or records, which in turn have different fields within them, each containing data. In some embodiments, the data included in a data collection will have related attributes so that the data can be stored, retrieved and managed in an efficient manner and this data collection can be derived, generated or calculated after or during curation. These records are stored and retrieved using a key identifier that uniquely identifies the record, and is used to quickly find data within a database. In addition to storing, retrieving, and managing associative arrays using the key identifier, disclosed implementations of the key-value store allow generation of context-specific datasets that are generated from the key-value store itself (keeping in mind that in some embodiments the "value" portion of a logical row can be the associated piece of data, or a reference thereto). Such generated datasets may be based on further utilization of additional descriptors and indicators, depending on the data access request.
[0052] In some embodiments, raw data, which may comprise any type of raw data in various formats, including PDF files, text files, CSV, database information, and spreadsheet documents, is extracted and stored as a data object comprising a key-value logical row, which comprises at least constituent data component values and the associated metadata descriptors. The data object is associated with the raw data, as well as all other logical rows that have been or may be created. Multiple and separate records relating to a data object, e.g. a patient, may constitute an example where a data object may be associated with more than one raw data set. In some embodiments, at run-time and/or subsequent to the ingestion or receipt of the raw data, metadata of the raw data are collected, derived, or formulated and are stored as key-value logical rows, each with its unique key, constituent data component values and associated metadata descriptor. In embodiments, the metadata associated with a given logical row is a type of data that describes and gives information about the data to which the logical row pertains. For example, the metadata could be "raw data", "file type", "patient ID", "name", with the value associated therewith, as extracted from the raw data or derived from other data, stored in the same logical row. Each collected, derived, or formulated key-value entry is stored in the key-value data store as a key-value logical row, the rows collectively forming a data asset or a portion thereof.
Examples of metadata include the name of the file, the type of the file, the time the file was stored, the raw data itself, and the information regarding who stored the file. The collected information gets parsed and saved in the key-value store as a key-value logical row with its respective key for unique identification, constituent data component value, and metadata descriptors. Concurrent to the collection of the information, at the run time or at subsequent times when the raw data exists in the key-value store, the raw data may be parsed for acquisition of metadata. The acquired metadata are stored in the key-value store with respective key for unique identification, constituent data component value, and metadata descriptors. The metadata preliminarily derived are saved as key-value logical rows in the key-value store, key-value logical rows collectively forming a data object associated with a raw data. First name, last name, type of disease, date of financial transaction and age are some examples of the acquired data. Furthermore, derived metadata may be derived from other logical rows, including either raw data, acquired data from the raw data, or other derived data; in some embodiments, it may be derived from other information associated with a data object, rather than directly from the existing data asset. The metadata associated with derived logical rows are stored in the key-value store as part of the key-value logical rows with the logical row unique identifier (such unique identifier being a unique key), a data object identifier, and constituent data component value. In some embodiments, metadata may be employed to formulate and output context and requestor specific dataset. 
For example, a data set may be generated from a key-value store by accessing only obfuscated logical rows, as well as other lower-sensitivity rows (or rows meeting other access criteria); accordingly, a derived data set that is separate from the raw data, or even from the key-value store data, is specifically produced for a certain context, and that context may be determined or created by generating specific types of logical rows based on pre-determined metadata. Another example may include a patient dataset where a derived logical row includes an age range, or the first three digits of a postal code, and the resulting derived dataset is generated by accessing all non-identifying information regarding disease types and outcomes for a group of patients along with the aforementioned derived logical row; without providing access to the raw data, an analysis of the dataset can be performed wherein disease frequency by age or location can be assessed without giving any direct access to sensitive information. As the logical rows can be generated before ingestion for automatic curation or after for more customized curation, dataset creation can be dynamic and compliant irrespective of the type of information stored regarding data objects.
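The age-range example above can be illustrated with a short Python sketch. It assumes a simplified row layout of (object id, descriptor, value, sensitivity tag); the function names `age_range` and `derive_dataset`, the ten-year bucket width, and the sample patients are hypothetical choices made for illustration only.

```python
from collections import Counter

# Hypothetical logical rows: (object_id, descriptor, value, sensitivity tag).
rows = [
    ("p1", "age", 47, "PHI"),
    ("p1", "disease type", "asthma", "DE-IDENTIFIED"),
    ("p2", "age", 42, "PHI"),
    ("p2", "disease type", "asthma", "DE-IDENTIFIED"),
    ("p3", "age", 68, "PHI"),
    ("p3", "disease type", "diabetes", "DE-IDENTIFIED"),
]


def age_range(age, width=10):
    """Obfuscate an exact age into a bucket such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Curation step: derive one "age range" row from each existing "age" row.
derived = [(oid, "age range", age_range(value), "DE-IDENTIFIED")
           for oid, descriptor, value, _ in rows if descriptor == "age"]


def derive_dataset(all_rows, allowed):
    """Assemble per-object records from rows whose tag is in `allowed`."""
    dataset = {}
    for oid, descriptor, value, tag in all_rows:
        if tag in allowed:
            dataset.setdefault(oid, {})[descriptor] = value
    return dataset


dataset = derive_dataset(rows + derived, allowed={"DE-IDENTIFIED"})

# Disease frequency by age range, computed without touching exact ages.
frequency = Counter((rec["age range"], rec["disease type"])
                    for rec in dataset.values())
```

The exact ages never enter the derived dataset; only the curated age-range rows and the non-identifying disease rows are read.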
[0053] In some embodiments, a key-value store, such as Apache Accumulo, may be used to provide granular access control to the data. A key-value store such as Accumulo provides cell-level security with a visibility field in the key. The key-value store paradigm is a data model that stores raw data in a key-value pair and metadata values in the same logical row as additional key-value pairs. The column visibility field is used to store data attributes related to governance or compliance rules specified by the user.
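As a rough illustration of a key that carries a visibility field, the following Python sketch models a store in which a raw datum and its metadata share a logical row and each entry has its own governance attribute. This is not the Accumulo API: the `Key` fields and the `scan` helper are assumptions, and each visibility is treated as a single label for simplicity, whereas a real visibility field may hold richer expressions.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str          # logical row id shared by a datum and its metadata
    column: str       # which entry of the logical row this is
    visibility: str   # governance attribute, e.g. "PHI" or "PUBLIC"
    timestamp: int


# Hypothetical contents: raw data and metadata stored under the same row id.
store = {
    Key("rec-1", "raw data", "PHI", 1): "<patient record file>",
    Key("rec-1", "file type", "PUBLIC", 1): "pdf",
    Key("rec-1", "name", "PHI", 1): "John",
    Key("rec-1", "age group", "DE-IDENTIFIED", 1): "40-49",
}


def scan(store, row, authorizations):
    """Return only the cells of `row` whose visibility the caller holds."""
    return {key.column: value for key, value in store.items()
            if key.row == row and key.visibility in authorizations}


researcher_view = scan(store, "rec-1", {"PUBLIC", "DE-IDENTIFIED"})
```

Because the check is made per cell, two requestors reading the same logical row can receive different subsets of it.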
[0054] In some embodiments, the constituent data component value may comprise stored digital information directly, or point to a location in storage where the digital information is stored. In some embodiments, the metadata descriptors may be formed in response to a data access request. In some embodiments, the data access request would comprise pre-determined metadata descriptors and new metadata descriptors specified either by a system administrator or an end-user (i.e. a request for a specific use and/or context). In some embodiments, the pre-determined metadata descriptors are the result of processing the raw data; the functions that perform this processing are sometimes referred to as data processing functions (DPF). Each data processing function is associated with a specific timestamp or version for all of the components that result from the processing. This associated timestamp is included in the key-value store and is similar to a version control feature. In some embodiments, this version control feature can allow for version rollback to a previously processed state and/or specific application of rules or data management of a processed dataset. Such timestamps can provide a mechanism to assess how a dataset changed over time, as the state of the dataset can be assessed as it was at any point in time.
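One way to picture the timestamp-based versioning is the small Python sketch below. It assumes each output of a data processing function is recorded as a (timestamp, descriptor, value) tuple; the function name `state_as_of`, the integer timestamps, and the sample values are hypothetical.

```python
def state_as_of(versions, when):
    """Return {descriptor: value} using the latest version at or before `when`."""
    state = {}
    for timestamp, descriptor, value in sorted(versions):
        if timestamp <= when:
            state[descriptor] = value  # later versions overwrite earlier ones
    return state


# Hypothetical history: a second DPF run at t=200 re-derives the age group.
versions = [
    (100, "age group", "40-49"),
    (100, "name", "John"),
    (200, "age group", "45-49"),
]

before = state_as_of(versions, 150)   # dataset as it was after the first run
after = state_as_of(versions, 200)    # dataset after the second run
```

Rolling back to a previously processed state then amounts to reading the store as of the earlier timestamp, without deleting the newer rows.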
[0055] In some embodiments, the data can be accessed directly through an application programming interface (API), which can be a set of routines, protocols, and tools for building software applications. These direct access requests may occur through a library call for programmatic access in data science or a call through a representational state transfer (REST) API when accessing the data for an application. A query using these examples of direct data access may trigger a distributed routine to collect the data across various nodes. In another embodiment, the data may be accessed through manufactured datasets, using the distributed compute capability of software tools, such as Accumulo and/or Spark, on the cluster to create batch jobs that use metadata descriptors to assemble the necessary dataset and to generate said dataset in the format requested.
In some embodiments, this dataset may be exported to a specified location to meet governance, privacy and/or compliance requirements.
[0056] The process of authorization regarding data access requests may be simplified for the administration by using tags, attributes, and expressions, which provide administrators with the ability to specify tags, attributes or expressions on the data at a high level. For example, the Accumulo software provides users with a visibility field that allows the use of arbitrary attributes such as PHI, PUBLIC, and DE-IDENTIFIED, which can then be assigned to users/groups for authorization. In addition, Active Directory (AD) groups may be used to link users/groups to authorizations. In one exemplary embodiment, a customer may define a rule for a group called "researchers" in a specified AD location, such as "researcher authorization allows you to see data with attributes PUBLIC and DE-IDENTIFIED". The Accumulo infrastructure allows user attributes identified for users/groups to be defined and used in the same way; this attribute-based access control would authorize users/groups/AD with particular attributes to access data with particular attributes. In addition, there is a priority order of evaluation for rules in the case where the administrator specifies several rules that overlap.
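The "researchers" rule can be sketched as follows. The disclosure does not say how overlapping rules are prioritized, so this example simply assumes that each rule carries a numeric priority and that the lowest number wins for a given group; the rule layout, `authorizations_for`, and `visible` are hypothetical names, and no Active Directory or Accumulo call is made.

```python
# Rules are (priority, group, granted attributes). Assumption: for a given
# group, the rule with the lowest priority number takes precedence.
rules = [
    (1, "researchers", {"PUBLIC", "DE-IDENTIFIED"}),
    (2, "researchers", {"PUBLIC", "DE-IDENTIFIED", "PHI"}),
    (1, "clinicians", {"PUBLIC", "DE-IDENTIFIED", "PHI"}),
]


def authorizations_for(user_groups, rules):
    """Union of the attributes granted by the winning rule of each group."""
    granted = set()
    for group in user_groups:
        matching = [rule for rule in rules if rule[1] == group]
        if matching:
            granted |= min(matching, key=lambda rule: rule[0])[2]
    return granted


def visible(rows, authorizations):
    """Keep (descriptor, value) pairs whose attribute the requestor holds."""
    return [(descriptor, value) for descriptor, value, attribute in rows
            if attribute in authorizations]


# Hypothetical rows: (descriptor, value, attribute).
rows = [
    ("name", "John", "PHI"),
    ("disease type", "asthma", "DE-IDENTIFIED"),
    ("age group", "40-49", "DE-IDENTIFIED"),
]

researcher_auths = authorizations_for(["researchers"], rules)
researcher_rows = visible(rows, researcher_auths)
```

Keeping the rules as data, rather than code, is what lets an administrator change who sees what without touching the stored rows.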
[0057] In accordance with one aspect, the employment of a key-value store permits the storage of, and operation on, at least four types of data, collected or derived, when raw data is received or exists in the key-value store: metadata descriptive of the raw data (e.g. the raw data file itself, file name, file type, file size, etc.); metadata derived from the raw data (e.g. patient name data from the corresponding patient name field within the raw data file); metadata derived from the preliminarily derived metadata (e.g. a pre-determined category, such as age group where the value for such derived logical row is determined from another existing logical row where the metadata descriptor is age); and governance metadata (e.g. retention policies, authorization, owner, etc.). In some examples, the metadata derived from the raw data may be referred to as the tokenization of the original data; this refers to any operation on data associated with a data object, including other logical data rows, in order to protect, analyze, or generate new data from the existing raw data or generated data at a granular level. This tokenization can include obfuscation, aggregation, computation, and the application of filters.
Employing the metadata, the key-value store therefore allows formulation of datasets and access thereto based on context- and requestor-specific characteristics.
[0058] Each key-value logical row gets assigned a unique key for identification. In some embodiments, all key-value logical rows associated with a given set of raw data may be assigned a unique key for identification. In some embodiments, all key-value logical rows associated with a data object may be assigned a unique key for identification. In other words, in some embodiments, when an example of the disclosed system stores raw data, it may assign a unique key identifier, grouping the metadata associated with the raw data as a single logical entity, or grouping the metadata associated with a data object associated with at least one raw data set as a single logical entity. Each collected or derived datum with its unique key, associated metadata descriptors and corresponding constituent data component value is stored as a key-value logical row in the key-value store. In some embodiments, examples of the metadata descriptors for each collected or derived datum include an accessibility authorization and/or sensitivity descriptor, time-sequenced information, and temporal-/locality-based associations.
[0059] Since key values can be used for, among other reasons, identifying, locating, and securing access to data objects, and since data can be indexed and accessed based on the existence of certain metadata: (1) data can be quickly accessed and located based on the existence of specified metadata within the key-value store; (2) derived data sets can be generated directly from the key-value store; and (3) regulatory and administrative compliance can be enforced at a data storage layer (as opposed to at an application layer).
[0060] In various embodiments of the system, a key-value store is employed for granular governance and flexible curation of digital assets.
[0061] In an exemplary embodiment, there is provided a data storage system for generating context-specific datasets based on existing raw data sets. The data storage system comprises a plurality of data storage components and a processing component.
[0062] The plurality of data storage components are in network communication and are configured for distributed storage of a plurality of data objects, wherein each said data object comprises a plurality of constituent data object components. An example of the plurality of data objects includes a set of data related to or derived from either unstructured or structured data received by the system as an input. A constituent data object component includes each set of data that forms a part of the data object and may be generated automatically, derived under system command, or formulated based on unique requests.
[0063] The processing component has a data object key value store accessible thereto, wherein the data object key value store stores a unique key-value logical row for each constituent data object component. In other words, each constituent data object component is stored in the data object key value store, as a unique key-value logical row.
[0064] Furthermore, each key-value logical row comprises: a key for uniquely identifying the key-value logical row; a constituent data object component value for providing component information relating to the constituent data object component associated with the key-value logical row; and a metadata descriptor for describing a data object component characteristic of the constituent data object component value. An example of a key for uniquely identifying the key-value logical row includes a unique identifier for all the data generated, derived, or formulated from an input received. An example of the constituent data object component value may entail actual values for a given constituent data object component, while examples of a metadata descriptor include name and age.
[0065] The system may derive at least one of the constituent data object components.
The system may further employ at least one of the constituent data object component values and derive at least one constituent data object component. In other words, the system may preliminarily derive constituent data object components. Then, using the values of the preliminarily derived constituent data object components, the system may further derive other constituent data object components. This operation may be performed by the system upon requests to the processing component, wherein the request triggers access to constituent data object component values comprising metadata descriptors.
[0066] In some embodiments, each key-value logical row embeds additional management information, such as an access authorization value for restricting access to the constituent data object component values, in response to requests associated with a corresponding authorization. This access authorization value can also be a sensitivity tag or other compliance and/or governance information and/or timestamp information. The access authorization value or sensitivity tag can correspond with a user identity, user role and/or a user group, restricting access to the constituent data object component values.
Some examples of constituent data objects to which access may be restricted include patient records, financial data, or proprietary, confidential or sensitive data. Some examples of user roles, user identities or user groups may include doctors, researchers, banks, and underwriters. In some embodiments, the restriction of the constituent data object component values will be based on governance and/or compliance rules such as data retention, storage requirements, and data ownership. In other embodiments, rules associated with timestamp information or version control information can be used to restrict access to the constituent data objects. Some examples of using timestamp information may include restricting access to the most recent version of constituent data objects or limiting access to older versions of constituent data objects.
[0067] In another exemplary embodiment, at least one of the constituent data object components for a given key-value logical row are derived from the input raw data automatically upon storing the raw data associated with the data object in the data storage components. In one embodiment, the derived data sets may be associated with a set of pre-determined rules, or data processing functions (DPF), which can be used to produce metadata descriptors to the raw data or to add timestamp information or version control.
The derivation may take place under pre-determined requests, under data access requests, or by a system administrator, either at run time or at subsequent times.
[0068] In another embodiment, these rules can be created during ingestion of the data or after the data has been ingested. In some embodiments, these data processing functions (DPF) are developed using a general purpose programming framework, such as Spark and/or MapReduce, which enables curation functions to be run across the constituent data objects.
[0069] In accordance with one aspect, there is disclosed a data storage system for generating a context-specific data set based on a raw data set. A raw data set may include different formats of documents that may be provided to the data storage system. The context-specific data set is generated based on the raw data set, in accordance with specific requisitions made of the data storage system.
[0070] In accordance with one aspect, the data storage system comprises a plurality of network-accessible hardware storage resources, a digital data processor, and a key-value store. The plurality of network-accessible hardware storage resources is in network communication and configured for distributed storage of data objects. The data objects may include any type of data obtained, derived, formulated, and related to, including the raw data itself, upon the receipt of the raw data by the data storage system.
The digital data processor responds to data access requests received over a network, relating to the data objects. Said data access requests may come from end-users regarding the data objects stored in the data storage system. The key-value store is stored in said hardware storage and composed of a unique key-value logical row for each constituent data component of each of the data objects in the data storage system. In accordance with one aspect, a data storage system may contain a number of data objects, which may be composed of constituent data components, related to a raw data set. In some embodiments, a set of data objects or a data object may be related to a raw data set provided to the data storage system. The data object may be composed of constituent data components that were received, derived, or formulated at the time of, or subsequent to, the receipt of the raw data at the data storage system. These constituent data components may include various characteristics and information regarding the raw data itself, the data derived from the raw data, and the data formulated from the data regarding the raw data or derived from the raw data under given requisitions.
[0071] Each said unique key-value logical row is composed of a key for uniquely identifying said unique key-value logical row, a constituent data component value, and a metadata descriptor. In some embodiments, the key for unique identification of said unique key-value logical row may be a value comprising stored digital information. In some embodiments, the key may be formulated from said constituent data component associated with said key-value logical row and a metadata descriptor. In some embodiments, the key may be a combination or combinations of constituent data component values and metadata descriptors. The constituent data component values comprise stored digital information relating to said constituent data component associated with said unique key-value logical row. This digital information may be a value directly obtained, derived, or formulated from the raw data received. In some embodiments, the digital information may store a value indicative of the location where the actual value is stored. Examples of the digital information include an actual first name such as John and a pointer value to a designated location in a data storage. The metadata descriptor describes metadata of said constituent data component value. Metadata generally comprise data that provide information about other data. In some embodiments, metadata describes a resource for purposes such as discovery and identification, including elements such as title, abstract, author, and keywords. In accordance with one aspect, metadata describes containers of data and indicates how compound objects are put together, examples of which include types, versions, relationships and other characteristics of digital materials. In some embodiments, metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information and who can access it.
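Two of the options in this paragraph, a key formulated from the descriptor and value, and a value that is either the datum itself or a pointer to it, can be sketched in Python as below. The hashing scheme, the `Ref` wrapper, the `blob://` location string, and the function names are all assumptions made for illustration; the disclosure does not specify how keys are computed or how references are encoded.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A value that points at a storage location instead of holding the datum."""
    location: str


def make_key(descriptor, value):
    """Formulate a key from the metadata descriptor and the component value
    (assumed scheme: SHA-256 over both, separated by a NUL byte)."""
    return hashlib.sha256(f"{descriptor}\x00{value}".encode("utf-8")).hexdigest()


def resolve(value, storage):
    """Return the datum, following the reference when the value is a pointer."""
    return storage[value.location] if isinstance(value, Ref) else value


# Hypothetical backing storage and two rows: one direct value, one reference.
storage = {"blob://records/0001": "scanned patient record"}
direct_row = (make_key("first name", "John"), "first name", "John")
pointer_row = (make_key("raw data", "blob://records/0001"), "raw data",
               Ref("blob://records/0001"))

first_name = resolve(direct_row[2], storage)
record = resolve(pointer_row[2], storage)
```

Storing a reference keeps large raw files out of the row itself while still letting the row take part in descriptor-based lookups.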
[0072] In accordance with one aspect, at least one key-value logical row for a given data object is directly associated with the raw data set and at least one key-value logical row for the given data object is derived from one or more other key-value logical rows. Examples of directly associated key-value logical rows include the data obtained at the time of the receipt of the raw data, such as file name and file type, and the data derived at run time or at subsequent times, such as first name and last name. In some embodiments, said key-value logical row derived from one or more other key-value logical rows may be derived based on end-user requisitions. In some embodiments, the key-value logical row derived from one or more other key-value logical rows may be derived based on requisitions by a data administrator of the data storage system.
[0073] In accordance with one aspect, in response to a given data access request based on a given metadata descriptor, said digital data processor generates an independent data set via said key-value store by accessing those key-value logical rows having metadata descriptors responsive to said data access request. In some embodiments, the given metadata descriptor may be pre-determined by the system administrator or customized by the end-user. In some embodiments, the responsive metadata descriptors may include metadata descriptors created at run time, at subsequent times when said key-value logical rows were derived or formulated, or when a requisition based on the given metadata descriptor is made.
[0074] In some embodiments, said key-value logical row comprises an access authorization value for restricting access to the corresponding key-value logical row. In accordance with one aspect, the access authorization value may be stored digital information. In accordance with one aspect, the access authorization value may be a combination or combinations of constituent data component values and metadata descriptors. In some embodiments, the access authorization value may be employed for generation of the independent data set, in response to a given data access request, allowing control over the information accessed and the independent data set generation.
[0075] In some embodiments, examples of factors that may be associated with the access authorization include a requesting user identity, a requesting user role, a requesting user group, the constituent data component of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, and the metadata descriptor of the corresponding key-value logical row. In some embodiments, access authority would be determined at the time of the data access request, or at subsequent times, based on the above-noted factors.
[0076] In accordance with one aspect, said independent data set returned in response to said data access request is stored in the data storage system. In some embodiments, the independent data returned is not stored in the data storage system thereafter.
[0077] In accordance with one aspect, at least some of the key-value logical rows are automatically generated from the raw data set upon importing such raw data set into the data storage system. Examples of the key-value logical rows that are automatically generated from the raw data set upon importing such raw data set into the data storage system include file name and file type. In some embodiments, some of the derived key-value logical rows are derived upon a request for such derivation by a user of the data storage system. Examples of the key-value logical rows that are derived upon a request for such derivation by a user of the data storage system include first name and last name.
In some embodiments, an additional key-value logical row is derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, with the corresponding metadata descriptor of the additional key-value logical row being generated based on said obfuscating. Examples of obfuscating include deliberately obscuring an age so that the precise age is not disclosed, while placing the obscured value alongside the other key-value logical rows related to the same raw data as the age key-value logical row, and making it available, subject to access authority, to data access requisitions for generation of an independent data set.
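The age example above can be sketched, under stated assumptions, as a derivation that generalizes a precise age into a fixed-width range. The bucket width of ten years, the `/obfuscated` key suffix, and the `_range` descriptor suffix are hypothetical choices, not taken from the disclosure.

```python
def obfuscate_age(age, width=10):
    """Generalize a precise age into a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def derive_obfuscated_row(row, width=10):
    """Derive an additional row; the original row is left untouched."""
    return {
        "key": row["key"] + "/obfuscated",           # hypothetical key convention
        "value": obfuscate_age(row["value"], width),
        "descriptor": row["descriptor"] + "_range",  # descriptor reflects the obfuscation
    }


age_row = {"key": "obj1/age", "value": 37, "descriptor": "age"}
age_range_row = derive_obfuscated_row(age_row)
```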

[0078] In accordance with one aspect, an additional key-value logical row may be derived by aggregating the constituent data component values of at least two existing key-value logical rows to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row being generated based on said aggregating. Examples of aggregating include aggregating first name and last name to formulate an additional key-value logical row, with the corresponding metadata descriptor name. In some embodiments, examples of aggregating include aggregation of key-value logical rows related to a data object, associated to a raw data.
[0079] In accordance with one aspect, an additional key-value logical row may be derived through a function-based calculation based on the constituent data component values of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row being generated based on said function-based calculation. Examples of said function-based calculation may include decision-making scheme, mathematical function, and other rules, to come up with additional key-value logical row and the corresponding metadata based on existing key-value logical rows.
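The aggregation and function-based derivations described above might be sketched as follows. The helper names `derive_aggregate` and `derive_by_function`, the key strings, and the age-of-majority rule used as the example function are all assumptions for illustration.

```python
def derive_aggregate(rows, new_key, new_descriptor, sep=" "):
    """Aggregate the values of two or more existing rows into one additional row."""
    value = sep.join(str(r["value"]) for r in rows)
    return {"key": new_key, "value": value, "descriptor": new_descriptor}


def derive_by_function(row, func, new_key, new_descriptor):
    """Apply a rule or mathematical function to an existing row's value."""
    return {"key": new_key, "value": func(row["value"]), "descriptor": new_descriptor}


first = {"key": "obj1/first_name", "value": "Ada", "descriptor": "first_name"}
last = {"key": "obj1/last_name", "value": "Lovelace", "descriptor": "last_name"}
age = {"key": "obj1/age", "value": 37, "descriptor": "age"}

# Aggregation: first name + last name -> name.
name_row = derive_aggregate([first, last], "obj1/name", "name")

# Function-based calculation: a simple decision rule on the age value.
adult_row = derive_by_function(age, lambda a: a >= 18, "obj1/is_adult", "is_adult")
```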
[0080] In accordance with one aspect, where an additional key-value logical row is derived by obfuscating the constituent data component value of at least one existing key-value logical row to generate the constituent data component value of the additional key-value logical row, and the corresponding metadata descriptor of the additional key-value logical row, by said obfuscating, the access authorization value of the additionally derived key-value logical row may be the same as that of the existing key-value logical rows from which the additional key-value logical row was derived, or different. In some embodiments, the access authorization for the additional key-value logical row may be pre-determined in association with one or more of the following: a requesting user identity, a requesting user role, a requesting user group, the constituent data component of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, and the metadata descriptor of the corresponding key-value logical row. In some embodiments, the access authorization for the additional key-value logical row may be determined by the system administrator or the data access requestor.
[0081] In accordance with one aspect, there is disclosed a data storage method for generating context-specific datasets based on raw data sets, the method implemented on a data storage system comprising a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects, and a digital data processor for responding to data storage requests received over a network and relating to said data objects. In some embodiments, the context-specific datasets based on raw data sets are generated upon receipt of data access requests by end-users of the method.
Examples of the data access requests may include specific requests for age range data for all the data objects in the data storage. Examples of network-accessible hardware storage may include spinning disks connected for distributed data storage. The method comprises storing a key-value store in one or more said hardware storage resources, directly generating at least one of the key-value logical rows for a given data object from raw data, deriving at least one of the key-value logical rows for the given data object from other key-value logical rows, and generating, in response to a data access request based on one or more metadata descriptors, an independent data set via said key-value store by accessing those key-value logical rows having metadata descriptors responsive to said data access request. In some embodiments, the key-value store comprises a unique key-value logical row for each constituent data component of each data object.
Constituent data component of each data object, with each data object related to at least one raw data, may include information about the raw data, such as file name and file type, information derived from the raw data, such as first name and last name, and information formulated through aggregating, employing function-based calculations, or responding to data access requests. Each key-value logical row comprises a key for uniquely identifying the key-value logical row, a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row, and a metadata descriptor describing metadata of a data component value.
The key for unique identification may be stored digital information, which may be a combination or combinations of constituent data component values and metadata descriptors describing metadata of a data component value. The constituent data component may be an actual value or a pointer to the location of the storage where the actual value is stored. The key, the constituent data component and the metadata descriptor may be created, derived, or formulated at the run time or at subsequent times, in some embodiments pre-determined, in some embodiments under data access requests, and in some embodiments, by system administrator. In accordance with one aspect, at least one of the key-value logical rows for the given data object may be derived from other key-value logical rows. The derivation may take place under pre-determined requests, under data access requests, or by system administrator, both at run time or at subsequent times. In some embodiments, data access request is a request for data, which may be automatic, pre-determined, or user specific. For example, the data access request may be made by the end user or the system administrator. In another example, the data access request may be received at the run time or at subsequent times with the data object existent in the system.
[0082] In accordance with one aspect, there is disclosed a device for generating context-specific datasets based on existing raw data sets, the device being in network communication with a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects. The device comprises a digital data processor and a network communication interface. In some embodiments, the digital data processor responds to data storage requests received over a network and relating to said data objects. In some embodiments, the network communications interface communicatively interfaces one or more requesting users and a key-value store stored on one or more of said hardware storage resources. The key-value store is configured to store a unique key-value logical row for each constituent data object component of each data object, each key-value logical row comprising a key, a constituent data component value, and a metadata descriptor. At least one of the key-value logical rows for a given data object is directly associated with raw data and at least one of the key-value logical rows of the given data object is derived from one or more other key-value logical rows. In response to a data access request based on a given metadata descriptor, the digital data processor generates an independent data set from the key-value store by accessing those key-value logical rows having metadata descriptors responsive to said data access request.
[0083] In one exemplary embodiment, there is provided a system that consists of two manager nodes plus Hadoop-based cluster nodes, wherein each Hadoop-based cluster node in this exemplary system may comprise computing devices that may be classified as either or both Hadoop master nodes and Hadoop data nodes. It should be noted that in other embodiments there may be one manager node or a plurality; in either case, the master node functionalities described below may be carried out by a single master node, or distributed in various manners across the plurality, and the subject matter hereof is not limited to systems with two manager nodes. The manager nodes may carry out the following functions: running any centralized applications that manage the data storage and access functions (including management of the key-value store); providing the web and other (e.g. REST) interfaces for data administration, privacy, security, and governance functions; hosting any web, proxy, or other server functionality (e.g. NGINX); managing and running the master key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); running data analysis or data function applications or libraries (e.g. the PHEMI Data Science Toolkit, Spark, and Zeppelin); managing and running the slave key distribution for administrators or other service principals (e.g. MIT Kerberos Key Distribution Center); and hosting backup components for any other manager node in case of critical failure thereof.
[0084] Referring to Figure 1, there is shown a conceptual schematic of a reference configuration or architecture of the above-mentioned Example 1. In the embodiment shown 100, there are two Manager Nodes 110, 120 and a Hadoop-cluster 130.
Manager Node 1 110, which may be run as a single tenant deployment (e.g. a bare-metal deployment) and/or as a virtualized system (e.g. a VM on a cloud-based or multi-tenant server), comprises the following: a management component 112, which manages and runs, on the digital data processor of Manager Node 1 110, the data deployment, storage, and operational functions, and otherwise facilitates the implementation of the methods disclosed herein, using, for example, the PHEMI Central™ software; a web server, proxy server, and/or load balancer functionality component 112, which may include NGINX; and a master key distribution and management component 113, which may include MIT
Kerberos, for example. The second management node in this example, Manager Node 2 120, comprises: a supplemental management node 122 comprising a digital data processor for customized data analysis as well as supplemental or complementary (e.g. for compute or communicative load balancing) management and running of the data deployment, storage, and operational functions, including, in some cases, taking instructions from or working with Manager Node 1 110; a cluster-computing management function node 121 which implements management of distributed and/or clustered data storage resources, and may include an interface for programming data clusters and providing, for example, fault tolerance, redundancy, and parallelism (and may not be limited to Hadoop-based clusters and HDFS systems, but may interact with other distributed storage systems, including MapR-FS, Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution or file system); and other functional nodes 123 for implementing pre-existing or customized functional tools (e.g. Zeppelin for data analysis of large data sets). In some cases, the supplemental management node 122 is used to generate data collections and datasets from stored data that can be loaded into formats compatible with, or facilitated for, communication, other application layer functionalities, or analysis tools (e.g. Spark DataFrames). As noted above, these nodes can be distributed across one or more master nodes in different combinations.
[0085] Further referring to Figure 1, there is conceptually shown a Hadoop-based cluster 130. The cluster 130 will comprise Hadoop master nodes (or more generally in Hadoop and non-Hadoop examples, a Master Node) and Hadoop data nodes (or more generally in Hadoop and non-Hadoop examples, a Data Node). In general, the master nodes will comprise specially-programmed networked computing devices that provide distributed task management and process orchestration amongst the data nodes.
This task management and process orchestration may include data compute functional modules 132 and data management functional modules 133. The data management functional modules 133 may include, for example, MapReduce and YARN, but may also include other programming interfaces and systems for managing and scheduling computing resources across storage resources. Some embodiments may provide for database-related functionality, including across distributed data nodes, such as that provided by using database management components 133 which may implement, for example, MongoDB.
Such database management component 133 services implement management instructions from the data processor management nodes 112, 122 (e.g. the PHEMI Central configuration information). In embodiments using MongoDB, the MongoDB service runs in a phemi mongo container (or other contained or virtualized implementation, e.g. jail, VM) on all master nodes, running as a multi-member replica set. The data nodes will in general be the "workhorses" of a Hadoop cluster, where data is primarily stored and processed. Data nodes may in some embodiments consist primarily of resources having multiple attached disk drives for local storage and access to cluster data. In Figure 1, the storage functional modules 131 are implemented across the data nodes using HDFS and Accumulo.
[0086] In Example 1, referred to above, the PHEMI Central software uses the Security-Enhanced Linux (SELinux) implementation of Red Hat Enterprise Linux 7.3 operating system. The PHEMI Central application includes the following components running on Manager Node: (i) PHEMI Central: the PHEMI Central application runs as the PHEMI Agile service which runs in the phemi_central Docker container on Manager Node 1 (for resilience, the container and service are also provisioned on Manager Node 2); (ii) NGINX: the NGINX service manages, redirects, and filters network traffic to the correct endpoints, and which runs in the phemi_nginx container on Manager Node 1 (for resilience, the container and service are also provisioned on Manager Node 2);
(iii) Kerberos Key Distribution Centre (or Kerberos KDC): PHEMI Central requires a Kerberos KDC in the enterprise Active Directory to manage principals and key distribution for end users. In addition, PHEMI Central hosts an MIT Kerberos KDC
server to store principals and distribute keys for system services. The local Kerberos KDC is hosted on Manager Node 1, with a second KDC configured on Manager Node for high availability. The PHEMI Central internal KDC operates in a relationship of cross-realm trust with Active Directory's KDC. In this exemplary embodiment, the PHEMI Central application also includes a Dockerized component running across master nodes, running MongoDB in coordination with the containers on Manager Node 1.

[0087] Referring to Figure 2, there is shown a schematic of one embodiment of a system in accordance with the present disclosure. In accordance with this exemplary embodiment there are shown two manager nodes (mgr01 and mgr02) 210A and 210B, three master nodes (mas01 through mas03) 215A, 215B, and 215C, and four data nodes (dn001 through dn004) 280A, 280B, 280C, and 280D. System drives and other non-data drives (not shown) that provide storage for the operational purposes of non-data nodes (e.g. storage used by manager or master nodes) are either redundant or RAIDed, depending on the deployment. Figure 2 shows cluster nodes having restricted access to the following: a GbE network interconnect 250; DNS 260 and NTP 270 functionality; Cloudera Key Trustee Server and Key Trustee KMS key management entities 240 (or other key management services); a Kerberos KDC 230 comprising services implementing a ticket-granting server and an authentication server, or services implementing instructions therefrom; and an Active Directory/LDAP server 220 for user authentication. In some cases, a secure network connection should be used from the hosted machines to the customer network or end-user devices.
[0088] In some embodiments, HDFS (or other big data/distributed file system) utilizes replication for fault-tolerance, fault-recovery, or process efficiency, with every block of data automatically replicated on multiple data nodes. A duplicate can be used as back-up, or different copies can be used as the "live" copy for data requests depending on resource availability and performance (although the latter purpose further requires updating across all duplicates in a different manner than when using duplicates as back-up). In addition, some embodiments may make a number of services available across a cluster. Services may be deployed across manager nodes, master nodes, and data nodes, and therefore resiliency of services (including, for example, Hadoop services and services specific to PHEMI Central, such as the PHEMI Central application and PHEMI Raindrop, NGINX, MongoDB, and others), as well as of encryption keys, may be provided through redundant provisioning and failover scripts. In Hadoop-based embodiments, HDFS in general triplicates data by default, with each block of a file or of data replicated on three different machines. This means each block of data can be recovered with N + 2 redundancy. Different redundancy can be used. In some embodiments, clusters are protected from the customer data center and external environment using firewall rules. In general, however, there are no firewall rules within the cluster. Each PHEMI cluster node has unrestricted access to every other PHEMI cluster node, although in some embodiments an intra-cluster firewall may be implemented.
[0089] Referring again to Figure 2, each exemplary Manager Node 210A and 210B comprises the following hardware (although other similar arrangements are possible): dual power supplies; 2 x Intel Xeon v4 8-core, 2.5 GHz or better; 128 GB RAM, in 16 GB DIMM increments; OS: 2 x 120 GB SAS/SSD disks, RAID-1 configured; /var/logs: 500 GB disk; and /var/data/phemi: 250 GB disk. Each shown master node 215A, 215B, 215C comprises the following hardware (although other similar arrangements are possible): dual power supplies; 2 x Intel Xeon v4 8-core, 2.5 GHz or better;

RAM, in 16 GB DIMM increments; OS: 2 x 120 GB SAS/SSD disks, RAID-1 configured; and /var/logs: 500 GB disk. The shown four data nodes 280A-D
should be deployed with a RAID-10 configuration, each having a total size of 2 TB of the following types (285A-D): /var/data/phemi: 500 GB disk; NameNode: 500 GB disk (28;
JournalNode: 500 GB disk; and Zookeeper: 500 GB disk. Each of the disks 285A-D
may comprise spinning hard drives, flash, SSDs, or other types of data storage media.
[0090] In some systems, the ratio of master nodes to data nodes may be balanced in order to balance data storage and compute functions. In many exemplary configurations, these concerns are balanced. However, in other embodiments data storage nodes having more data storage resources (e.g. additional data drives) may be used as the amount of data increases to create a more storage-intensive system. On the other hand, data nodes may be added having more RAM or more powerful data processing components may be used for more compute-intensive systems or applications. In some embodiments, where data nodes are virtualized, VMs may be apportioned on the fly with more storage, for storage-intensive applications, or more RAM for more compute-intensive applications. In a balanced compute configuration, balanced CPU, memory, and storage may be preferred. Balanced compute may be preferable for the following applications:
Dataset manufacture; Ingest of complex file types at modest rates; Data science with fewer than 5 concurrent users; Proof-of-concept or pilot deployments where load profiles are not well understood. In an exemplary embodiment of a balanced storage-compute configuration, four data nodes could be exposed to offer 12 TB of usable space, with a compute capacity of 64 cores and 512 GB RAM across the cluster. In embodiments, storage-intensive configuration may be preferred and may differ from the balanced compute option by using larger chassis on the data nodes, which can thereby accommodate greater numbers of storage disks. This option may be preferred for storage-heavy workloads such as:
Document and data archives; ETL offloading; Genomic BAM/FASTQ files; Images;
and Data and files that are rarely accessed. In an exemplary embodiment of a storage-intensive configuration, four data nodes could be exposed to offer 24 TB of usable space, with a compute capacity of 64 cores and 512 GB RAM across the cluster. In embodiments, a compute-intensive configuration may be preferred and, in general, would differ from the balanced compute option by having more RAM on the data node.
This option may be preferred for compute-heavy workloads such as: Heavy data science workloads; 5+ concurrent data science users; Complex file types with high streaming ingest rates; or Workloads with high real-time or interactive components. In an exemplary embodiment of a compute-intensive configuration, the four data nodes could expose 12 TB of usable space, with a compute capacity of 64 cores and 1 TB RAM across the cluster.
[0091] Referring to Figure 3, an exemplary key-value schema is shown in 310. It is represented with key-value pair parts following a base representation, such as that of the Accumulo framework, and may also be customized to a specific user's implementation within embodiments disclosed herein. In one embodiment, the detailed description of the key-value pair parts can be found in 320.
[0092] Figures 4 and 5 show schematics for how individual data components are generated in logical rows in a given key-value store. In particular, Figure 4 shows how the data asset 420 (i.e. all logical rows for a data object) is developed based on governance and contextual information stored in a data repository and/or provided by the manager and/or master nodes (as shown collectively as 410). Each row in this example comprises: a Collection ID 420D, which identifies all logical rows in a data asset, data object, or collection; a Row ID 420E, which uniquely identifies each logical row; St_n 420A, which provides an indicator of security, sensitivity or authorization and where n denotes the logical row number; Ts_n 420B, which provides a time stamp and where n denotes the logical row number; Descr_i 420E, which denotes a descriptor of the value; and Val_i 420C, which provides a value (which may be acquired directly from raw data or derived using DPF functions 430). In Figure 5, a similar process is shown, except additional functions or outputs 510 are applied to create additional logical rows that are based on raw data and other derived data.
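The row layout described above can be illustrated with a minimal sketch. The field types, the integer time stamp, the row-ID convention, the default security tag, and the `build_data_asset` helper are assumptions for this example; they are not a description of the DPF functions 430 themselves.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LogicalRow:
    collection_id: str  # Collection ID: shared by all rows of one data asset
    row_id: str         # Row ID: unique per logical row
    security_tag: str   # St: security / sensitivity / authorization indicator
    timestamp: int      # Ts: time stamp (epoch seconds here, an assumption)
    descriptor: str     # Descr: descriptor of the value
    value: Any          # Val: taken from raw data or produced by a derivation


def build_data_asset(collection_id, raw_fields, derivations, timestamp, tag="restricted"):
    """Create rows directly from raw fields, then rows derived from those fields."""
    rows = []

    def add(descriptor, value):
        row_id = f"{collection_id}:{len(rows)}"  # hypothetical row-ID convention
        rows.append(LogicalRow(collection_id, row_id, tag, timestamp, descriptor, value))

    for descriptor, value in raw_fields.items():
        add(descriptor, value)
    for descriptor, (source, func) in derivations.items():
        add(descriptor, func(raw_fields[source]))
    return rows


asset = build_data_asset(
    "c1",
    {"file_name": "note.txt", "birth_year": 1984},
    {"birth_decade": ("birth_year", lambda y: y // 10 * 10)},
    timestamp=0,
)
```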
[0093] Referring to Figure 6, there is shown a schematic representation of the derivation of a context-specific dataset. Data from each of the datasets 420 can be accessed and, based on specific context and security (or sensitivity) tags, a new dataset 610 can be generated, and in some cases stored, for use by a given user or class of users.
Users (not shown) are given access to a dataset, although the dataset 610 can be dynamic in that a change to a security/sensitivity tag, or a governance requirement, may automatically cause the creation of a new dataset 610 or trigger a requirement that the user request a new dataset 610.
[0094] In accordance with one aspect, there is disclosed a computer-readable medium, having stored thereon instructions for execution by a computing device in network communication with a data storage system comprising a plurality of data storage components, each of said data storage components being in network communication, and configured for distributed storage of a plurality of data objects, each said data object comprising of a plurality of constituent data object components, the instructions executable to automatically implement the steps of the methods described herein.
[0095] In accordance with one aspect, there are provided methods, systems, and devices that assess the risk of re-identification of a given dataset or collection of data.
Such dataset or collection of data may include a dataset derived in accordance with methods disclosed herein, or a collection of rows from a key-value store. In some embodiments, the risk of re-identification of a dataset or collection may be assessed by determining the likelihood or probability that a given set, row, or value can be correlated to an identifiable individual or subject. In some embodiments, a given derived dataset can be associated with a risk of re-identification, wherein such a risk provides an indication of a probability that any given data object within the key-value store that is made part of a derived dataset can be associated with an identifiable individual or subject to which the data object pertains. The higher such probability, the greater the re-identification risk indication. This risk indication may also be increased depending on the nature of the data object; for example, if the data object comprises sensitive personal information, such as but not limited to personal health or personal financial information. In general, the risk of re-identification will decrease if personally identifying information can be withheld from a dataset or obfuscated within a dataset. To the extent that this does not impact the informational value of a dataset, or minimally impacts the informational value of a dataset, the re-identification risk can be used to optimally provide informational value while protecting the identity of the subjects of the information within the dataset.
[0096] In some such embodiments, the re-identification risk is a measurement of the likelihood that any data object or data component thereof, or collection thereof, can be linked or associated with the subject or subjects to which it pertains. The number of same or similar data components within a dataset or other collection (that may or may not refer to other subjects) can be used to provide such an assessment of re-identification risk.
In some embodiments, the assessment can provide the k-anonymity property of a given data set, although other methods of assessing re-identification risk that may be known to persons skilled in the art can be used, including t-closeness, l-diversity, and differential privacy. k-anonymity is a property of a given datum, or set of data (including one or more rows) indicating that such datum or set of data cannot be distinguished from k-1 corresponding data or sets of data; an assessment of k-anonymity may be applied in respect of a particular field or type of metadata in a dataset. The k-anonymity property of data is described in Samarati, Pierangela; Sweeney, Latanya (1998).
"Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression", Harvard Data Privacy Lab, which is incorporated by reference herein.
t-closeness, l-diversity, and differential privacy utilize statistical models to provide an indication of similarity between a given data component within a dataset that is used to calculate a risk of re-identification. See Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian (2007). "t-Closeness: Privacy beyond k-anonymity and l-diversity", ICDE, Purdue University; and Dwork, Cynthia (2006). "Differential Privacy"
ICALP'06 Proceedings of the 33rd international conference on Automata, Languages and Programming - Volume Part II, Pages 1-12, which is incorporated by reference herein. In some embodiments, a risk of re-identification is assessed for a given data set and an acceptable threshold may be applied for a given dataset and/or in respect of a particular field or type of metadata within such dataset. For example, for a dataset comprising personal health information ("PHI") and non-PHI, a re-identification risk in respect of the PHI data may be provided for the dataset, as well as another re-identification risk in respect of the non-PHI data may be provided. In another example, for any value that is, or any data set that includes PHI, or other sensitive information (e.g. personal financial, insurance, or other sensitive information), different acceptable threshold risks of re-identification may be applicable than for datasets that do not include PHI.
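For illustration only, the k-anonymity property discussed above can be computed for a small collection of rows as the size of the smallest group of records sharing the same values in the fields under assessment. The record layout and field names below are hypothetical, and this sketch does not represent any particular algorithm used by the disclosed system.

```python
from collections import Counter


def k_anonymity(records, fields):
    """Return k: the size of the smallest group of records that are
    indistinguishable from one another on the given fields."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return min(groups.values()) if groups else 0


records = [
    {"age": 31, "age_range": "30-39", "region": "V5K"},
    {"age": 37, "age_range": "30-39", "region": "V5K"},
    {"age": 42, "age_range": "40-49", "region": "V5K"},
    {"age": 45, "age_range": "40-49", "region": "V5K"},
    {"age": 48, "age_range": "40-49", "region": "V5K"},
]

k_exact = k_anonymity(records, ["age", "region"])          # every age is unique
k_ranged = k_anonymity(records, ["age_range", "region"])   # ages generalized to ranges
```

In this sample, assessing the exact age leaves every record unique (k = 1), whereas assessing the age range yields groups of two and three records (k = 2).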
[0097] In embodiments, upon generating a derived dataset, a risk of re-identification is determined for said dataset. In other embodiments, the re-identification risk may be determined thereafter. Depending on the determined risk, as well as other factors, the dataset may be made available to particular users. This availability may be a function of sensitivity of values in the dataset (e.g. whether it contains PHI or personal financial information ("PFI")), or the risk of re-identification, or the role or trust-level of the person/entity to whom the dataset is being made available (e.g. physician, researcher, bank teller, etc.), or the nature of the availability (e.g., transmission of a new dataset or access to a centralized repository), or the location of the user (e.g. remote laptop, remote server, server room, etc.), or a combination thereof.
[0098] In some embodiments, the re-identification risk may be associated with the concept of zones of trust, or location-based de-identification controls. In general, when datasets are de-identified, the dataset is then sent to (or made available to) approved targets without reference to the location of the target or the security features/risks associated with such a target's location. This may expose a potential risk of re-identification. In embodiments, there may be determined a Risk Acceptability Threshold (RAT) based on a determination of the specific risks associated with the circumstances, such circumstances including the dataset risk or sensitivity (which relates to one or both of a re-identification risk and the sensitivity of such data), an indication of user trust (relating to a level of authorization or trust associated with a given user or entity in association with, in some embodiments, a sensitivity or sensitivities of the data set), and a location-based and/or security-based risk assessment of the computing devices to which the data set is to be provided (which may include associated or intermediary computing devices; e.g. if a computing device is highly secure, but the data set must be transmitted or conveyed thereto via less secure intermediary devices, this may be taken into consideration in some embodiments). For example, RAT may be determined as Max(Dataset risk, User trust, Location controls). An exemplary process in accordance with embodiments hereof may include: (1) optionally first determining an RAT associated with a particular collection of data; (2) applying de-identification or obfuscation to specific fields in accordance with methods disclosed hereunder to generate a de-identified dataset; (3) calculating the risk for each record (e.g. data component) in the dataset using a re-identification risk calculation algorithm (e.g. a k-anonymity determination algorithm); (4) applying a filter to the data to meet the Risk Acceptability Threshold; and (5) restricting the dataset destination to only those targets that meet the Risk Acceptability Threshold. The location-control indication may be a pre-determined value associated with specific types of locations, or it may be determined in an ad hoc manner based on access or security characteristics associated with a specific location. For example, if a given dataset is associated with a 10% RAT, the dataset could be restricted to locations that meet the necessary location-control indication. In such an example, PHEMI Central may restrict target-locations such that a dataset with a 10% RAT can only be sent to a secure research environment and not, for example, downloaded to a user's laptop.
Contrasting this with another dataset that may be de-identified to a 1% RAT
where it may then be downloaded to a user's laptop. In some embodiments, the location-control indication may be associated with a "zone of trust", within which, possibly based on the security and/or ability for third-parties to access, may allow for the provision of more sensitive or risky data sets. Such zones of trust may be determined in advance or dynamically depending on criteria relating to security or indications of such security;
either such case, whether pre-determined or dynamically determined based on criteria and/or circumstances, would constitute a designated zone of trust.
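
The exemplary process of paragraph [0098] may be sketched as follows. This is an illustrative sketch only and not the disclosed implementation: the function names, the 0-to-1 risk scale, the estimate of per-record risk as 1/k (the reciprocal of the number of records sharing the same quasi-identifier values), and the per-location limits are all assumptions introduced for illustration.

```python
from collections import Counter


def risk_acceptability_threshold(dataset_risk, user_trust, location_controls):
    # Mirrors the example formula Max(Dataset risk, User trust, Location controls).
    # Assumption: all three inputs are already expressed on a common 0-to-1 scale.
    return max(dataset_risk, user_trust, location_controls)


def record_risks(records, quasi_identifiers):
    # Step (3): per-record risk estimated as 1/k, where k is the number of
    # records sharing the same quasi-identifier values (k-anonymity style).
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[key] for key in keys]


def filter_to_threshold(records, quasi_identifiers, threshold):
    # Step (4): keep only records whose estimated risk does not exceed the threshold.
    risks = record_risks(records, quasi_identifiers)
    return [r for r, risk in zip(records, risks) if risk <= threshold]


def permitted_locations(threshold, location_limits):
    # Step (5): a location qualifies if the highest risk it tolerates
    # (a hypothetical location-control indication) covers the threshold.
    return sorted(name for name, limit in location_limits.items() if limit >= threshold)


# Hypothetical de-identified records: four share one profile, one is unique.
QUASI_IDENTIFIERS = ["age_range", "region"]
records = [{"age_range": "30-39", "region": "BC"} for _ in range(4)]
records.append({"age_range": "40-49", "region": "BC"})

# Hypothetical location-control indications, following the 10% / 1% example.
LOCATION_LIMITS = {"secure_research_environment": 0.10, "user_laptop": 0.01}

kept = filter_to_threshold(records, QUASI_IDENTIFIERS, 0.25)
print(len(kept))                                   # 4 (the unique record is dropped)
print(permitted_locations(0.10, LOCATION_LIMITS))  # ['secure_research_environment']
print(permitted_locations(0.01, LOCATION_LIMITS))  # both locations qualify
```

Under these assumptions, a dataset held to a 10% threshold can be sent only to the secure research environment, whereas a dataset de-identified to a 1% threshold may also be downloaded to a user's laptop.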

[0099] In some embodiments, there are provided systems and methods for dynamically deriving additional data components associated with an existing dataset that modify the re-identification risk. For example, if a given dataset includes data components that present a re-identification risk (as determined from a k-anonymity property or other re-identification risk determination) that is too high for release to, or use by, a given user or at a user location, additional data components may be derived for a different dataset that, while relating to the same data objects, increase the k-anonymity score. This might include replacing all data components appearing within the data set that include an age with a data component that uses a date range. While this may minimally reduce the informational effectiveness for a researcher, for example, it may nevertheless reduce the re-identification risk significantly, as the number of same or similar rows will be increased. In some embodiments, the possible users, locations, and/or user-location combinations that can access or have the dataset delivered thereto will be increased. Since there is a metric (e.g. RAT) applied to dataset risk, user trust, and location-risk, the system can automatically derive further obfuscated data components for generating new datasets. In some embodiments, the user can indicate which fields should be preferentially obfuscated (or further obfuscated) so as to minimally impact informational effectiveness.
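
The effect described in paragraph [0099] may be illustrated with a short sketch in which an exact age is replaced by a range and the k-anonymity of the resulting data set is recomputed. The ten-year band width, the field names, and the function names are assumptions chosen for illustration and are not taken from the disclosure.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    # k is the size of the smallest group of records that share the same
    # values for every quasi-identifier field.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


def generalize_age(record, width=10):
    # Derive a new data component: the exact age is replaced by a band
    # such as "30-39" (the band width is an assumption).
    low = (record["age"] // width) * width
    derived = {field: value for field, value in record.items() if field != "age"}
    derived["age_range"] = f"{low}-{low + width - 1}"
    return derived


# Hypothetical direct data components.
raw = [
    {"age": 31, "region": "BC"},
    {"age": 34, "region": "BC"},
    {"age": 38, "region": "BC"},
    {"age": 42, "region": "BC"},
    {"age": 47, "region": "BC"},
]
derived = [generalize_age(r) for r in raw]

print(k_anonymity(raw, ["age", "region"]))            # 1: every record is unique
print(k_anonymity(derived, ["age_range", "region"]))  # 2: the smallest band holds two records
```

In this sketch the derived rows sit alongside the original rows rather than replacing them, so either version can be selected when a data set is assembled for a particular user or location.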
[00100] In some embodiments, selectively fulfilling a data request means that a request may or may not be fulfilled. The request may be fulfilled in some embodiments, for example, when a risk of re-identification, as indicated by the re-identification risk value associated with a data request, is lower than would be required under the circumstances. Such circumstances may include, but are not limited to: the types of sensitivity (which may be referred to in some cases as an authorization level) associated with the data being returned in response to a data request; whether or not the request has originated from, or the data is being provided to or accessed from, a designated zone of trust; and/or the identity, role or other characteristic of the individual or entity making the data request. Notably, selectively fulfilling includes circumstances where the context-specific data set may not be provided. In such cases, some but certainly not all embodiments may result in further actions, such as but not limited to dynamically creating new data sets based on other key-value logical rows that have been further obfuscated, dynamically creating new but further obfuscated key-value logical rows, or limiting distribution to (or access from) certain types of designated zones of trust.
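
One possible reading of selective fulfilment is sketched below: the request is fulfilled if the risk value is lower than the applicable threshold; otherwise further-obfuscated data sets are tried in turn, and the request is refused if none qualifies. The `Decision` type, the function names, and the toy risk measure are hypothetical and serve only to illustrate the control flow.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    fulfilled: bool
    dataset: Optional[list]
    risk: float
    obfuscation_level: int


def selectively_fulfil(dataset, risk_of, threshold, obfuscators=()):
    # Fulfil immediately when the risk is lower than the threshold.
    current = dataset
    risk = risk_of(current)
    if risk < threshold:
        return Decision(True, current, risk, 0)
    # Otherwise try successively stronger obfuscation steps.
    for level, obfuscate in enumerate(obfuscators, start=1):
        current = obfuscate(current)
        risk = risk_of(current)
        if risk < threshold:
            return Decision(True, current, risk, level)
    # No variant met the threshold: the data set is not provided.
    return Decision(False, None, risk, len(obfuscators))


def toy_risk(ages):
    # Toy risk measure (assumption): reciprocal of the smallest group size.
    return 1.0 / min(Counter(ages).values())


def to_decade(ages):
    # Toy obfuscation (assumption): round each age down to its decade.
    return [age // 10 * 10 for age in ages]


ages = [31, 34, 38, 42, 47]
granted = selectively_fulfil(ages, toy_risk, threshold=0.6, obfuscators=[to_decade])
refused = selectively_fulfil(ages, toy_risk, threshold=0.2, obfuscators=[to_decade])
print(granted.fulfilled, granted.obfuscation_level, granted.risk)  # True 1 0.5
print(refused.fulfilled, refused.dataset)                          # False None
```

A refusal in this sketch could equally trigger the other actions mentioned above, such as limiting distribution to certain designated zones of trust.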
[00101] While the present disclosure describes various embodiments for illustrative purposes, the disclosure is not intended to be limited to such embodiments. On the contrary, the applicant's teachings described and illustrated herein encompass various alternatives, modifications, and equivalents, without departing from the embodiments, the general scope of which is defined in the appended claims. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure is intended or implied. In many cases the order of process steps may be varied without changing the purpose, effect, or import of the methods described.
[00102] Information as herein shown and described in detail is fully capable of attaining the above-described object of the present disclosure, the presently preferred embodiment of the present disclosure, and is, thus, representative of the subject matter, which is broadly contemplated by the present disclosure. The scope of the present disclosure fully encompasses other embodiments which may become apparent to those skilled in the art, and is to be limited, accordingly, by nothing other than the appended claims, wherein any reference to an element being made in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above described preferred embodiment and additional embodiments as regarded by those of ordinary skill in the art are hereby expressly incorporated by reference and are intended to be encompassed by the present claims. Moreover, no requirement exists for a system or method to address each and every problem sought to be resolved by the present disclosure, for such to be encompassed by the present claims. Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. However, various changes and modifications in form, material, work-piece, and fabrication material detail that may be made without departing from the spirit and scope of the present disclosure, as set forth in the appended claims, and as may be apparent to those of ordinary skill in the art, are also encompassed by the disclosure.
[00103] While the present disclosure describes various exemplary embodiments, the disclosure is not so limited. To the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the general scope of the present disclosure.

Claims (29)

What is claimed is:
1. A data storage system for fulfilling a data request for a context-specific data set, said context-specific data set based on a raw data set, the system comprising:
a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects;
a digital data processor responding to data access requests received over a network and relating to the data objects;
a key-value store comprising a unique key-value logical row for each constituent data component of each of said data objects, each said unique key-value logical row comprising:
a key for identifying said unique key-value logical row;
a constituent data component value comprising stored digital information relating to said constituent data component associated with said unique key-value logical row; and a metadata descriptor describing metadata of said constituent data component value;
wherein at least one key-value logical row for a given data object is a direct key-value logical row directly associated with the raw data set and wherein at least one key-value logical row for the given data object is a derived key-value logical row derived from one or more other key-value logical rows;
wherein, upon said digital data processor generating the context-specific data set responsive to a given data request to the data storage system, said digital data processor further generates a re-identification risk value for the context-specific data set to be associated therewith, said re-identification risk value representative of a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains;
and wherein said given data request is selectively fulfilled by the data storage system as a function of said re-identification risk value.
2. The data storage system of claim 1, wherein said re-identification risk value is generated based on similarities between an aspect of the given constituent data component, and a corresponding aspect of at least one other constituent data component used in the context-specific data set.
3. The data storage system of claim 1, wherein the re-identification risk is generated based on at least one of the following calculated properties of the aspect of the context-specific data set: k-anonymity, t-closeness, l-diversity, and privacy differential.
4. The data storage system of any one of claims 1 to 3, wherein each key-value logical row further comprises a sensitivity value indicating a sensitivity associated with a corresponding key-value logical row.
5. The data storage system of claim 4, wherein the sensitivity value is associated with one or more of the following: a permissible requesting user identifier for the corresponding key-value logical row or an aspect thereof, a predetermined sensitivity tag associated with one or more aspects of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, or the metadata descriptor of the corresponding key-value logical row.
6. The data storage system of any one of claims 1 to 5, wherein said given data request for the context-specific data set so generated in response thereto is selectively fulfilled solely upon said re-identification risk value associated with the context-specific data set being lower than a designated re-identification risk threshold.
7. The data storage system of claim 6, wherein said re-identification risk threshold is automatically determined by the data storage system based on whether a requesting computing device is within a designated zone of trust.
8. The data storage system of claim 6, wherein said re-identification risk threshold is automatically determined by the data storage system based on one or more of:
an identity of a requesting user, a role of the requesting user, sensitivity of data components in the context-specific data set, a location of a requesting computing device, a security indication of the requesting computing device, or a combination thereof.
9. The data storage system of any one of claims 1 to 8, wherein at least one said derived key-value logical row is automatically generated from the raw data set upon importing such raw data set into the data storage system.
10. The data storage system of any one of claims 1 to 8, wherein at least one said derived key-value logical row is derived upon request for such derivation by a user of the system.
11. The data storage system of any one of claims 1 to 8, wherein at least one said derived key-value logical row is automatically derived from one or more pre-existing direct or derived key-value logical rows that are associated with said given data request so to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more existing key-value logical rows and thus reduce a given re-identification risk value associated with a derived context-specific data set relying on said at least one derived key-value logical row given said similarity increase.
12. The data storage system of claim 11, wherein said derived key-value logical row is derived by obfuscating the constituent data component value of said pre-existing key-value logical rows to generate the constituent data component value of the derived key-value logical row, and the corresponding metadata descriptor of the derived key-value logical row being generated based on said obfuscating.
13. The data storage system of claim 11 or 12, wherein said derived context-specific data set is automatically generated by the data storage system upon said re-identification risk value associated with a first context-specific data set being too high to permit selective fulfilment of said given data request.
14. The data storage system of any one of claims 11 to 13, wherein said derived context-specific data set is generated automatically upon said re-identification risk value associated with a first context-specific data set being higher than a first designated threshold.
15. A data storage method for fulfilling a data request for a context-specific dataset based on one or more raw data sets, the method implemented on a data storage system comprising a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects, and a digital data processor for responding to data storage requests received over a network and relating to said data objects, the method comprising:
storing a key-value store comprising a unique key-value logical row for each constituent data component of each data object, each key-value logical row comprising:
a key for identifying the key-value logical row;
a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row; and a metadata descriptor describing metadata of a data component value;
directly generating at least one of the key-value logical rows for a given data object from raw data;

deriving at least one of the key-value logical rows for the given data object from other key-value logical rows;
generating the context-specific data set responsive to the data request;
generating a re-identification risk value for the context-specific data set, the re-identification risk value indicating a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains; and
selectively fulfilling the context-specific data request as a function of said re-identification risk value.
16. The method of claim 15, wherein said re-identification risk value is generated based on similarities between an aspect of the given constituent data component, and a corresponding aspect of at least one other constituent data component used in the context-specific data set.
17. The method of claim 16, wherein said re-identification risk value is generated based on at least one of the following calculated properties of the aspect of the context-specific data set: k-anonymity, t-closeness, l-diversity, or privacy differential.
18. The method of any one of claims 15 to 17, wherein each key-value logical row further comprises a sensitivity value indicating a sensitivity associated with the corresponding key-value logical row.
19. The method of claim 18, wherein the sensitivity value is associated with one or more of the following: a permissible requesting user identifier for the corresponding key-value logical row or an aspect thereof, a predetermined sensitivity tag associated with one or more aspects of the corresponding key-value logical row, the raw data sets from which the corresponding key-value logical row originated, or the metadata descriptor of the corresponding key-value logical row.
20. The method of any one of claims 15 to 19, wherein said selectively fulfilling comprises fulfilling the data request solely upon said re-identification risk value associated with the context-specific data set being lower than a designated risk threshold.
21. The method of claim 20, wherein said risk threshold is determined based on whether a requesting computing device is within a designated zone of trust.
22. The method of claim 20, wherein said risk threshold is determined based on one or more of: an identity of a requesting user, a role of the requesting user, sensitivity of data components in the context-specific data set, a location of a requesting computing device, a security indication of the requesting computing device, or a combination thereof.
23. The method of any one of claims 15 to 22, further comprising:
automatically generating a derived context-specific data set to fulfil the data request, wherein the derived context-specific data set is based on at least one derived key-value logical row that is automatically derived from one or more pre-existing direct or derived key-value logical rows associated with the data request so to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more pre-existing key-value logical rows and thus reduce a given re-identification risk value associated with said derived context-specific data set given said similarity increase.
24. The method of claim 23, wherein the at least one derived key-value logical row is derived by obfuscating the constituent data component value of the corresponding one or more pre-existing key-value logical rows to generate the constituent data component value of the derived key-value logical row, and the corresponding metadata descriptor of the derived key-value logical row being generated based on said obfuscating.
25. The method of claim 23 or 24, wherein the derived context-specific data set is generated upon the re-identification risk associated with a first context-specific data set being too high to permit selective fulfilment of the data request.
26. The method of any one of claims 20 to 22, wherein the derived context-specific data set is generated automatically upon the re-identification risk value associated with a first context-specific data set being higher than a first designated risk threshold.
27. A device for fulfilling a data request for a context-specific dataset based on an existing raw data set, the device being in network communication with a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication, and configured for distributed storage of data objects, the device comprising:
a digital data processor for responding to data storage requests received over a network and relating to said data objects; and a network communications interface for communicatively interfacing one or more requesting users and a key-value store configured to store a unique key-value logical row for each constituent data object component of each data object, each such key-value logical row comprising:
a key for identifying the key-value logical row;
a constituent data component value comprising stored digital information relating to the constituent data component associated with the key-value logical row; and a metadata descriptor describing metadata of the constituent data component value;
wherein at least one key-value logical row for a given data object is a direct key-value logical row directly associated with raw data and wherein at least one key-value logical row for the given data object is a derived key-value logical row derived from one or more other key-value logical rows; and wherein, upon said digital data processor generating the context-specific data set responsive to a given data request to the data storage system, said digital data processor further generates a re-identification risk value for the context-specific data set to be associated therewith, said re-identification risk value representative of a likelihood that a given constituent data component used to generate the context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains;
and wherein said given data request is selectively fulfilled by the data storage system as a function of said re-identification risk value.
28. A computer-readable medium having stored thereon instructions for execution by a computing device for fulfilling a data request for a context-specific dataset based on an existing raw data set, said computing device being in network communication with a data storage system comprising a plurality of data storage components, each of said data storage components being in network communication, and configured for distributed storage of a plurality of data objects, each said data object comprising a plurality of constituent data object components, the instructions executable to automatically implement the steps of any one of the methods of claims 15 to 26.
29. A data storage system for fulfilling a data request for a context-specific data set, said context-specific data set based on a raw data set, the system comprising:

a plurality of network-accessible hardware storage resources, each of said hardware storage resources being in network communication and configured for distributed storage of data objects;
a digital data processor responding to data access requests received over a network and relating to the data objects;
a key-value store comprising a unique key-value logical row for each constituent data component of each of said data objects, each said unique key-value logical row comprising:
a key for identifying said unique key-value logical row;

a constituent data component value comprising stored digital information relating to said constituent data component associated with said unique key-value logical row; and a metadata descriptor describing metadata of said constituent data component value;
wherein, in response to a given data request, said digital data processor:
generates a first context-specific data set based on existing key-value logical rows;
associates a re-identification risk value with said first context-specific data set representative of a likelihood that a given constituent data component used to generate said first context-specific data set can be directly associated with an identifiable subject to which said given constituent data component pertains;
selectively fulfils said given data request based on said re-identification risk value by:
providing access to said first context-specific data set upon said re-identification risk value satisfying a designated risk criterion;
otherwise automatically generating and providing access to a derived context-specific data set so to fulfil the data request, wherein the derived context-specific data set is based on at least one derived key-value logical row that is automatically derived from one or more pre-existing direct or derived key-value logical rows associated with the data request so to automatically increase a similarity between corresponding aspects of constituent data components derived from said one or more pre-existing key-value logical rows and thus reduce a given re-identification risk value associated with said derived context-specific data set given said similarity increase.
CA2986320A 2017-10-10 2017-11-21 Methods and systems for context-specific data set derivation from unstructured data in data storage devices Pending CA2986320A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CA2986320A CA2986320A1 (en) 2017-11-21 2017-11-21 Methods and systems for context-specific data set derivation from unstructured data in data storage devices
PCT/CA2018/051268 WO2019144214A1 (en) 2017-10-10 2018-10-09 Methods and systems for context-specific data set derivation from unstructured data in data storage devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA2986320A CA2986320A1 (en) 2017-11-21 2017-11-21 Methods and systems for context-specific data set derivation from unstructured data in data storage devices

Publications (1)

Publication Number Publication Date
CA2986320A1 true CA2986320A1 (en) 2019-05-21

Family

ID=66811205

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2986320A Pending CA2986320A1 (en) 2017-10-10 2017-11-21 Methods and systems for context-specific data set derivation from unstructured data in data storage devices

Country Status (1)

Country Link
CA (1) CA2986320A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210342482A1 (en) * 2019-05-14 2021-11-04 Google Llc Automatically detecting unauthorized re-identification
US11720710B2 (en) * 2019-05-14 2023-08-08 Google Llc Automatically detecting unauthorized re-identification
CN111723645A (en) * 2020-04-24 2020-09-29 浙江大学 Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN111723645B (en) * 2020-04-24 2023-04-18 浙江大学 Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN111859447A (en) * 2020-07-03 2020-10-30 南京信息职业技术学院 Spark workflow scheduling method and system with privacy protection function
US11741262B2 (en) 2020-10-23 2023-08-29 Mirador Analytics Limited Methods and systems for monitoring a risk of re-identification in a de-identified database
US12135820B2 (en) 2023-06-15 2024-11-05 Google Llc Automatically detecting unauthorized re-identification

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220927

Effective date: 20220927