WO2023200534A1 - Rule-based data governance and privacy engine

Rule-based data governance and privacy engine

Info

Publication number
WO2023200534A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data set
source data
label
zone
Application number
PCT/US2023/014984
Other languages
French (fr)
Inventor
David A. FREELS, Sr.
John RIEWERTS
Keith Jordan
Original Assignee
Acxiom LLC
Application filed by Acxiom LLC
Publication of WO2023200534A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Definitions

  • program instructions may include instructions executable to implement an operating system (not shown), which may be any of various operating systems, such as UNIX, LINUX, Solaris™, MacOS™, or Microsoft Windows™. Any or all of program instructions may be provided as a computer program product, or software, that may include a non-transitory computer-readable storage medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to various implementations.
  • a non-transitory computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer).
  • a non-transitory computer-accessible medium may include computer-readable storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM coupled to the computer system via the I/O interface.
  • a non-transitory computer-readable storage medium may also include any volatile or non-volatile media such as RAM or ROM that may be included in some embodiments of the computer system as system memory or another type of memory.
  • program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.) conveyed via a communication medium such as a network and/or a wired or wireless link, such as may be implemented via a network interface.
  • a network interface may be used to interface with other devices, which may include other computer systems or any type of external electronic device.
  • system memory, persistent storage, and/or remote storage accessible on other devices through a network may store data blocks, replicas of data blocks, metadata associated with data blocks and/or their state, database configuration information, and/or any other information usable in implementing the routines described herein.
  • the I/O interface may coordinate I/O traffic between processors, system memory, and any peripheral devices in the system, including through a network interface or other peripheral interfaces.
  • the I/O interface may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory) into a format suitable for use by another component (e.g., processors).
  • the I/O interface may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example.
  • some or all of the functionality of the I/O interface such as an interface to system memory, may be incorporated directly into the processor(s).
  • a network interface may allow data to be exchanged between a computer system and other devices attached to a network, such as other computer systems (which may implement one or more storage system server nodes, primary nodes, read-only nodes, and/or clients of the database systems described herein), for example.
  • the I/O interface may allow communication between the computer system and various I/O devices and/or remote storage.
  • Input/output devices may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems.
  • the user interfaces described herein may be visible to a user using various types of display screens, which may include CRT displays, LCD displays, LED displays, and other display technologies.
  • the inputs may be received through the displays using touchscreen technologies, and in other implementations the inputs may be received through a keyboard, mouse, touchpad, or other input technologies, or any combination of these technologies.
  • similar input/output devices may be separate from the computer system and may interact with one or more nodes of a distributed system that includes the computer system through a wired or wireless connection, such as over a network interface.
  • the network interface may commonly support one or more wireless networking protocols (e.g., Wi-Fi/IEEE 802.11, or another wireless networking standard).
  • the network interface may support communication via any suitable wired or wireless general data networks, such as other types of Ethernet networks, for example.
  • the network interface may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
  • a read-write node and/or read-only nodes within the database tier of a database system may present database services and/or other types of data storage services that employ the distributed storage systems described herein to clients as network-based services.
  • a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network.
  • a web service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL).
  • Other systems may interact with the network-based service in a manner prescribed by the description of the network-based service’s interface.
  • the network-based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.
  • a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request.
  • a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP).
  • a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP).
  • network-based services may be implemented using Representational State Transfer (REST) techniques rather than message-based techniques.
  • a network-based service implemented according to a REST technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE.
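
As a purely illustrative example of invoking such a REST-style network-based service, the Python sketch below issues a plain HTTP GET against a hypothetical endpoint using only the standard library. The URL, the JSON response format, and the function name are assumptions made for this sketch; they are not part of the disclosure.

```python
import json
import urllib.request

# Hypothetical endpoint; a real deployment would expose its own URL and schema.
ENDPOINT = "https://example.com/api/datasets/123"

def invoke_rest(method: str = "GET"):
    """Invoke a REST-style network-based service with a plain HTTP method."""
    req = urllib.request.Request(ENDPOINT, method=method)
    with urllib.request.urlopen(req) as resp:   # parameters travel in the URL and HTTP verb
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a reachable endpoint that returns JSON):
# invoke_rest("GET")
```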

Abstract

A data rules engine and data processing engine provide a means of capturing data without prior knowledge of its use while providing a dynamic mechanism to ensure compliance with privacy-based laws, rules, and regulations. Labels are used to expose characteristics of the data elements. Actions define what the system is allowed to do with the data elements. Intent defines the intended use of the data. Locations define the locations where the data may be stored in various states. By using these elements together, data may be stored and used in a compliant manner, even if the use of the data changes over time or if the privacy rules related to the data change over time.

Description

RULE-BASED DATA GOVERNANCE AND PRIVACY ENGINE
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. provisional patent application no. 63/330,387, filed on April 13, 2022. Such application is incorporated herein by reference in its entirety.
BACKGROUND
[0002] As more and more data is being used by businesses, it would be desirable to have a system that allows data to be captured without the requirement of having to know how it will be used beforehand. A rules-based approach to the usage of data based on intent of usage, target (user, process), and location (zones) is desirable to ensure that the usage of different types of data is compliant with policies and local laws even as the nature of said policies and laws changes over time.
SUMMARY
[0003] The present invention provides a mechanism that allows data to be consumed and processed without applying data privacy rules until the data are to be used. The data privacy rules are not only applied to the data structures, as they are in pre-existing systems, but the system also adds the ability to consider each data point within each row for additional data privacy restrictions. This gives the data processing engine the ability to set privacy rules for every data point up to the moment it is to be consumed. Additionally, cataloged data privacy rules are applied during the consumption of data in a manner which allows usage to change up to and until the data is sent to the user.
[0004] Most existing systems apply privacy and governance rules against the structure of the data (columns, tables, etc.). In certain embodiments, the present invention is directed to a system that applies privacy and governance rules to data elements within each materialized row (i.e., rows having data values in data elements). The system evaluates each materialized row against all available privacy rules in a way that allows the system to restrict single rows based on the presence and type of data contained therein. The system also allows rows with restricted columns (i.e., data elements) where the data in those columns is either empty or in a state whereby it may be unrestricted according to the data privacy rules. As an example, the user selects three columns of data whose data elements are not sensitive. Since no data privacy rules exist which constrain the use of those data elements, the data should be allowed to be read without restriction. The user then selects an additional column (e.g., date of birth) which, when combined with the first three columns, restricts the use of the data. The data rules engine in certain embodiments evaluates each materialized row to verify which rows are populated with restricted data and then decides whether the materialized row must be restricted or the newest column must be restricted (e.g., encrypted, hashed, removed, etc.). By applying the data privacy rules to the actual data, the system dynamically changes the way the data is used without restructuring the data. This process also allows the system to present virtual data zones for data that may be commingled on physical storage.
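
The following Python fragment is a minimal, purely illustrative sketch of the row-level evaluation described above, not the patented implementation. It restricts only rows whose labeled columns are actually populated; the names CombinationRule, evaluate_row, and the label strings are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical rule: if all listed labels are present *and populated* in a
# materialized row, the row (or its newest column) must be restricted.
@dataclass
class CombinationRule:
    labels: frozenset      # e.g. {"first_name", "last_name", "date_of_birth"}
    action: str            # "restrict_row" or "mask_newest_column"

def evaluate_row(row: dict, column_labels: dict, rules: list, newest_column: str):
    """Return the restriction (if any) triggered by one materialized row.

    row           -- column name -> value for the row
    column_labels -- column name -> label applied to that column
    rules         -- combination rules tested against the populated labels
    """
    # Only labels whose data elements actually hold values count.
    populated = {column_labels[c] for c, v in row.items() if v not in (None, "")}
    for rule in rules:
        if rule.labels <= populated:
            if rule.action == "mask_newest_column":
                return ("mask", newest_column)
            return ("restrict_row", None)
    return ("allow", None)

if __name__ == "__main__":
    labels = {"fname": "first_name", "lname": "last_name", "dob": "date_of_birth"}
    rules = [CombinationRule(frozenset({"first_name", "last_name", "date_of_birth"}),
                             "mask_newest_column")]
    rows = [
        {"fname": "Ann", "lname": "Lee", "dob": "1980-02-01"},  # combination present
        {"fname": "Bob", "lname": "Ray", "dob": ""},            # dob empty -> allowed
    ]
    for r in rows:
        print(evaluate_row(r, labels, rules, newest_column="dob"))
```

The second sample row passes through untouched because its date-of-birth element is empty, mirroring the behavior described in paragraph [0004].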
[0005] These and other features, objects and advantages of the present invention will become better understood from a consideration of the following detailed description of the preferred embodiments and appended claims in conjunction with the drawings as described following:
DRAWINGS
[0006] Fig. 1 is an overall architectural view of a data rules engine according to an embodiment of the present invention.
[0007] Fig. 2 shows the process where the data rules engine takes targets and data labels to generate the allowed actions according to an embodiment of the present invention.
[0008] Fig. 3 is an overall process flow for a data processing engine according to an embodiment of the present invention.
[0009] Fig. 4 is a more detailed process flow for labeling at a data processing engine according to an embodiment of the present invention.
[0010] Fig. 5 is a process flow for the use of machine learning to provide labeling at a data processing engine according to an embodiment of the present invention.
[0011] Fig. 6 is a data flow showing pipelines within an embodiment of the present invention.
[0012] Fig. 7 is a data flow showing the lifecycle of a data request according to an embodiment of the present invention.
DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
[0013] Before the present invention is described in further detail, it should be understood that the invention is not limited to the particular embodiments described, and that the terms used in describing the particular embodiments are for the purpose of describing those particular embodiments only, and are not intended to be limiting, since the scope of the present invention will be limited only by the claims.
[0014] Fig. 1 provides an overall architecture of a system according to an embodiment of the invention. A data layer 14 includes a data rules engine 10, a data processing engine 12, and data zones 16. The data layer will be accessed by various system tools 8. No direct access to the data is allowed; all access must go through the data layer 14. Data rules engine 10 provides a dynamic mechanism to ensure compliance using a series of elements pertaining to the data. The rules established by the data rules engine 10 are organized into profiles that are assigned to a user at runtime. In an embodiment, a user may operate within at most one profile at a time. As additional datasets are brought into the request, the data processing engine 12 continuously evaluates the request using the data rules engine 10 to ensure that the rules are enforced.
[0015] The elements within the data rules engine 10 are used to dynamically establish what a process may do with the selected data sets. The data rules engine 10 takes a list of labels and targets for a request and produces a list of actions that may be taken against the data in the data set. The actions are definitive in that the least amount of privilege will be used. This process flow is illustrated in Fig. 3. At data processing engine 12, the first step is to gather labels 22. Then, data processing engine 12 identifies one or more targets 60, invokes rules 62, and finally receives actions 64. This process is repeated as additional data sets are joined to ensure that new data elements which may trigger a new rule are considered. Each of “labels,” “targets,” “rules,” and “actions” will be further explained below.
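
A minimal sketch of the gather-labels, identify-targets, invoke-rules, receive-actions flow might look like the following. The rule-matching semantics (intersecting the granted actions so that only the least privilege survives) and every identifier are assumptions made for illustration, not the claimed implementation; the two sample rules mirror the Fig. 2 examples discussed later.

```python
# Each rule grants a set of actions when its labels and targets match the request.
# The engine intersects the grants so that only the least privilege survives.
RULES = [
    {"labels": {"PII", "SSN"}, "targets": {"named_user"},
     "actions": {"read", "write", "join"}},
    {"labels": {"anonymous"}, "targets": {"intention", "zone"},
     "actions": {"read", "join"}},
]

def resolve_actions(request_labels: set, request_targets: set) -> set:
    """Gather labels, identify targets, invoke rules, receive actions."""
    granted = None
    for rule in RULES:
        if rule["labels"] & request_labels and rule["targets"] & request_targets:
            # Adding a more restrictive matching rule can only narrow access.
            granted = rule["actions"] if granted is None else granted & rule["actions"]
    return granted or set()   # no matching rule -> no privilege at all

if __name__ == "__main__":
    # A joined data set carrying both PII and anonymous data, requested by a
    # named user operating inside a zone: only read/join survive the intersection.
    print(resolve_actions({"PII", "anonymous"}, {"named_user", "zone"}))
```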
[0016] Referring now to Fig. 2, “labels” 22 are used to expose the characteristics of the data sets or data elements being targeted for access and/or processing. A label is applied to a data element (column, data set, combined column, etc.) and moves with that data element (and new elements generated from that element) as it is used in other data sets as well as when it is joined with other data sets. Labels are user defined and then used to train the machine learning models used when processing data. Illustrative examples of labels 22, as shown in Fig. 2, include personally identifiable data (PII) 24, secure data 26, anonymous data 28, credit card data 30, and social security number (SSN) data 32.
[0017] “Targets” 34 identify a data location (zone), the requestor (a user or processing/data routine), or the intent for which the data will be used. Fig. 1 illustrates a set of data zones 16. Data zones 16 are defined locations where data may be stored in various states. The quarantine zone 18 is predefined and required for storing data that is being ingested into the system. Quarantine zone 18 contains data that has not yet been labeled. As data enters the system it is stored in quarantine zone 18 prior to being labeled and routed. This may be physical storage for large datasets meant for batch processing, or virtual, in-memory storage for streaming datasets. Quarantine zone 18 is a temporary holding area where the data receives pre-processing prior to being labeled. Examples of pre-processing include decrypting files and/or data elements, processing unstructured data to get valuable data elements, and extracting data elements from semi-structured data such as log files.
[0018] All other zones are defined zones 20, defined by the system administrator and combined with rules, labels 22, and targets 34 to establish routing. A defined data zone 20 may be physical, including using different directories, buckets, or file systems, or virtual, which may use additional labeling to represent the zone. The data processing engine 12 enforces the separation of data at runtime based on the provided rules. As for data routing, when using a zone as a target 34, the rules engine 10 will determine the actions that are allowed. When data is joined with other data sets that may cross defined zones and then written for future use, the data set is routed to a zone that allows storage, then labeled, and rules are enforced when using the data. When a physical storage location cannot be used to segregate data, then a virtual zone is created, which exists because the data processing engine 12 and the data rules engine 10 enforce data segregation. As an example, consider defining a public zone (defined zone 20) where most users have the ability to read data. In an embodiment, the rules engine 10 will need to restrict all write actions to that location for sensitive data such as credit card numbers or social security numbers. In the example of Fig. 1, two defined zones 20 are shown, one containing enriched data and another containing activated data.
[0019] Once data has been labeled, metadata 100 about each dataset will be stored in order to optimize performance of retrieval requests. (Metadata 100 is not shown in Fig. 1, but is illustrated in the data request lifecycle example of Fig. 7.) Metadata 100 is isolated from the managed data either physically or using other means such as labelling or naming. Metadata 100 may include, but is not limited to: labels and frequency count per dataset; common relationships between datasets; partition information; data required to build indices (which may contain sensitive data); and the nature of the data (e.g., structured, semi-structured, or unstructured).
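
The metadata items listed above could be captured in a per-dataset record along the following lines; the class and field names are hypothetical and chosen only to mirror the list in paragraph [0019].

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DatasetMetadata:
    """Per-dataset metadata kept apart from the managed data itself."""
    dataset_id: str
    label_counts: Dict[str, int]      # label -> frequency count within the dataset
    related_datasets: List[str]       # commonly joined datasets
    partitions: List[str]             # partition information
    index_fields: List[str]           # fields needed to build indices (may be sensitive)
    structure: str                    # "structured", "semi-structured", or "unstructured"

if __name__ == "__main__":
    meta = DatasetMetadata(
        dataset_id="crm_contacts_2023",
        label_counts={"PII": 4, "SSN": 1, "anonymous": 7},
        related_datasets=["web_activity"],
        partitions=["ingest_date=2023-03-14"],
        index_fields=["email_hash"],
        structure="structured",
    )
    print(meta)
```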
[0020] The target 34 intention type refers to the intended use of the data. Intent is a useful element in that it can be used to provide restrictions that would not normally be applied. Intent(ions) are defined by an administrator to be used by the data rules engine 10. An example would be that a user (target) may be able to read (action) data that has been labeled as personally identifiable information (PII) when the intention is research, but if the intention is to run a marketing campaign, then the data may not be used. In this example, the target user can access the data for research such as doing a background check, but that data cannot be used to construct a campaign to sell advertisements. Fig. 2 shows some illustrative examples of targets 34 as named user target 36, intention target 38, process target 40, collaboration target 42, and zone target 44.
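
An administrator might express the research-versus-marketing example above as data rather than code. The sketch below assumes a simple (label, intention) lookup, which is one possible encoding and not necessarily the one used by the data rules engine 10.

```python
# Hypothetical intent rules: the same user and the same PII label yield different
# actions depending on the declared intention of the request.
INTENT_RULES = {
    ("PII", "research"): {"read"},           # e.g. background checks
    ("PII", "marketing_campaign"): set(),    # PII may not be used at all
}

def actions_for(label: str, intention: str) -> set:
    """Return the actions allowed for a label under a declared intention."""
    return INTENT_RULES.get((label, intention), set())

if __name__ == "__main__":
    print(actions_for("PII", "research"))            # {'read'}
    print(actions_for("PII", "marketing_campaign"))  # set()
```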
[0021] “Actions” 46 define what the system is allowed to do with the elements of the data set. The data rules engine 10 uses the data set labels 22, targets (location, user) 34, and predefined actions to determine a final set of actions 46 that may be performed on the data set. Actions should be considered to be applied at the row level except in some cases where the data has a special action. As an example, if the user requests data that includes a date of birth but a rule exists that will only allow this user to see the year, the data processing engine 12 will apply a mask to that column.
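
The date-of-birth masking example above could be realized as a per-column masking action, as in the following illustrative sketch; the helper names are invented for this example and do not come from the disclosure.

```python
import datetime

def mask_dob_to_year(value: str) -> str:
    """Masking action for a date-of-birth column: expose only the year."""
    return str(datetime.date.fromisoformat(value).year)

def apply_column_mask(rows, column, mask):
    """Apply a masking action to one column across a set of materialized rows."""
    for row in rows:
        if row.get(column):           # empty elements are left untouched
            row[column] = mask(row[column])
    return rows

if __name__ == "__main__":
    rows = [{"name": "Ann", "dob": "1980-02-01"}, {"name": "Bob", "dob": ""}]
    print(apply_column_mask(rows, "dob", mask_dob_to_year))
```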
[0022] In the case of the two examples illustrated in Fig. 2, data processing engine 12 will use data rules engine 10 to dynamically generate actions 46 at a row level by comparing the labels 22 across all columns in the data set and across other data sets that are being joined. When joining one or more data sets, a row of data that is not considered PII 24 (e.g., date of birth, name suffix, etc.) may become PII 24 once joined with another data set that adds the missing pieces of data (e.g., first name, middle name, last name, etc.). This is accomplished using machine learning models to consider the joined data and labels to recommend additional labels for the combined data set. In Fig. 2, a first PII rule definition 54 pulls data with PII label 24, SSN label 32, and uses named user target 36. In this case, actions 46 allowed are write action 48, read action 50, and join action 52. In a second example of anonymous rule definition 56, data with the anonymous label 28 is drawn, with two targets 34 of intention 38 and zone 44. In this second case, the actions 46 include only the read action 50 and the join action 52; the write action 48 is not included.
[0023] The process of data labeling with labels 22 may now be described in more detail with reference to Fig. 4. As source data 66 enters data processing engine 12, it is routed through input router 72 and then evaluated using machine learning to determine the nature of the data. The data will be labeled at labeling 74 and then compared against existing data labels. Data that matches existing labels is then allowed to continue to the data processing engine 12 for processing and routing.
[0024] All data entering the system will be quarantined at quarantine zone 18. Registered data 68 (i.e., data from sources that have been registered with the system) will be allowed to proceed to the labeling phase at labeling 74. Any data that cannot be clearly labeled, which includes data from new sources (unregistered 70) as well as existing sources (registered data 68 that may have new elements), is quarantined pending a user review. Data labeling 74 is performed upon ingestion based upon the predictions that have been verified. Labels 22 are applied at the column level for structured and semi-structured data and at the data set level for unstructured data. All labels 22 are validated against the known labels 22 of previously received data. The labels 22 are used by the data rules engine 10 to make determinations of which actions should be allowed.
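
One way to realize the validate-against-known-labels step is sketched below. It assumes the labeling model emits a confidence score per column (a detail not specified above); columns with unknown labels or low confidence go to quarantine pending user review. The label set and threshold are assumptions for this example.

```python
KNOWN_LABELS = {"PII", "SSN", "credit_card", "anonymous", "secure"}

def route_predicted_labels(predictions: dict, confidence_floor: float = 0.9):
    """Split column-level label predictions into verified labels and a
    quarantine list pending user review.

    predictions -- column name -> (predicted label, model confidence)
    """
    verified, quarantined = {}, []
    for column, (label, confidence) in predictions.items():
        if label in KNOWN_LABELS and confidence >= confidence_floor:
            verified[column] = label
        else:
            quarantined.append(column)   # new or unclear element -> user review
    return verified, quarantined

if __name__ == "__main__":
    preds = {"ssn": ("SSN", 0.99), "notes": ("free_text", 0.42)}
    print(route_predicted_labels(preds))
```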
[0025] Data from data processing engine 12 passes through data rules engine router 76 to reach data rules engine 10. Various labels 22 are applied, as described above. The labeled data 78 is then sent to a defined zone 20.
[0026] The machine learning component of data rules engine 10 may now be described with reference to Fig. 5. Machine learning is used to apply labels 22 to the data elements of source data 66. As new labels 22 are created or removed, models are marked as out of date and need to be re-trained at training 80. There are two levels of training: user training to provide the base models and system training that learns from the feedback provided by users 88 as corrections are made during the review of system suggested labels. Training data 82 is used to provide training 80 against source data 66 in order to apply labels 22 through data rules engine 10.
[0027] In the training 80 process, machine-learning powered labeling 74 may utilize multiple label models, with a model being applicable to each of the labels 22 to be used. Validation 84 is the process whereby the trained model resulting from the use of training data 82 is evaluated using existing labeled data 86. If the result of validation 84 is that the labels are determined to be inaccurate at decision box 88, then processing returns to training 80 for another attempt. If the result of validation 84 is that the labels are determined to be accurate at decision box 88, then the labeled data 78 is stored in defined zone 20 because labeling 74 is now complete.
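
The training and validation loop of Fig. 5 might be organized as follows. The toy "model" and the accuracy threshold are placeholders, since the actual machine-learning models are not specified; only the loop structure (train, validate against existing labeled data, return to training if inaccurate) mirrors the description.

```python
import random

def train(label, training_rows):
    """Stand-in for fitting one model per label; the real models are not specified."""
    tokens = {t for row in training_rows for t in row.split()}
    return lambda value: any(t in value for t in tokens)   # toy substring matcher

def validate(model, labeled_examples, threshold=0.95):
    """Evaluate a trained model against existing labeled data (validation 84)."""
    hits = sum(1 for value, expected in labeled_examples if model(value) == expected)
    return hits / len(labeled_examples) >= threshold

def train_until_accurate(label, training_rows, labeled_examples, max_rounds=5):
    """Loop back to training 80 until validation judges the labels accurate."""
    for _ in range(max_rounds):
        model = train(label, training_rows)
        if validate(model, labeled_examples):
            return model                  # accurate -> labeling can complete
        random.shuffle(training_rows)     # otherwise, another training attempt
    raise RuntimeError(f"model for label {label!r} did not reach the accuracy target")

if __name__ == "__main__":
    model = train_until_accurate(
        "SSN",
        training_rows=["123-45-6789", "987-65-4321"],
        labeled_examples=[("123-45-6789", True), ("hello world", False)],
    )
    print(model("987-65-4321"))
```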
[0028] As previously mentioned, source data 66 will be classified as structured, unstructured or semi-structured as it enters data processing engine 12. Each incoming source data 66 set will be identified, parsed, and then put into a format that may be processed further. Depending on the classification, varying degrees of pre-processing will be required. The tokens produced from this process are the data elements that are to be governed and secured.
[0029] Structured data is data whose elements are addressable for effective analysis. Relational data is an example of structured data. This data has keys that allow it to be easily mapped into pre-designed fields. Structured data is pre-processed to determine the structure characteristics such as column names, parsing instructions (separator, quote character, etc.), and character sets. This data is then sent to quarantine zone 18 for further analysis.
[0030] Semi-structured data is data that does not reside in a rigid structure (such as a table or relational database), but that still has some organizational properties that make it easier to analyze. XML and log data are examples of semi-structured data. Semi-structured data is pre-processed much like structured data, but some attributes may require additional changes. Each attribute needs to be considered to determine if it represents a single data element or if the element can be burst into multiple new elements. Maps/objects and arrays are examples of complex attributes that may be found on semi-structured JSON data, for example. Log files may contain fields that have comma-separated values. Being able to identify the individual tokens allows greater control over privacy and governance with respect to semi-structured data.
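
Bursting a complex semi-structured attribute into individual data elements could be done recursively, as in this sketch. The dotted and indexed naming scheme, and the treatment of comma-separated strings as log-style fields, are assumptions for illustration rather than part of the disclosure.

```python
import json

def burst(attribute_name, value):
    """Burst one semi-structured attribute into one or more data elements."""
    if isinstance(value, dict):                   # map/object -> one element per key
        out = {}
        for k, v in value.items():
            out.update(burst(f"{attribute_name}.{k}", v))
        return out
    if isinstance(value, list):                   # array -> one element per position
        out = {}
        for i, v in enumerate(value):
            out.update(burst(f"{attribute_name}[{i}]", v))
        return out
    if isinstance(value, str) and "," in value:   # comma-separated log-style field
        return {f"{attribute_name}[{i}]": part.strip()
                for i, part in enumerate(value.split(","))}
    return {attribute_name: value}

if __name__ == "__main__":
    record = json.loads('{"user": {"name": "Ann", "phones": ["555-0100", "555-0101"]},'
                        ' "tags": "vip,beta"}')
    elements = {}
    for name, val in record.items():
        elements.update(burst(name, val))
    print(elements)
```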
[0031] Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model, and thus is not readily transferred into a structured database. Unstructured data requires pre-processing that tokenizes the data to produce a set of attributes. A special pre-processing pipeline is used to parse the entire data set, produce valid tokens, and then use machine-learning models to ensure that the tokens are valid. These models will initially be trained by an administrator but will also use a form of self-learning to improve as more unstructured data enters the system. Once the tokenized data set has been produced, the normal data label processing is used.
[0032] Referring now to Fig. 6, data pipelines through the system data rules engine 10 and data processing engine 12 may be described. Data processing engine 12 is responsible for all data requests and extract, transform, load (ETL) processing that is performed on the source data 66. The data processing engine 12 consists of one or more processes that organize, transform, and segregate data elements to ensure compliance with the governance and privacy rules 90 defined in the system. An administrator is able to define data zones 44 (as described above) within the system that have additional rules 90 as to what types of data may be stored. Data zones may be physical or virtual.
[0033] Data pipelines shepherd data through the system, with each pipeline evaluating the active data set of source data 66 against the data rules engine 10 to ensure that the requested actions are allowed. Prior to writing the data set, the actions are evaluated to determine how the different parts of the data set must be stored (plain text, hashed, encrypted) and whether the zone(s) 44 can accept the data. As data sets are joined to create new data sets, machine learning is used to apply new labels 22 using the existing labels provided by each data element, as described above.
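
The pre-write evaluation described above might be sketched as follows, with hypothetical per-label storage treatments and zone acceptance sets standing in for the rules 90; none of these names or mappings come from the disclosure.

```python
# Hypothetical storage treatments keyed by label, and zone acceptance rules.
STORAGE_TREATMENT = {"SSN": "encrypted", "credit_card": "encrypted",
                     "PII": "hashed"}                  # everything else: plain text
ZONE_ACCEPTS = {"public": {"anonymous"},               # e.g. a virtual public zone
                "enriched": {"anonymous", "PII", "SSN", "credit_card"}}

def plan_write(column_labels: dict, zone: str):
    """Decide, before writing, how each part of the data set must be stored
    and whether the target zone can accept the data at all."""
    allowed = ZONE_ACCEPTS.get(zone, set())
    if not set(column_labels.values()) <= allowed:
        raise PermissionError(f"zone {zone!r} does not accept this data set")
    return {col: STORAGE_TREATMENT.get(label, "plain text")
            for col, label in column_labels.items()}

if __name__ == "__main__":
    labels = {"ssn": "SSN", "email": "PII", "segment": "anonymous"}
    print(plan_write(labels, "enriched"))
    # plan_write(labels, "public") would raise: SSN/PII are not allowed there.
```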
[0034] In the example of Fig. 6, it may be seen that data processing engine 12 is accessed from data rules engine 10 at access 92. At this point, access 92 may receive rules 90, zones 44, targets 34, and actions 46. Processing moves to anonymization 94 to anonymize the source data 66 according to rules 90. Data rules engine router (routing) 76 receives rules 90 and zones 44 in order to perform its operations. Labeling 74 likewise receives labels 22 from data rules engine 10. Each of labeling 74, data rules engine router 76, and access 92 maintains data pipelines to the various data zones, including quarantine zone 18 and one or more defined zones 20 (with two defined zones 20 shown in Fig. 6).
[0035] Data processing engine 12 is used as the primary access mechanism when a data set must be secured. Data sets that are allowed by the data rules engine 10 to exist outside the system are delivered to a location where they can be accessed based on predefined rules 90 and approved actions 46.
[0036] Fig. 7 shows the lifecycle of a data request as a data flow. Step (a) represents the request being made from an API or BI tool (within system tools 8) using, in this specific example, the Java Database Connectivity (JDBC) connector 96. The provided JDBC connector 96 invokes the data processing engine 12, providing the request at step (b). The action generator module 98 then reads in the stored metadata 100 about the data set(s) being accessed at step (c) from fetch labels submodule 108. Next, at step (d), labels 22 and targets 34 are sent to the data rules engine 10 to receive the actions 46 for the data elements within the data set(s) at get actions submodule 102 and identify targets submodule 110. At step (e), actions 46 are then passed to the data reader 104, which performs a fetch row operation at step (f), and enforces actions 46 on the data elements in each row at step (g) using the governance module 106 to ensure the correct behavior at step (h). Finally, the data is streamed back in a response at step (i).
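
The step (b) through (i) lifecycle can be summarized as a small pipeline with injected collaborators. Every class and method name below is a hypothetical stand-in for the corresponding module in Fig. 7, not the patent's actual interface; the stubs exist only so the sketch runs end to end.

```python
class MetadataStore:                       # stands in for fetch labels submodule 108
    def fetch_labels(self, query):
        return {"dob": "PII", "segment": "anonymous"}, {"named_user"}

class RulesEngine:                         # stands in for get actions submodule 102
    def get_actions(self, labels, targets):
        return {"read", "mask:dob"}

class DataReader:                          # stands in for data reader 104
    def fetch_rows(self, query):
        yield {"dob": "1980-02-01", "segment": "A"}

class Governance:                          # stands in for governance module 106
    def enforce(self, row, actions, labels):
        if "mask:dob" in actions and row.get("dob"):
            row = dict(row, dob=row["dob"][:4])   # expose year only
        return row

def handle_request(query, metadata, rules, reader, governance):
    """Steps (b)-(i) of the request lifecycle, with each collaborator injected."""
    labels, targets = metadata.fetch_labels(query)      # steps (c)-(d)
    actions = rules.get_actions(labels, targets)        # step (e)
    for row in reader.fetch_rows(query):                # step (f)
        yield governance.enforce(row, actions, labels)  # steps (g)-(h); (i) streams back

if __name__ == "__main__":
    print(list(handle_request("SELECT * FROM contacts",
                              MetadataStore(), RulesEngine(),
                              DataReader(), Governance())))
```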
[0037] The governance module 106 is designed to enforce the behavior of an action 46. The requesting user may have access to read the requested data; however, that access may carry other connotations, such as that the data may never leave the system unencrypted or that the data must be anonymized, for example. The intended usage of the data may also be a factor in how the data is presented. A user doing an audit may be allowed to see the data in plain text, whereas a user building a marketing campaign may cause the data to be anonymized using a custom algorithm. Additionally, a data element may allow a user to read it; however, the presence of another label 22 within the same row may change that behavior to anonymize.
[0038] All managed data is stored using filesystems/object stores (such as, but not limited to, S3, GCS, Blob, and HDFS formats) and encrypted while at rest (in certain embodiments, with minimum 256-bit keys), with security controls (such as ACLs, restrict public access, etc.) in place to prevent unauthorized access. All security controls in certain embodiments are native to the storage provider. External access to data is only allowed by using the interfaces provided by data processing engine 12. Any file system that can store data encrypted and be secured may be used for physical storage, but the most common types are AWS S3, GCP GCS, Azure Blob, and Hadoop HDFS.
[0039] The systems and methods described herein may in various embodiments be implemented by any combination of hardware and software. For example, in one embodiment, the systems and methods may be implemented by a computer system or a collection of computer systems, each of which includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors. The program instructions may implement the functionality described herein. The various systems and displays as illustrated in the figures and described herein represent example implementations. The order of any method may be changed, and various elements may be added, modified, or omitted.
[0040] A computing system or computing device as described herein may implement a hardware portion of a cloud computing system or non-cloud computing system, as forming parts of the various implementations of the present invention. The computer system may be any of various types of devices, including, but not limited to, a commodity server, personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, a consumer device, application server, storage device, telephone, mobile telephone, or in general any type of computing node, compute node, compute device, and/or computing device. The computing system includes one or more processors (any of which may include multiple processing cores, which may be single or multi-threaded) coupled to a system memory via an input/output (I/O) interface. The computer system further may include a network interface coupled to the I/O interface.
[0041] In various embodiments, the computer system may be a single processor system including one processor, or a multiprocessor system including multiple processors. The processors may be any suitable processors capable of executing computing instructions. For example, in various embodiments, they may be general-purpose or embedded processors implementing any of a variety of instruction set architectures. In multiprocessor systems, each of the processors may commonly, but not necessarily, implement the same instruction set. The computer system also includes one or more network communication devices (e.g., a network interface) for communicating with other systems and/or components over a communications network, such as a local area network, wide area network, or the Internet. For example, a client application executing on the computing device may use a network interface to communicate with a server application executing on a single server or on a cluster of servers that implement one or more of the components of the systems described herein in a cloud computing or non-cloud computing environment as implemented in various subsystems. In another example, an instance of a server application executing on a computer system may use a network interface to communicate with other instances of an application that may be implemented on other computer systems.
[0042] The computing device also includes one or more persistent storage devices and/or one or more I/O devices. In various embodiments, the persistent storage devices may correspond to disk drives, tape drives, solid state memory, other mass storage devices, or any other persistent storage devices. The computer system (or a distributed application or operating system operating thereon) may store instructions and/or data in persistent storage devices, as desired, and may retrieve the stored instructions and/or data as needed. For example, in some embodiments, the computer system may implement one or more nodes of a control plane or control system, and persistent storage may include the SSDs attached to that server node. Multiple computer systems may share the same persistent storage devices or may share a pool of persistent storage devices, with the devices in the pool representing the same or different storage technologies.

[0043] The computer system includes one or more system memories that may store code/instructions and data accessible by the processor(s). The system’s memory capabilities may include multiple levels of memory and memory caches in a system designed to swap information in memories based on access speed, for example. The interleaving and swapping may extend to persistent storage in a virtual memory implementation. The technologies used to implement the memories may include, by way of example, static random-access memory (RAM), dynamic RAM, read-only memory (ROM), non-volatile memory, or flash-type memory. As with persistent storage, multiple computer systems may share the same system memories or may share a pool of system memories. System memory or memories may contain program instructions that are executable by the processor(s) to implement the routines described herein. In various embodiments, program instructions may be encoded in binary, Assembly language, interpreted or bytecode-compiled languages such as Java, compiled languages such as C/C++, or in any combination thereof; the particular languages given here are only examples. In some embodiments, program instructions may implement multiple separate clients, server nodes, and/or other components.
[0044] In some implementations, program instructions may include instructions executable to implement an operating system (not shown), which may be any of various operating systems, such as UNIX, LINUX, Solaris™, MacOS™, or Microsoft Windows™. Any or all of program instructions may be provided as a computer program product, or software, that may include a non-transitory computer-readable storage medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to various implementations. A non-transitory computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). Generally speaking, a non-transitory computer-accessible medium may include computer-readable storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM coupled to the computer system via the I/O interface. A non-transitory computer-readable storage medium may also include any volatile or non-volatile media such as RAM or ROM that may be included in some embodiments of the computer system as system memory or another type of memory. In other implementations, program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.) conveyed via a communication medium such as a network and/or a wired or wireless link, such as may be implemented via a network interface. A network interface may be used to interface with other devices, which may include other computer systems or any type of external electronic device. In general, system memory, persistent storage, and/or remote storage accessible on other devices through a network may store data blocks, replicas of data blocks, metadata associated with data blocks and/or their state, database configuration information, and/or any other information usable in implementing the routines described herein.
[0045] In certain implementations, the I/O interface may coordinate I/O traffic between processors, system memory, and any peripheral devices in the system, including through a network interface or other peripheral interfaces. In some embodiments, the I/O interface may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory) into a format suitable for use by another component (e.g., processors). In some embodiments, the I/O interface may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. Also, in some embodiments, some or all of the functionality of the I/O interface, such as an interface to system memory, may be incorporated directly into the processor(s).
[0046] A network interface may allow data to be exchanged between a computer system and other devices attached to a network, such as other computer systems (which may implement one or more storage system server nodes, primary nodes, read-only nodes, and/or clients of the database systems described herein), for example. In addition, the I/O interface may allow communication between the computer system and various I/O devices and/or remote storage. Input/output devices may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems. These may connect directly to a particular computer system or generally connect to multiple computer systems in a cloud computing environment, grid computing environment, or other system involving multiple computer systems. Multiple input/output devices may be present in communication with the computer system or may be distributed on various nodes of a distributed system that includes the computer system. The user interfaces described herein may be visible to a user using various types of display screens, which may include CRT displays, LCD displays, LED displays, and other display technologies. In some implementations, the inputs may be received through the displays using touchscreen technologies, and in other implementations the inputs may be received through a keyboard, mouse, touchpad, or other input technologies, or any combination of these technologies.
[0047] In some embodiments, similar input/output devices may be separate from the computer system and may interact with one or more nodes of a distributed system that includes the computer system through a wired or wireless connection, such as over a network interface. The network interface may commonly support one or more wireless networking protocols (e.g., Wi-Fi/IEEE 802.11, or another wireless networking standard). The network interface may support communication via any suitable wired or wireless general data networks, such as other types of Ethernet networks, for example. Additionally, the network interface may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
[0048] Any of the distributed system embodiments described herein, or any of their components, may be implemented as one or more network-based services in the cloud computing environment. For example, a read-write node and/or read-only nodes within the database tier of a database system may present database services and/or other types of data storage services that employ the distributed storage systems described herein to clients as network-based services. In some embodiments, a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network. A web service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL). Other systems may interact with the network-based service in a manner prescribed by the description of the network-based service’s interface. For example, the network-based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.
[0049] In various embodiments, a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request. Such a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP). To perform a network-based services request, a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP). In some embodiments, network-based services may be implemented using Representational State Transfer (REST) techniques rather than message-based techniques. For example, a network-based service implemented according to a REST technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE.
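Purely as an illustration of the REST style described above, a client might invoke a hypothetical row-retrieval endpoint of the data processing engine with an HTTP GET; the URL, path, parameters, and authorization scheme shown are assumptions, not a defined API.

```python
# Illustrative REST invocation of a hypothetical data-request endpoint.
import requests

response = requests.get(
    "https://dpe.example.com/v1/data-sets/customers/rows",   # placeholder endpoint
    params={"intent": "audit", "destination": "internal"},
    headers={"Authorization": "Bearer <token>"},              # placeholder credential
    timeout=30,
)
response.raise_for_status()
for row in response.json()["rows"]:
    print(row)
```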
[0050] Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of the exemplary methods and materials are described herein. It will be apparent to those skilled in the art that many more modifications are possible without departing from the inventive concepts herein.
[0051] All terms used herein should be interpreted in the broadest possible manner consistent with the context. When a grouping is used herein, all individual members of the group and all possible combinations and subcombinations of the group are intended to be individually included. When a range is stated herein, the range is intended to include all subranges and individual points within the range. All references cited herein are hereby incorporated by reference to the extent that there is no inconsistency with the disclosure of this specification.
[0052] The present invention has been described with reference to certain preferred and alternative embodiments that are intended to be exemplary only and not limiting to the full scope of the present invention, as set forth in the appended claims.

Claims

1. A data governance and privacy system, comprising:
a data rules engine configured to process at least one label and at least one target for a source data set and produce at least one rule for the source data set that delineates at least one action that may be taken against the source data set, wherein the at least one label comprises a characteristic of the source data set and the at least one target comprises a data location, a data requestor, or an intent for the source data set;
a data processing engine in communication with the data rules engine, wherein the data processing engine is configured to gather the at least one label for the source data set, identify the at least one target for the source data set, invoke at least one rule for the source data set, and receive at least one action for the source data set; and
a data zone in communication with the data processing engine and configured to store the source data set.
2. The system of claim 1, wherein the data zone comprises a quarantine zone comprising data being ingested.
3. The system of claim 2, wherein the data zone further comprises at least one defined zone comprising data combined with at least one rule, at least one label, or at least one target.
4. The system of claim 3, wherein the data zone further comprises a metadata zone comprising metadata for the source data set, wherein the metadata zone is isolated from the at least one defined zone.
5. The system of claim 4, wherein the metadata for the source data set comprises one or more of labels and frequency count for the source data set, common relationships for the source data set, partition information for the source data set, data required to build indices for the source data set, or the structure of the source data set.
6. The system of claim 5, further comprising a machine learning system comprising a set of training data, wherein the data rules engine is configured to utilize the training data to produce at least one rule for the source data set.
7. The system of claim 6, further comprising a validation module to compare an output of the machine learning system using the set of training data with a set of existing labeled data to determine an accuracy of the at least one label applied to the source data set using the training data.
8. The system of claim 7, wherein the machine learning system comprises a separate model for each of the at least one label.
9. The system of claim 1, wherein the at least one rule comprises anonymization of the source data set.
10. The system of claim 1, further comprising at least one tool configured to send a request to the data rules engine for processing of the source data set.
11. The system of claim 1, wherein the data processing engine further comprises a governance module configured to implement at least one action for the source data set.
12. The system of claim 11, wherein the data processing engine further comprises a data reader in communication with the governance module and the data layer, wherein the data reader is configured to perform a fetch row operation against the source data set and enforce actions on the source data set using the at least one rule from the governance module.
13. A process for rules-based data governance and privacy, comprising the steps of:
receiving a source data set at a data layer;
applying at least one label to the source data set, wherein the at least one label comprises a characteristic of the source data set; and
applying at least one target to the source data set, wherein the at least one target comprises a data location, a data requestor, or an intent for the source data set.
14. The method of claim 13, wherein the data layer comprises a quarantine zone and at least one defined zone, and the step of receiving the source data set at the data layer comprises the step of receiving the source data set at the quarantine zone.
15. The method of claim 14, further comprising the step of, after applying at least one label to the source data set and applying at least one target to the source data set, routing the source data set to the at least one defined zone.
16. The method of claim 15, further comprising the step of pre-processing the source data set prior to applying at least one label to the source data set.
17. The method of claim 16, wherein the source data set comprises either semi-structured data or unstructured data, and wherein the step of pre-processing the source data set comprises the step of extracting data elements from the source data set.
18. The method of claim 17, wherein the data layer further comprises a metadata zone, and wherein the method further comprises the step of storing metadata for the source data set in the metadata zone.
19. The method of claim 18, further comprising the step of applying a machine learning model to recommend at least one additional label for the source data set.
20. The method of claim 19, further comprising the step of validating the recommendation of at least one additional label for the source data set using at least one existing labeled data set.
21. The method of claim 20, further comprising the step of receiving user feedback and again performing the step of validating the recommendation of at least one additional label for the source data set using the user feedback.
22. The method of claim 21, further comprising the step of, if the step of validating the recommendation of at least one additional label for the source data set using at least one existing labeled data set returns a result of an inaccurate label, again applying the machine learning model to recommend at least one corrected label for the source data set.
PCT/US2023/014984 2022-04-13 2023-03-10 Rule-based data governance and privacy engine WO2023200534A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263330387P 2022-04-13 2022-04-13
US63/330,387 2022-04-13

Publications (1)

Publication Number Publication Date
WO2023200534A1 true WO2023200534A1 (en) 2023-10-19

Family

ID=88330168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/014984 WO2023200534A1 (en) 2022-04-13 2023-03-10 Rule-based data governance and privacy engine

Country Status (1)

Country Link
WO (1) WO2023200534A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080118150A1 (en) * 2006-11-22 2008-05-22 Sreeram Viswanath Balakrishnan Data obfuscation of text data using entity detection and replacement
US20140280248A1 (en) * 2013-03-14 2014-09-18 International Business Machines Corporation Row-based data filtering at a database level
US20150199410A1 (en) * 2007-01-05 2015-07-16 Digital Doors, Inc. Information Infrastructure Management Data Processing Tools With Configurable Data Stores and Data Mining Processes
US20210150060A1 (en) * 2018-04-27 2021-05-20 Cisco Technology, Inc. Automated data anonymization
US20210377240A1 (en) * 2020-06-02 2021-12-02 FLEX Integration LLC System and methods for tokenized hierarchical secured asset distribution

Legal Events

Code 121 (EP): the EPO has been informed by WIPO that EP was designated in this application.
Ref document number: 23788727
Country of ref document: EP
Kind code of ref document: A1