WO2016068883A1 - Entity anonymization for a query directed to a multiplex graph - Google Patents

Entity anonymization for a query directed to a multiplex graph

Info

Publication number
WO2016068883A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph
multiplex graph
multiplex
given
node
Prior art date
Application number
PCT/US2014/062659
Other languages
French (fr)
Inventor
Luis Miguel Vaquero Gonzalez
Sae Lor SUKSANT
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp
Priority to PCT/US2014/062659
Publication of WO2016068883A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Definitions

  • the data may be organized in a database.
  • a relational database in which data is stored in tables.
  • a given table defines a relation among the data stored in the table; and relations may also exist among tables of the relational database.
  • a graph database which is based on a graph structure having nodes, properties and edges.
  • the nodes represent entities, and the properties are pertinent information that relate to the nodes and the edges.
  • the edges are the lines that connect nodes; and a given edge represents a relationship between connected nodes.
  • FIG. 1 is a schematic diagram illustrating a query processing system according to an example implementation.
  • FIG. 2 is a flow diagram depicting a technique to entity anonymize a result for a query that spans a multiplex graph according to an example implementation.
  • FIG. 3 is a flow diagram depicting a technique to controllably distort a multiplex graph according to an example implementation.
  • Fig. 4 is an illustration of node replacement according to an example implementation.
  • Fig. 5 is a flow diagram depicting a technique to synthesize replacement nodes for a multiplex graph according to an example implementation.
  • FIG. 6 is a schematic diagram of a query processing system illustrating a physical machine of the system according to a further example implementation.
  • Graph database technology is ever increasingly being used by enterprises for tasks that involve searching connections among entities, businesses or any other items.
  • an enterprise may access a graph database for purposes of serving up online recommendations to millions of Internet users, managing master data hierarchies, or routing millions of packages per day in real time.
  • a graph database has a structure that is based on graph theory.
  • a graph may include nodes, properties and edges.
  • the nodes represent entities, such as people, businesses, accounts or any other item that is tracked.
  • the properties may include information that relates to the nodes and to the edges.
  • the edges include lines that connect the nodes to other nodes, and in general, an edge may represent a relationship between a given node and another node.
  • a query engine may traverse the nodes and edges of the graph.
  • a query refers to a request for information and may include, for example, a statement requesting information from a database.
  • Data contained in a given database may be relatively sensitive. As such, the data may be processed to anonymize the data. Anonymization of the data may prevent queries of the data from revealing information about participants from which the data was collected.
  • a graph database may store information pertaining to a census, a survey, human resource information, media records, and so forth.
  • a given graph may have one or more relationships to another graph.
  • a set of graphs, in which certain nodes of one graph represent real world entities in at least one of the other graphs, is called a "multiplex graph."
  • the multiplex graph may contain multiplex edges, in which an individual multiplex edge may represent a relationship between various nodes of the set of graphs. For example, a multiplex edge may represent a relationship between a first node of a first graph and a second node of a second graph.
  • a multiplex graph may be controllably distorted for purposes of preventing a query of the multiplex graph from revealing information about participants from which the data was collected.
  • a given multiplex graph may include a set of individual graphs, where an individual graph may represent information related to a particular field, of a particular scope, and/or otherwise representing a particular subsection of data.
  • the set of graphs associated with a given multiplex may include graphs of varying scope and with varying data fields.
  • An example query of a multiplex graph may be directed to a particular multiplex graph that contains the following individual graphs: a graph that is derived from a business enterprise's employee profile database (for example, a database containing information such as employee names, social security numbers, home addresses, salaries, job titles, and so forth); and a graph derived from a health services database, which also contains information for one or multiple employees of the business organization.
  • Processing the query involves identifying entities (e.g., employees) that are represented by the nodes in these individual graphs and creating links across the graphs (i.e., multiplex edges).
  • a given entity may be represented by corresponding nodes that appear in multiple graphs of the set of graphs forming the multiplex graph.
  • a query that spans a multiplex graph may potentially reveal sensitive personal information, such as, for example, medical histories of employees of the business organization.
  • Entity anonymization refers to a query processing technique that, in general, reduces the likelihood that the query reveals that a given entity is the same across two or more graphs of a multiplex graph.
  • an initial multiplex graph to be spanned by the query is first controllably distorted so that the query actually traverses the distorted multiplex graph instead of the initial multiplex graph.
  • a query processing system 100 may include a query engine 150 that provides a query result 160 in response to a query 105.
  • the query 105 spans an example multiplex graph 120 in that processing the query generally involves traversing the multiplex graph 120.
  • the multiplex graph 120 contains multiple graphs 124, which may be, in general, interlinked by multiplex edges (not shown) that define relationships between nodes 126 of the graphs 124.
  • the query processing system 100 may replace the initial multiplex graph 120 with a transformed multiplex graph 130 for processing by the query engine 150.
  • the query processing system may include an anonymization engine 110 that controllably distorts the multiplex graph 120 to produce the transformed multiplex graph 130 for purposes of performing entity anonymization.
  • the transformed multiplex graph 130 may contain a set of interlinked graphs 134; and as described herein, the anonymization engine 110, in the controllable distortion, may preserve the topology of the initial multiplex graph 120, such as the motifs (subgraph patterns, for example), numbers of outgoing edges, clustering coefficients, and/or other topological properties of the multiplex graph 120. Although the topological properties may be preserved, the anonymization engine 110, in the controllable distortion, may change graph properties in a manner that prohibits, or at least reduces, the likelihood that the query result 160 reveals that a given entity exists across multiple graphs 124.
  • Fig. 2 depicts a technique to perform entity anonymization in accordance with example implementations.
  • the technique 200 includes receiving (block 204) data representing a given multiplex graph, such as receiving data that represents the multiplex graph 120 of Fig. 1, for example.
  • the technique 200 includes entity anonymizing (block 208) the result for a query directed to the given multiplex graph to reduce the likelihood that the result of the query reveals an entity that is common to multiple graphs of the given multiplex graph, such as, for example, the use of an entity anonymization engine (the use of anonymization engine 110 of Fig. 1, for example).
  • This entity anonymization includes processing the data to controllably distort the given multiplex graph to produce another multiplex graph (such as multiplex graph 130 of Fig. 1, for example) to be processed for the query in place of the given multiplex graph.
  • controllable distortion of the multiplex graph may be performed in many different ways, depending on the particular implementation.
  • the controllable distortion may be performed by one or more of the following: node replacement, multiplex edge strength modulation and multiplex edge addition.
  • Fig. 3 depicts a technique 300 for controllably distorting a given multiplex graph for purposes of entity anonymization, according to example implementations.
  • nodes of the given multiplex graph are replaced (block 304) with nodes that are determined to be statistically and/or topologically similar to the replaced nodes, using any of a number of techniques, as can be appreciated by one of ordinary skill in the art.
  • the strength(s) of one or multiple multiplex edges are modulated, pursuant to block 308.
  • the multiplex edge(s) selected for modulation may be all multiplex edge(s) between nodes representing common entities, may be a predefined number of such multiplex edge(s); or the multiplex edges that are modulated may be selected based on other criteria.
  • the strength of a multiplex edge refers to the degree to which the nodes connected by the edge are related and may be, as examples, a binary indication that denotes the existence/non-existence of a relationship or a probability that represents the likelihood that a relationship exists.
  • the modulation of the edge strength refers to changing the edge strength, such as through removing the edge (assigning a strength of zero), increasing the edge strength to represent a stronger relationship or decreasing the edge strength to represent a weaker relationship.
  • one or multiple multiplex edges may be added, pursuant to block 312, for purposes of creating a noise to further obscure identification of entities that are common to multiple graphs of the multiplex graph.
  • multiplex edge(s) may be created if similar nodes do not exist due to the size of the multiplex graph or the sizes of the graphs forming the multiplex graph (based on size thresholds represented by data (administrator data, for example) stored in the query processing system, for example).
  • The examples of Fig. 3 are not limiting. One or more techniques described in Fig. 3 may be used to controllably distort a multiplex graph for entity anonymization. Further, other and/or different ways may be used to controllably distort a multiplex graph for entity anonymization, such as ways that are not in Fig. 3. Thus, many implementations are contemplated, which are within the scope of the appended claims.
  • Fig. 4 depicts an example illustration 400 of node replacement for purposes of controllably distorting a multiplex graph.
  • an employee profile graph 404 contains various nodes pertaining to employees of a given organization.
  • the employee profile graph 404 includes an example node A 408.
  • the employee profile graph 404 may represent various attributes of employees of the business organization, such as employee salaries, social security numbers, residential addresses, job titles, and so forth.
  • node A 408 represents an employee of the organization that is also a member of a casino and as such, appears as node A' 424 of a casino membership graph 420.
  • the two graphs 404 and 420 for illustration 400 form a multiplex graph, in that a multiplex edge 416 links the two nodes A 408 and A' 424 together.
  • node A' 424, although associated with the same entity, has different attributes than node A 408, such as points accumulated at the casino, length of casino membership, observed betting limits, and so forth.
  • the anonymization engine 110 applies a transformation 430 (i.e., a controlled distortion) to change the multiplex graph containing the employee profile graph 404 and casino membership graph 420 into a transformed multiplex graph.
  • the employee profile graph 440 contains a replacement node A_R 444, which is a replacement of the node A 408, and the casino membership graph 460 contains a replacement node A_R' 464, which is a replacement of the node A' 424.
  • a multiplex edge 450 exists between the replacement nodes 444 and 464. Due to the transformation 430, entity identification among the original and transformed multiplex graphs is obscured, as the replacement nodes 444 and 464 have different associated identities.
  • Fig. 5 depicts a technique 500 to perform node replacement according to example implementations. Referring to Fig. 5, for the technique 500, the anonymization engine 110 may consult an entity directory for purposes of identifying node replacements.
  • the entity directory may be, for example, a table that is previously constructed by the entity anonymization engine 110, and the table may be indexed by the entities of the multiplex graph. In this manner, for an individual entity index, the table may contain a set of similar entities that may be used in replacement of the entity that is represented by the index.
  • the entity anonymization engine 110 may construct the entity directory by identifying similar entities using data related to the topology of the multiplex graph, such as, for example, the number of outgoing edges, the local motifs, clustering coefficients, and so forth.
  • the entity anonymization engine 110 may identify similar nodes using a spectral analysis of the neighborhood of the nodes of the multiplex graph.
  • a Laplacian of the first or second relationships may be determined in this spectral analysis.
  • the entity anonymization engine 110 may identify entity replacements for a given entity using topography analysis, spectral analysis, a combination of topography and spectral analyses and/or other analyses.
  • a synthesized node may end up replacing a node in the multiplex graph.
  • a replacing node may be derived from a similar node in other graphs, may be derived from a node in the same graph but using other graphs in the multiplex graph and/or may be otherwise derived.
  • other similarity techniques that do not take the connectedness properties of the nodes of the graph into account may be used to identify entity replacements, such as a technique that computes an edit, Hamming, Jaccard or cosine distance on the attributes of the nodes/edges, or a technique that performs locality-sensitive hashing followed by clustering of similar items.
  • the entity directory may be used (block 504) to retrieve candidate similar nodes for the next node to be replaced.
  • the anonymization engine 110 may replace all of the nodes of the multiplex graph, nodes whose represented entities are present in more than one graph of the multiplex graph, or nodes selected according to other criteria.
  • Although Fig. 5 logically depicts the node replacement occurring in a serial sequence, multiple nodes may be replaced in parallel, in accordance with further example implementations.
  • the anonymization engine 110 may apply several preferences (configurable preferences selected as configurable options by an administrator, preferences that are always applied by default by the engine 110, a combination of default and configuration preferences, and so forth) for purposes of selecting a particular candidate node for the replacement.
  • the technique 500 may include applying (block 508) a preference to identify candidate similar nodes that have relationships with respective graphs in which the nodes will be used.
  • candidate nodes of the graph 404 may be preferred; and when determining the replacement node A_R' 464, candidate nodes of the graph 420 may be preferred.
  • the technique 500 may further apply (block 512) a preference to identify candidate similar nodes that do not involve nodes being replaced. In other words, preference is given to nodes that are not connected to the nodes being replaced, in accordance with example implementations.
  • the technique 500 includes selecting (block 514) the candidate similar node and then synthesizing (block 516) the replacement nodes based on the selected similar node.
  • synthesis of the replacement node includes modifying the selected similar node to have similar non-identifying attributes as the original node that is being replaced.
  • control returns back to block 504.
  • the replacement node may be synthesized based at least in part on one identified candidate node or based on multiple identified candidate nodes.
  • Fig. 5 depicts sequential node replacement, although node replacement may occur in parallel and in many different forms, depending on the particular implementation. Thus, many implementations are contemplated, which are within the scope of the appended claims.
  • the entity anonymization engine 110 may modulate the strength(s) of multiplex edge(s) (i.e., pursuant to block 308 of Fig. 3) as follows.
  • the entity anonymization engine 110 may selectively remove a given multiplex edge based on an indicated desire of the underlying databases to be connected to other databases.
  • a given database may store metadata that represents a number or another value, which indicates a degree of anonymity for the database. For example, the casino membership graph 460 may indicate a relatively high preference for social networks and thus, express a relatively low desire for anonymity; but the employee profile graph 440 may express a relatively low desire for social networks and as such, may have a relatively high desire for confidentiality.
  • the resulting logical combination of the two desires may result in the anonymization engine 110 removing example multiplex edge 416 (as well as possibly other multiplex edges connecting graphs 404 and 420) in the transformed multiplex graph.
  • the anonymization engine 110 may not entirely remove a given edge.
  • the anonymization engine 110 may modulate multiplex edge strength by modulating an edge probability, as the modulation also reduces the probability of the entity showing up in ranked results.
  • the anonymization engine 110 may create new multiplex edges based on an analysis of local properties of the nodes that are involved if the multiplex edges are formed.
  • the anonymization engine 110 may identify involved nodes from the above-described entity directory, and new multiplex edges may be created across nodes spanning similar edges. This feature adds noise to the query result, which preserves the privacy of some graph entities.
  • the above-described policies may be applied in a supervised environment in that the anonymization engine 110 may provide suggestions (via a graphical user interface (GUI), for example) and may receive confirmation from a human that may validate the anonymization being performed on a per query basis.
  • the anonymization engine 110 may run with little or no human supervision and thus, may automatically transform the multiplex graph.
  • the changes that are made by the anonymization engine 110 may be performed at query time, in that the anonymization engine 110 may create the entity directory before query time but the anonymization itself occurs at query time.
  • the anonymization engine 110 may perform the anonymization before any query is issued. While performing the changes at query time accommodates dynamic changes in the multiplex graph, performing the anonymization before any query is issued may result in reduced latencies for users.
  • the replacement of the initial multiplex graph with the transformed graph is transparent to the query engine 150.
  • the query engine 150 processes the query 105 normally and, in general, is unaware of the fact that the multiplex graph has been altered.
  • the query processing system 100 may or may not be executed in the form of one or multiple guest virtual machines (VMs). Regardless of whether virtual machines are employed or not, the query processing system 100 executes on a physical machine platform, as represented by physical machine 600 in Fig. 6.
  • the physical machine 600 is an actual machine that is made up of actual hardware 610 and actual machine executable instructions 660.
  • the hardware 610 may include one or more central processing units (CPUs) 614, memory 616 (non-volatile memory and volatile memory, for example) and one or multiple network interfaces 620.
  • the machine executable instructions 660 may include instructions that, when executed by one or more of the CPU(s) 614, may form the query engine 150, anonymization engine 110, an operating system 664, one or multiple device drivers 668, and so forth.
  • Although Fig. 6 depicts the physical machine 600 as being contained in a single box, or rack, it is noted that the physical machine 600 may be formed from multiple boxes or racks. Moreover, the physical machine 600 may be, in accordance with example implementations, a distributed processing system that is physically located at different geographical locations. Thus, many implementations are contemplated, which are within the scope of the appended claims.
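
The node replacement flow described above for technique 500 (consult an entity directory, apply preferences, select a candidate, synthesize a replacement) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the publication: the function names, the node record layout (`graph`, `identity`, `attrs`), the 0.5 threshold, and the choice of Jaccard similarity on attribute sets as the similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets (1.0 means identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def build_entity_directory(nodes, threshold=0.5):
    """Map each node id to the ids of other nodes with similar attributes."""
    directory = {}
    for node_id, info in nodes.items():
        directory[node_id] = [
            other for other, other_info in nodes.items()
            if other != node_id
            and jaccard(info["attrs"], other_info["attrs"]) >= threshold
        ]
    return directory


def pick_candidate(node_id, nodes, edges, directory):
    """Prefer candidates from the same graph that are not linked to the node."""
    def rank(candidate):
        same_graph = nodes[candidate]["graph"] == nodes[node_id]["graph"]
        linked = (node_id, candidate) in edges or (candidate, node_id) in edges
        # False sorts before True, so preferred candidates come first.
        return (not same_graph, linked, candidate)

    candidates = sorted(directory[node_id], key=rank)
    return candidates[0] if candidates else None


def synthesize_replacement(node_id, candidate, nodes):
    """Keep the candidate's identity but copy the original's non-identifying attributes."""
    return {
        "identity": nodes[candidate]["identity"],
        "graph": nodes[node_id]["graph"],
        "attrs": set(nodes[node_id]["attrs"]),
    }


# Hypothetical data: three employee nodes and one casino member node.
nodes = {
    "A": {"graph": "employee_profile", "identity": "alice",
          "attrs": {"engineer", "band-3", "full-time"}},
    "B": {"graph": "employee_profile", "identity": "bob",
          "attrs": {"engineer", "band-3", "part-time"}},
    "C": {"graph": "employee_profile", "identity": "carol",
          "attrs": {"engineer", "band-3", "full-time", "remote"}},
    "D": {"graph": "casino_membership", "identity": "dan",
          "attrs": {"engineer", "band-3", "full-time"}},
}
edges = {("A", "C")}  # A and C are directly connected

directory = build_entity_directory(nodes)
chosen = pick_candidate("A", nodes, edges, directory)
replacement = synthesize_replacement("A", chosen, nodes)
print(chosen, replacement["identity"])
```

In this sketch the directory is built ahead of time and only looked up during replacement, which mirrors the statement that the entity directory may be created before query time. A topological or spectral similarity measure could be substituted for `jaccard` without changing the rest of the flow.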

Abstract

An example technique includes receiving data representing a given multiplex graph. The example technique entity anonymizes a result for a query directed to the given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph. Entity anonymizing the result may include processing the data in a processor-based machine to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.

Description

ENTITY ANONYMIZATION FOR A QUERY DIRECTED TO A MULTIPLEX GRAPH
Background
[0001] For purposes of enhancing the retrieval and storage of large volumes of data, the data may be organized in a database. One type of database is a relational database in which data is stored in tables. In the relational database, a given table defines a relation among the data stored in the table; and relations may also exist among tables of the relational database. Another type of database is a graph database, which is based on a graph structure having nodes, properties and edges. The nodes represent entities, and the properties are pertinent information that relate to the nodes and the edges. The edges are the lines that connect nodes; and a given edge represents a relationship between connected nodes.
Brief Description of the Drawings
[0002] Fig. 1 is a schematic diagram illustrating a query processing system according to an example implementation.
[0003] Fig. 2 is a flow diagram depicting a technique to entity anonymize a result for a query that spans a multiplex graph according to an example implementation.
[0004] Fig. 3 is a flow diagram depicting a technique to controllably distort a multiplex graph according to an example implementation.
[0005] Fig. 4 is an illustration of node replacement according to an example implementation.
[0006] Fig. 5 is a flow diagram depicting a technique to synthesize replacement nodes for a multiplex graph according to an example implementation.
[0007] Fig. 6 is a schematic diagram of a query processing system illustrating a physical machine of the system according to a further example implementation.
Detailed Description
[0008] Graph database technology is ever increasingly being used by enterprises for tasks that involve searching connections among entities, businesses or any other items. As examples, an enterprise may access a graph database for purposes of serving up online recommendations to millions of Internet users, managing master data hierarchies, or routing millions of packages per day in real time.
[0009] A graph database has a structure that is based on graph theory. In this manner, a graph may include nodes, properties and edges. The nodes represent entities, such as people, businesses, accounts or any other item that is tracked. The properties may include information that relates to the nodes and to the edges. The edges include lines that connect the nodes to other nodes, and in general, an edge may represent a relationship between a given node and another node.
[0010] The connections and interconnections of nodes and edges and the properties of the nodes and edges often reveal meaningful patterns. For purposes of searching a given graph in response to a query, a query engine may traverse the nodes and edges of the graph. A query refers to a request for information and may include, for example, a statement requesting information from a database.
[0011] Data contained in a given database may be relatively sensitive. As such, the data may be processed to anonymize the data. Anonymization of the data may prevent queries of the data from revealing information about participants from which the data was collected. As examples, a graph database may store information pertaining to a census, a survey, human resource information, media records, and so forth.
[0012] A given graph may have one or more relationships to another graph. A set of graphs, in which certain nodes of one graph represent real world entities in at least one of the other graphs, is called a "multiplex graph." The multiplex graph may contain multiplex edges, in which an individual multiplex edge may represent a relationship between various nodes of the set of graphs. For example, a multiplex edge may represent a relationship between a first node of a first graph and a second node of a second graph.
[0013] In accordance with example systems and techniques that are disclosed herein, a multiplex graph may be controllably distorted for purposes of preventing a query of the multiplex graph from revealing information about participants from which the data was collected.
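
As a rough illustration of the structure described above, a multiplex graph can be modeled as a set of named graphs plus multiplex edges whose endpoints live in different graphs. The following Python sketch is only one possible representation; the class names, field names and the `strength` value on multiplex edges are assumptions made for illustration and do not come from the publication.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """One individual graph: nodes with properties plus directed edges."""
    name: str
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: set = field(default_factory=set)    # (source id, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, target):
        self.edges.add((source, target))


@dataclass
class MultiplexGraph:
    """A set of graphs plus multiplex edges linking nodes across graphs."""
    graphs: dict = field(default_factory=dict)           # graph name -> Graph
    multiplex_edges: dict = field(default_factory=dict)  # endpoints -> strength

    def add_graph(self, graph):
        self.graphs[graph.name] = graph

    def add_multiplex_edge(self, graph_a, node_a, graph_b, node_b, strength=1.0):
        # Each endpoint is a (graph name, node id) pair; both must exist.
        for graph_name, node_id in ((graph_a, node_a), (graph_b, node_b)):
            if node_id not in self.graphs[graph_name].nodes:
                raise KeyError(f"node {node_id!r} not in graph {graph_name!r}")
        self.multiplex_edges[((graph_a, node_a), (graph_b, node_b))] = strength

    def shared_entities(self):
        """Return the endpoint pairs that a cross-graph query could correlate."""
        return [pair for pair, s in self.multiplex_edges.items() if s > 0]


# Hypothetical example: one entity appears in two graphs.
employees = Graph("employee_profile")
employees.add_node("A", job_title="engineer")
health = Graph("health_services")
health.add_node("A'", plan="standard")

mg = MultiplexGraph()
mg.add_graph(employees)
mg.add_graph(health)
mg.add_multiplex_edge("employee_profile", "A", "health_services", "A'")
print(mg.shared_entities())
```

The pairs returned by `shared_entities` are exactly what entity anonymization aims to obscure: they tie a node in one graph to the node for the same real world entity in another.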
[0014] A given multiplex graph may include a set of individual graphs, where an individual graph may represent information related to a particular field, of a particular scope, and/or otherwise representing a particular subsection of data. In some examples, the set of graphs associated with a given multiplex graph may include graphs of varying scope and with varying data fields.
[0015] An example query of a multiplex graph may be directed to a particular multiplex graph that contains the following individual graphs: a graph that is derived from a business enterprise's employee profile database (for example, a database containing information such as employee names, social security numbers, home addresses, salaries, job titles, and so forth); and a graph derived from a health services database, which also contains information for one or multiple employees of the business organization.
[0016] Processing the query involves identifying entities (e.g., employees) that are represented by the nodes in these individual graphs and creating links across the graphs (i.e., multiplex edges). A given entity may be represented by corresponding nodes that appear in multiple graphs of the set of graphs forming the multiplex graph. If not for measures discussed herein, a query that spans a multiplex graph, such as the one described above, may potentially reveal sensitive personal information, such as, for example, medical histories of employees of the business organization.
[0017] Techniques and systems are disclosed herein to "entity anonymize" the result of a query that spans, or traverses, a multiplex graph. Entity anonymization refers to a query processing technique that, in general, reduces the likelihood that the query reveals that a given entity is the same across two or more graphs of a multiplex graph. In accordance with example systems and techniques that are discussed herein, an initial multiplex graph to be spanned by the query is first controllably distorted so that the query actually traverses the distorted multiplex graph instead of the initial multiplex graph. Referring to Fig. 1, in accordance with an example implementation, a query processing system 100 may include a query engine 150 that provides a query result 160 in response to a query 105.
[0018] For the example that is depicted in Fig. 1, the query 105 spans an example multiplex graph 120 in that processing the query generally involves traversing the multiplex graph 120. As illustrated in Fig. 1, the multiplex graph 120 contains multiple graphs 124, which may be, in general, interlinked by multiplex edges (not shown) that define relationships between nodes 126 of the graphs 124.
[0019] For purposes of entity anonymizing the query result 160, the query processing system 100 may replace the initial multiplex graph 120 with a transformed multiplex graph 130 for processing by the query engine 150. The query processing system may include an anonymization engine 110 that controllably distorts the multiplex graph 120 to produce the transformed multiplex graph 130 for purposes of performing entity anonymization.
[0020] Similar to the multiplex graph 120, the transformed multiplex graph 130 may contain a set of interlinked graphs 134; and as described herein, the anonymization engine 110, in the controllable distortion, may preserve the topology of the initial multiplex graph 120, such as the motifs (subgraph patterns, for example), numbers of outgoing edges, clustering coefficients, and/or other topological properties of the multiplex graph 120. Although the topological properties may be preserved, the anonymization engine 110, in the controllable distortion, may change graph properties in a manner that prohibits, or at least reduces, the likelihood that the query result 160 reveals that a given entity exists across multiple graphs 124.
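
To make the idea of a topology-preserving distortion concrete, the following Python sketch computes two of the properties named above (numbers of outgoing edges and clustering coefficients) and checks that they survive a change of node identities. It is a simplified illustration under stated assumptions: the helper names are hypothetical, the graph is a plain set of directed edges, the clustering coefficient is the standard undirected local definition, and the "distortion" is a bare relabeling rather than the replacement procedure of the publication.

```python
from itertools import combinations


def out_degrees(edges):
    """Map each node to its number of outgoing edges."""
    degrees = {}
    for source, target in edges:
        degrees[source] = degrees.get(source, 0) + 1
        degrees.setdefault(target, 0)
    return degrees


def clustering_coefficient(edges, node):
    """Local clustering coefficient of `node`, ignoring edge direction."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # self-loops do not contribute to clustering
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nbrs = neighbors.get(node, set())
    if len(nbrs) < 2:
        return 0.0
    # Count neighbor pairs that are themselves connected.
    links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in neighbors[u])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def relabel(edges, mapping):
    """Swap node identities while keeping every edge in place."""
    return {(mapping[a], mapping[b]) for a, b in edges}


# Hypothetical graph and identity mapping.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")}
mapping = {"a": "w", "b": "x", "c": "y", "d": "z"}
distorted = relabel(edges, mapping)

same_degrees = sorted(out_degrees(edges).values()) == sorted(out_degrees(distorted).values())
same_clustering = clustering_coefficient(edges, "a") == clustering_coefficient(distorted, "w")
print(same_degrees, same_clustering)
```

A check of this kind could serve as a sanity test after distortion: if the degree sequence or the clustering coefficients change, the transformed graph no longer has the same shape as the original.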
[0021] Fig. 2 depicts a technique to perform entity anonymization in accordance with example implementations. Referring to Fig. 2, the technique 200 includes receiving (block 204) data representing a given multiplex graph, such as data that represents the multiplex graph 120 of Fig. 1, for example. The technique 200 includes entity anonymizing (block 208) the result for a query directed to the given multiplex graph to reduce the likelihood that the result of the query reveals an entity that is common to multiple graphs of the given multiplex graph, such as, for example, through the use of an entity anonymization engine (the anonymization engine 110 of Fig. 1, for example). This entity anonymization includes processing the data to controllably distort the given multiplex graph to produce another multiplex graph (such as multiplex graph 130 of Fig. 1, for example) to be processed for the query in place of the given multiplex graph.
[0022] The controllable distortion of the multiplex graph may be performed in many different ways, depending on the particular implementation. In accordance with example implementations, the controllable distortion may be performed by one or more of the following: node replacement, multiplex edge strength modulation and multiplex edge addition. Fig. 3 depicts a technique 300 for controllably distorting a given multiplex graph for purposes of entity anonymization, according to example implementations.
[0023] Pursuant to the technique 300, nodes of the given multiplex graph are replaced (block 304) with nodes that are determined to be statistically and/or topologically similar to the replaced nodes, using any of a number of techniques, as can be appreciated by one of ordinary skill in the art.
[0024] In some examples, according to the technique 300, the strength(s) of one or multiple multiplex edges are modulated, pursuant to block 308. Depending on the implementation, the multiplex edge(s) selected for modulation may be all multiplex edge(s) between nodes representing common entities; may be a predefined number of such multiplex edge(s); or may be selected based on other criteria. The strength of a multiplex edge refers to the degree to which the nodes connected by the edge are related and may be, as examples, a binary indication that denotes the existence/non-existence of a relationship or a probability that represents the likelihood that a relationship exists. The modulation of the edge strength refers to changing the edge strength, such as through removing the edge (assigning a strength of zero), increasing the edge strength to represent a stronger relationship or decreasing the edge strength to represent a weaker relationship.
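As an illustration of the edge strength modulation of block 308, the following is a minimal sketch that assumes the strength is a probability in the range [0, 1]. The function name `modulate_strength` and the multiplicative `factor` parameter are hypothetical and are not taken from the disclosure.

```python
def modulate_strength(strength: float, factor: float) -> float:
    """Scale a multiplex edge strength and clamp the result to [0, 1].

    A factor of 0 removes the edge (strength of zero), a factor below 1
    weakens the relationship, and a factor above 1 strengthens it.
    The multiplicative rule is an assumption made for illustration.
    """
    if factor < 0:
        raise ValueError("factor must be non-negative")
    return max(0.0, min(1.0, strength * factor))


removed = modulate_strength(0.8, 0.0)       # edge removal
weakened = modulate_strength(0.8, 0.5)      # weaker relationship
strengthened = modulate_strength(0.8, 2.0)  # stronger relationship, clamped
```

For a binary strength, the same function degenerates to keeping the edge (factor of 1) or removing it (factor of 0).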
[0025] Also, in accordance with example implementations, one or multiple multiplex edges may be added, pursuant to block 312, for purposes of creating noise to further obscure identification of entities that are common to multiple graphs of the multiplex graph. In accordance with example implementations, multiplex edge(s) may be created if similar nodes do not exist due to the size of the multiplex graph or the sizes of the graphs forming the multiplex graph (based on size thresholds represented by data (administrator data, for example) stored in the query processing system, for example).
[0026] The examples of Fig. 3 are not limiting. One or more techniques described in Fig. 3 may be used to controllably distort a multiplex graph for entity anonymization. Further, other and/or different ways may be used to controllably distort a multiplex graph for entity anonymization, such as ways that are not in Fig. 3. Thus, many implementations are contemplated, which are within the scope of the appended claims.
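One way to picture the controllable distortion of Fig. 3 is as a sequence of steps applied to a copy of the multiplex graph, so that the initial multiplex graph is left intact. The following is a minimal sketch only; the `MultiplexGraph` layout, the `controllably_distort` function and the toy `halve_strengths` step are hypothetical names introduced for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MultiplexGraph:
    # Graph name -> {node -> set of neighbor nodes within that graph}.
    layers: dict = field(default_factory=dict)
    # ((graph, node), (graph, node)) -> multiplex edge strength in [0, 1].
    multiplex_edges: dict = field(default_factory=dict)


def controllably_distort(graph, steps):
    """Apply each distortion step in order to a deep copy of the graph."""
    distorted = copy.deepcopy(graph)
    for step in steps:
        distorted = step(distorted)
    return distorted


def halve_strengths(graph):
    """Toy edge strength modulation step: weaken every multiplex edge."""
    graph.multiplex_edges = {
        edge: strength / 2 for edge, strength in graph.multiplex_edges.items()
    }
    return graph


initial = MultiplexGraph(
    layers={"employee_profile": {"A": set()}, "casino_membership": {"A'": set()}},
    multiplex_edges={(("employee_profile", "A"), ("casino_membership", "A'")): 1.0},
)
transformed = controllably_distort(initial, [halve_strengths])
```

Node replacement and multiplex edge addition could be added as further steps with the same signature.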
[0027] Fig. 4 depicts an example illustration 400 of node replacement for purposes of controllably distorting a multiplex graph. For this example, an employee profile graph 404 contains various nodes pertaining to employees of a given organization. For this example, the employee profile graph 404 includes an example node A 408. The employee profile graph 404 may represent various attributes of employees of the business organization, such as employee salaries, social security numbers, residential addresses, job titles, and so forth.
[0028] The employees may have relationships that link the employees to other graphs. For the example illustration 400, node A 408 represents an employee of the organization that is also a member of a casino and as such, appears as node A' 424 of a casino membership graph 420. Thus, the two graphs 404 and 420 for illustration 400 form a multiplex graph, in that a multiplex edge 416 links the two nodes A 408 and A' 424 together. For the casino membership graph 420, node A' 424, although associated with the same entity, has different attributes than node A 408, such as points accumulated at the casino, length of casino membership, observed betting limits, and so forth.
[0029] Continuing the example, the anonymization engine 110 (Fig. 1) applies a transformation 430 (i.e., a controlled distortion) to change the multiplex graph containing the employee profile graph 404 and casino membership graph 420 into a transformed multiplex graph that contains respective employee profile 440 and casino membership 460 graphs. As illustrated in Fig. 4, the employee profile graph 440 contains a replacement node AR 444, which is a replacement of the node A 408, and the casino membership graph 460 contains a replacement node AR' 464, which is a replacement of the node A' 424. As also depicted in Fig. 4, a multiplex edge 450 exists between the replacement nodes 444 and 464. Due to the transformation 430, entity identification among the original and transformed multiplex graphs is obscured, as the replacement nodes 444 and 464 have different associated identities.
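The node replacement of Fig. 4 can be sketched as a relabeling of the endpoints of the multiplex edges. This is an illustrative sketch, not the disclosed implementation; nodes are modeled as hypothetical `(graph name, node label)` tuples and `replace_nodes` is an assumed helper name.

```python
def replace_nodes(multiplex_edges, replacements):
    """Return new multiplex edges whose endpoints are swapped for replacements.

    Endpoints without an entry in `replacements` are kept unchanged, and the
    edge strengths are carried over, so the edge structure is preserved.
    """
    return {
        (replacements.get(u, u), replacements.get(v, v)): strength
        for (u, v), strength in multiplex_edges.items()
    }


# Node A of the employee profile graph and node A' of the casino graph.
A = ("employee_profile", "A")
A_PRIME = ("casino_membership", "A'")
original_edges = {(A, A_PRIME): 1.0}  # plays the role of multiplex edge 416

replacements = {
    A: ("employee_profile", "AR"),
    A_PRIME: ("casino_membership", "AR'"),
}
# The single resulting edge plays the role of multiplex edge 450.
transformed_edges = replace_nodes(original_edges, replacements)
```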
[0030] Fig. 5 depicts a technique 500 to perform node replacement according to example implementations. Referring to Fig. 5, for the technique 500, the anonymization engine 110 may consult an entity directory for purposes of identifying node replacements. The entity directory may be, for example, a table that is previously constructed by the entity anonymization engine 110, and the table may be indexed by the entities of the multiplex graph. In this manner, for an individual entity index, the table may contain a set of similar entities that may be used in replacement of the entity that is represented by the index.
[0031 ] .
[0032] In accordance with example implementations, the entity anonymization engine 110 may construct the entity directory by identifying similar entities using data related to the topology of the multiplex graph, such as, for example, the number of outgoing edges, the local motifs, clustering coefficients, and so forth.
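The construction of an entity directory from topological data can be sketched as follows. The sketch assumes an undirected graph stored as an adjacency dictionary and treats two nodes as similar when they have the same number of edges and clustering coefficients within a tolerance; that similarity rule, the `tolerance` parameter and the function names are assumptions made for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def build_entity_directory(adj, tolerance=0.1):
    """Map each node to a sorted list of topologically similar nodes."""
    features = {n: (len(adj[n]), clustering_coefficient(adj, n)) for n in adj}
    directory = {}
    for node, (degree, cc) in features.items():
        directory[node] = sorted(
            other
            for other, (other_degree, other_cc) in features.items()
            if other != node
            and other_degree == degree
            and abs(other_cc - cc) <= tolerance
        )
    return directory


# Toy graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
directory = build_entity_directory(adj)
```

In this toy graph, nodes a and b have the same degree and the same clustering coefficient, so each lists the other as a replacement candidate, while node d has the same degree but a different clustering coefficient and has no candidates.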
[0033] The entity anonymization engine 110 may identify similar nodes using a spectral analysis of the neighborhood of the nodes of the multiplex graph. In this regard, as an example, a Laplacian of the first or second relationships may be determined in this spectral analysis.
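A spectral comparison of node neighborhoods can be sketched without an eigenvalue solver by using spectral moments: the trace of the k-th power of the graph Laplacian L = D - A equals the sum of the k-th powers of its eigenvalues. The sketch below assumes that the neighborhood is the one-hop neighborhood of a node in an undirected graph without self-loops; the function names are hypothetical, and equal moments are only a necessary condition for equal spectra.

```python
def laplacian(adj, nodes):
    """Laplacian (degree minus adjacency) of the subgraph induced by `nodes`."""
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    lap = [[0] * size for _ in range(size)]
    for n in nodes:
        for m in adj[n]:
            if m in index:
                lap[index[n]][index[n]] += 1  # degree inside the subgraph
                lap[index[n]][index[m]] -= 1  # adjacency entry
    return lap


def spectral_moments(lap, max_power=3):
    """Traces of lap**1 .. lap**max_power (sums of eigenvalue powers)."""
    size = len(lap)
    power = [[int(i == j) for j in range(size)] for i in range(size)]
    moments = []
    for _ in range(max_power):
        power = [
            [sum(power[i][k] * lap[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)
        ]
        moments.append(sum(power[i][i] for i in range(size)))
    return tuple(moments)


def neighborhood_signature(adj, node):
    """Spectral moments of the node's one-hop neighborhood."""
    nodes = sorted({node} | set(adj[node]))
    return spectral_moments(laplacian(adj, nodes))


# Toy graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
signature_a = neighborhood_signature(adj, "a")
signature_b = neighborhood_signature(adj, "b")
signature_d = neighborhood_signature(adj, "d")
```

Nodes a and b both sit in a triangle and share a signature, whereas the neighborhood of node d is a three-node path and yields a different signature.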
[0034] Depending on the particular implementation, the entity anonymization engine 110 may identify entity replacements for a given entity using topology analysis, spectral analysis, a combination of topology and spectral analyses and/or other analyses.
[0035] In some examples, a synthesized node may end up replacing a node in the multiplex graph. In some examples, there may be a correlation between a number of graphs and a number of available nodes for replacement. For example, with a higher number of graphs in the multiplex graph, a higher number of nodes may be available for replacement. In general, a replacing node may be derived from a similar node in other graphs, may be derived from a node in the same graph but using other graphs in the multiplex graph and/or may be otherwise derived.
[0036] In accordance with further example implementations, other similarity techniques that do not take the connectedness properties of the nodes of the graph into account may be used to identify entity replacements, such as a technique that computes an edit, Hamming, Jaccard or cosine distance on the attributes of the nodes/edges or a technique that performs locality-sensitive hashing followed by clustering of similar items.
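The attribute-based distances named above can be sketched with the standard library alone. How node attributes are encoded (as sets, equal-length strings or numeric vectors) is an assumption made for illustration.

```python
import math


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: set, b: set) -> float:
    """One minus the ratio of shared items to all items."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cosine_distance(x, y) -> float:
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_x * norm_y)


# Hypothetical attribute encodings for two employee nodes.
tags_distance = jaccard_distance({"engineer", "london"}, {"engineer", "paris"})
flags_distance = hamming_distance("10110", "10011")
vector_distance = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Nodes whose distance falls below a chosen threshold could then be listed as similar entities in the entity directory.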
[0037] According to the technique 500, the entity directory may be used (block 504) to retrieve candidate similar nodes for the next node to be replaced. As examples, the anonymization engine 110 may replace all of the nodes of the multiplex graph, nodes whose represented entities are present in more than one graph of the multiplex graph, or nodes selected according to other criteria. Although Fig. 5 logically depicts the node replacement occurring in a serial sequence, multiple nodes may be replaced in parallel, in accordance with further example implementations.
[0038] In accordance with example implementations, the anonymization engine 110 may apply several preferences (configurable preferences selected as configurable options by an administrator, preferences that are always applied by default by the engine 110, a combination of default and configuration preferences, and so forth) for purposes of selecting a particular candidate node for the replacement. For example, the technique 500 may include applying (block 508) a preference to identify candidate similar nodes that have relationships with respective graphs in which the nodes will be used.
[0039] Referring back to Fig. 4, in this manner, in determining the replacement node AR 444, candidate nodes of the graph 404 may be preferred; and when determining the replacement node AR' 464, candidate nodes of the graph 420 may be preferred. The technique 500 may further apply (block 512) a preference to identify candidate similar nodes that do not involve nodes being replaced. In other words, preference is given to nodes that are not connected to the nodes being replaced, in accordance with example implementations.
[0040] Based at least in part on such preferences, the technique 500 includes selecting (block 514) the candidate similar node and then synthesizing (block 516) the replacement node based on the selected similar node. In accordance with example implementations, synthesis of the replacement node includes modifying the selected similar node to have similar non-identifying attributes as the original node that is being replaced. According to example implementations, pursuant to technique 500, responsive to another node being replaced (decision block 520), control returns back to block 504. In accordance with further example implementations, the replacement node may be synthesized based at least in part on one identified candidate node or based on multiple identified candidate nodes.
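The selection and synthesis steps of the technique 500 (blocks 508 through 516) can be sketched as follows. The `Node` layout, the additive scoring of the two preferences, the set of identifying attribute keys and the `_R` suffix are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical set of attribute keys treated as identifying.
IDENTIFYING_KEYS = {"name", "ssn"}


@dataclass
class Node:
    node_id: str
    graph: str
    attributes: dict = field(default_factory=dict)
    neighbors: set = field(default_factory=set)


def select_candidate(original, candidates):
    """Prefer same-graph candidates (block 508) not linked to the original (block 512)."""
    def score(candidate):
        same_graph = candidate.graph == original.graph
        unconnected = original.node_id not in candidate.neighbors
        return int(same_graph) + int(unconnected)
    return max(candidates, key=score)  # ties resolve to the earliest candidate


def synthesize_replacement(original, selected):
    """Keep the selected node's identity; copy the original's non-identifying attributes."""
    attributes = dict(selected.attributes)
    attributes.update(
        {k: v for k, v in original.attributes.items() if k not in IDENTIFYING_KEYS}
    )
    # The replacement keeps the original's neighbors so that topology is preserved.
    return Node(selected.node_id + "_R", original.graph, attributes, set(original.neighbors))


original = Node(
    "A", "employee_profile",
    {"name": "Alice", "ssn": "000-00-0000", "job_title": "engineer"}, {"B"},
)
candidates = [
    Node("B", "employee_profile", {"name": "Bob", "job_title": "manager"}, {"A"}),
    Node("C", "employee_profile", {"name": "Carol", "job_title": "analyst"}, {"D"}),
    Node("X", "casino_membership", {"name": "Xavier", "points": 10}, set()),
]
selected = select_candidate(original, candidates)
replacement = synthesize_replacement(original, selected)
```

Here candidate C is selected because it is in the same graph as node A and is not connected to node A; the synthesized node keeps the identity of C but takes the job title of A.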
[0041] It is noted that Fig. 5 depicts sequential node replacement, although node replacement may occur in parallel and in many different forms, depending on the particular implementation. Thus, many implementations are contemplated, which are within the scope of the appended claims.
[0042] Referring back to Fig. 3 in conjunction with Fig. 1, in accordance with example implementations, the entity anonymization engine 110 may modulate the strength(s) of multiplex edge(s) (i.e., pursuant to block 308 of Fig. 3) as follows. The entity anonymization engine 110 may selectively remove a given multiplex edge based on an indicated desire of the underlying databases to be connected to other databases. As an example, a given database may store metadata that represents a number or another value, which indicates a degree of anonymity for the database. For example, for Fig. 4, in associated metadata, the casino membership graph 460 may indicate a relatively high preference for social networks and thus, express a relatively low desire for anonymity; but the employee profile graph 440 may express a relatively low desire for social networks and as such, may have a relatively high desire for confidentiality. For this specific example, the resulting logical combination of the two desires may result in the anonymization engine 110 removing example multiplex edge 416 (as well as possibly other multiplex edges connecting graphs 404 and 420) in the transformed multiplex graph.
[0043] In accordance with further example implementations, the anonymization engine 110 may not entirely remove a given edge. In this manner, in accordance with example implementations, the anonymization engine 110 may modulate multiplex edge strength by modulating an edge probability, as the modulation also reduces the probability of the entity showing up in ranked results.
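The metadata-driven modulation described above can be sketched as follows. The disclosure does not specify how the two desires are logically combined, so the sketch assumes that each graph stores an anonymity desire in [0, 1], that the stricter (larger) desire wins, and that an edge is removed at or above a hypothetical `removal_threshold` and otherwise has its probability scaled down.

```python
def modulate_multiplex_edge(probability, desire_a, desire_b, removal_threshold=0.8):
    """Return the new edge probability given the two graphs' anonymity desires.

    Assumed rule: take the larger desire; remove the edge (probability 0.0)
    when it reaches the threshold, otherwise scale the probability by
    (1 - combined desire).
    """
    combined = max(desire_a, desire_b)
    if combined >= removal_threshold:
        return 0.0
    return probability * (1.0 - combined)


# Hypothetical metadata: the casino graph has a low desire for anonymity,
# the employee profile graph a high one, so the edge is removed.
removed = modulate_multiplex_edge(1.0, desire_a=0.2, desire_b=0.9)

# With a moderate desire, the edge is kept with a reduced probability.
weakened = modulate_multiplex_edge(0.9, desire_a=0.2, desire_b=0.5)
```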
[0044] For purposes of the anonymization engine 110 adding one or multiple multiplex edges (i.e., pursuant to block 312 of Fig. 3) to the transformed multiplex graph, the anonymization engine 110 may create new multiplex edges based on an analysis of local properties of the nodes that are involved if the multiplex edges are formed. The anonymization engine 110 may identify involved nodes from the above-described entity directory, and new multiplex edges may be created across nodes spanning similar edges. This feature adds noise to the query result, which preserves the privacy of some graph entities.
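The addition of noise edges can be sketched as follows, assuming the entity directory maps each `(graph name, node label)` tuple to its similar nodes and that multiplex edges are keyed by sorted endpoint pairs. The `rate`, `strength` and `seed` parameters are hypothetical; the disclosure does not state how many edges are added or with what strength.

```python
import random


def add_noise_edges(multiplex_edges, directory, rate, strength=0.5, seed=0):
    """Add multiplex edges between similar nodes that lie in different graphs.

    Each eligible pair is added with probability `rate`; existing edges are
    never overwritten. Edge keys are sorted endpoint pairs.
    """
    rng = random.Random(seed)
    result = dict(multiplex_edges)
    for node in sorted(directory):
        for similar in directory[node]:
            edge = tuple(sorted((node, similar)))
            crosses_graphs = node[0] != similar[0]
            if crosses_graphs and edge not in result and rng.random() < rate:
                result[edge] = strength
    return result


A = ("casino_membership", "A'")
B = ("employee_profile", "B")
C = ("employee_profile", "C")
existing = {(A, ("employee_profile", "A")): 1.0}
directory = {A: [B, C], B: [C]}

noisy = add_noise_edges(existing, directory, rate=1.0)
unchanged = add_noise_edges(existing, directory, rate=0.0)
```

With a rate of 1.0, every cross-graph pair of similar nodes gains an edge; the pair B and C is skipped because both nodes are in the same graph.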
[0045] Referring back to Fig. 1, depending on the particular implementation, the above-described policies may be applied in a supervised environment in that the anonymization engine 110 may provide suggestions (via a graphical user interface (GUI), for example) and may receive confirmation from a human that may validate the anonymization being performed on a per query basis. In further example implementations, the anonymization engine 110 may run with little or no human supervision and thus, may automatically transform the multiplex graph.
[0046] In accordance with example implementations, the changes that are made by the anonymization engine 110 may be performed at query time, in that the anonymization engine 110 may create the entity directory before query time but the anonymization itself occurs at query time. In further example implementations, the anonymization engine 110 may perform the anonymization before any query is issued. While performing the changes at query time accommodates dynamic changes in the multiplex graph, performing the anonymization before any query is issued may result in reduced latencies for users.
[0047] In accordance with example implementations, the replacement of the initial multiplex graph with the transformed graph is transparent to the query engine 150. In other words, after the anonymization engine 110 controllably distorts the multiplex graph, the query engine 150 processes the query 105 normally and, in general, is unaware of the fact that the multiplex graph has been altered.
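The transparency to the query engine, together with the choice between anonymizing at query time and anonymizing ahead of time, can be sketched as a thin wrapper. The class and parameter names are hypothetical, and the toy anonymization and query functions below stand in for the anonymization engine and the query engine.

```python
class QueryProcessingSystem:
    """Hands the query engine a transformed graph in place of the initial one."""

    def __init__(self, graph, anonymize, run_query, precompute=False):
        self._graph = graph
        self._anonymize = anonymize
        self._run_query = run_query
        # Anonymizing ahead of time trades freshness for lower query latency.
        self._cached = anonymize(graph) if precompute else None

    def query(self, query):
        if self._cached is not None:
            transformed = self._cached
        else:
            # Query-time anonymization picks up changes to the initial graph.
            transformed = self._anonymize(self._graph)
        # The query engine only ever sees the transformed graph.
        return self._run_query(query, transformed)


calls = {"anonymize": 0}


def toy_anonymize(graph):
    """Stand-in anonymization: drop every multiplex edge."""
    calls["anonymize"] += 1
    return {"edges": []}


def toy_run_query(query, graph):
    """Stand-in query engine: count the multiplex edges it can see."""
    return len(graph["edges"])


graph = {"edges": [("A", "A'")]}
system = QueryProcessingSystem(graph, toy_anonymize, toy_run_query)
result = system.query("count edges")
```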
[0048] Referring to Fig. 6, depending on the particular example implementation, the query processing system 100 may or may not be executed in the form of one or multiple guest virtual machines (VMs). Regardless of whether virtual machines are employed or not, the query processing system 100 executes on a physical machine platform, as represented by physical machine 600 in Fig. 6.
[0049] In this regard, the physical machine 600 is an actual machine that is made up of actual hardware 610 and actual machine executable instructions 660. As examples, the hardware 610 may include one or more central processing units (CPUs) 614, memory 616 (non-volatile memory and volatile memory, for example) and one or multiple network interfaces 620. The machine executable instructions 660 may include instructions that, when executed by one or more of the CPU(s) 614, may form the query engine 150, anonymization engine 110, an operating system 664, one or multiple device drivers 668, and so forth.
[0050] Although Fig. 6 depicts the physical machine 600 as being contained in a single box, or rack, it is noted that the physical machine 600 may be formed from multiple boxes or racks. Moreover, the physical machine 600 may be, in accordance with example implementations, a distributed processing system that is physically located at different geographical locations. Thus, many implementations are contemplated, which are within the scope of the appended claims.
[0051] While the present techniques have been described with respect to a number of embodiments, it will be appreciated that numerous modifications and variations may be applicable therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the scope of the present techniques.

Claims

What is claimed is:
1. A method comprising:
receiving data representing a first multiplex graph; and
entity anonymizing a result for a query directed to a given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph, wherein entity anonymizing the result comprises processing the data in a processor-based machine to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.
2. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises replacing at least one node of the given multiplex graph based at least in part on a node determined to be similar to the node being replaced.
3. The method of claim 2, further comprising identifying the at least one similar node based at least in part on a spectral analysis of the multiplex graph or an analysis that is based at least in part on a topology of the multiplex graph.
4. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises modulating at least one edge extending between graphs of the given multiplex graph.
5. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises selectively adding at least one edge between graphs of the first multiplex graph.
6. A system comprising:
a query engine to process a query directed to a given multiplex graph; and
an anonymization engine comprising a processor to controllably distort the given multiplex graph to provide another multiplex graph to be processed for the query by the query engine in place of the given multiplex graph.
7. The system of claim 6, wherein the anonymization engine replaces at least one node of the given multiplex graph based on a node identified by an entity directory as being similar to the node being replaced to controllably distort the given multiplex graph.
8. The system of claim 7, wherein the entity directory identifies multiple candidate similar nodes for the node being replaced, and the anonymization engine synthesizes the replacement node based at least in part on the identified candidate nodes.
9. The system of claim 6, wherein the anonymization engine modulates at least one multiplex edge of the given multiplex graph to controllably distort the first multiplex.
10. The system of claim 6, wherein the anonymization engine selectively adds at least one multiplex edge to the given multiplex graph to selectively distort the given multiplex graph.
11. A non-transitory computer readable storage medium to store instructions that when executed by a processor-based system cause the processor-based system to:
receive data representing a given multiplex graph; and
entity anonymize a result for a query directed to the given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph, the instructions to cause the processor-based system to process the data to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.
12. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises replacing at least one node of the given multiplex graph based at least in part on a node determined to be similar to the node being replaced.
13. The medium of claim 12, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to identify the at least one similar node based at least in part on a spectral analysis or topological analysis of the multiplex graph.
14. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises modulating at least one edge extending between graphs of the given multiplex graph.
15. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises selectively adding at least one edge between graphs of the given multiplex graph.
PCT/US2014/062659 2014-10-28 2014-10-28 Entity anonymization for a query directed to a multiplex graph WO2016068883A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2014/062659 WO2016068883A1 (en) 2014-10-28 2014-10-28 Entity anonymization for a query directed to a multiplex graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2014/062659 WO2016068883A1 (en) 2014-10-28 2014-10-28 Entity anonymization for a query directed to a multiplex graph

Publications (1)

Publication Number Publication Date
WO2016068883A1 true WO2016068883A1 (en) 2016-05-06

Family

ID=55857998

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/062659 WO2016068883A1 (en) 2014-10-28 2014-10-28 Entity anonymization for a query directed to a multiplex graph

Country Status (1)

Country Link
WO (1) WO2016068883A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040264697A1 (en) * 2003-06-27 2004-12-30 Microsoft Corporation Group security
EP2653984A1 (en) * 2012-04-18 2013-10-23 Software AG Method and system for anonymizing data during export
US20140143239A1 (en) * 2010-08-03 2014-05-22 Accenture Global Services Limited Database anonymization for use in testing database-centric applications
US20140278409A1 (en) * 2004-07-30 2014-09-18 At&T Intellectual Property Ii, L.P. Preserving privacy in natural langauge databases
US20140304825A1 (en) * 2011-07-22 2014-10-09 Vodafone Ip Licensing Limited Anonymization and filtering data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14904776

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14904776

Country of ref document: EP

Kind code of ref document: A1