WO2016068883A1

WO2016068883A1 - Entity anonymization for a query directed to a multiplex graph

Info

Publication number: WO2016068883A1
Application number: PCT/US2014/062659
Authority: WO
Inventors: Luis Miguel Vaquero Gonzalez; Sae Lor SUKSANT
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2016-05-06

Abstract

An example technique includes receiving data representing a given multiplex graph. The example technique entity anonymizes a result for a query directed to the given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph. Entity anonymizing the result may include processing the data in a processor-based machine to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.

Description

ENTITY ANONYMIZATION FOR A QUERY DIRECTED TO A MULTIPLEX GRAPH

Background

[0001] For purposes of enhancing the retrieval and storage of large volumes of data, the data may be organized in a database. One type of database is a relational database in which data is stored in tables. In the relational database, a given table defines a relation among the data stored in the table; and relations may also exist among tables of the relational database. Another type of database is a graph database, which is based on a graph structure having nodes, properties and edges. The nodes represent entities, and the properties are pertinent information that relates to the nodes and the edges. The edges are the lines that connect nodes; and a given edge represents a relationship between connected nodes.

Brief Description of the Drawings

[0002] Fig. 1 is a schematic diagram illustrating a query processing system according to an example implementation.

[0003] Fig. 2 is a flow diagram depicting a technique to entity anonymize a result for a query that spans a multiplex graph according to an example implementation.

[0004] Fig. 3 is a flow diagram depicting a technique to controllably distort a multiplex graph according to an example implementation.

[0005] Fig. 4 is an illustration of node replacement according to an example implementation.

[0006] Fig. 5 is a flow diagram depicting a technique to synthesize replacement nodes for a multiplex graph according to an example implementation.

[0007] Fig. 6 is a schematic diagram of a query processing system illustrating a physical machine of the system according to a further example implementation.

Detailed Description

[0008] Graph database technology is increasingly used by enterprises for tasks that involve searching connections among entities, businesses or any other items. As examples, an enterprise may access a graph database for purposes of serving up online recommendations to millions of Internet users, managing master data hierarchies, or routing millions of packages per day in real time.

[0009] A graph database has a structure that is based on graph theory. In this manner, a graph may include nodes, properties and edges. The nodes represent entities, such as people, businesses, accounts or any other item that is tracked. The properties may include information that relates to the nodes and to the edges. The edges include lines that connect the nodes to other nodes, and in general, an edge may represent a relationship between a given node and another node.

[0010] The connections and interconnections of nodes and edges and the properties of the nodes and edges often reveal meaningful patterns. For purposes of searching a given graph in response to a query, a query engine may traverse the nodes and edges of the graph. A query refers to a request for information and may include, for example, a statement requesting information from a database.

[0011] Data contained in a given database may be relatively sensitive. As such, the data may be processed to anonymize the data. Anonymization of the data may prevent queries of the data from revealing information about participants from which the data was collected. As examples, a graph database may store information pertaining to a census, a survey, human resource information, media records, and so forth.

[0012] A given graph may have one or more relationships to another graph. A set of graphs, in which certain nodes of one graph represent real world entities in at least one of the other graphs, is called a "multiplex graph." The multiplex graph may contain multiplex edges, in which an individual multiplex edge may represent a relationship between various nodes of the set of graphs. For example, a multiplex edge may represent a relationship between a first node of a first graph and a second node of a second graph.

[0013] In accordance with example systems and techniques that are disclosed herein, a multiplex graph may be controllably distorted for purposes of preventing a query of the multiplex graph from revealing information about participants from which the data was collected.
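As an illustration only, the following is a minimal sketch of one way a multiplex graph could be held in memory, assuming Python dataclasses. The names `NodeRef` and `MultiplexGraph`, the field layout, and the sample data are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeRef:
    """Identifies a node by the graph it belongs to and its local id."""
    graph: str
    node_id: str


@dataclass
class MultiplexGraph:
    # graph name -> {node id -> properties}
    graphs: dict = field(default_factory=dict)
    # multiplex edges: (NodeRef, NodeRef) -> strength in [0, 1]
    multiplex_edges: dict = field(default_factory=dict)

    def add_node(self, graph, node_id, **props):
        self.graphs.setdefault(graph, {})[node_id] = props
        return NodeRef(graph, node_id)

    def link(self, a, b, strength=1.0):
        """Add a multiplex edge between nodes of two different graphs."""
        if a.graph == b.graph:
            raise ValueError("a multiplex edge must connect different graphs")
        self.multiplex_edges[(a, b)] = strength

    def shared_entities(self):
        """Node pairs that a traversal would treat as the same entity."""
        return [pair for pair, s in self.multiplex_edges.items() if s > 0]


mg = MultiplexGraph()
a = mg.add_node("employee_profiles", "A", job_title="engineer")
a_prime = mg.add_node("health_services", "A'", plan="basic")
mg.link(a, a_prime)
```

In this sketch, `shared_entities()` returns exactly the cross-graph pairs that entity anonymization is meant to obscure.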

[0014] A given multiplex graph may include a set of individual graphs, where an individual graph may represent information related to a particular field, of a particular scope, and/or otherwise representing a particular subsection of data. In some examples, the set of graphs associated with a given multiplex may include graphs of varying scope and with varying data fields.

[0015] An example query of a multiplex graph may be directed to a particular multiplex graph that contains the following individual graphs: a graph derived from a business enterprise's employee profile database (for example, a database containing information such as employee names, social security numbers, home addresses, salaries, job titles, and so forth); and a graph derived from a health services database, which also contains information for one or multiple employees of the business organization.

[0016] Processing the query involves identifying entities (e.g., employees) that are represented by the nodes in these individual graphs and creating links across the graphs (i.e., multiplex edges). A given entity may be represented by corresponding nodes that appear in multiple graphs of the set of graphs forming the multiplex graph. If not for measures discussed herein, a query that spans a multiplex graph, such as the one described above, may potentially reveal sensitive personal information, such as, for example, medical histories of employees of the business organization.

[0017] Techniques and systems are disclosed herein to "entity anonymize" the result of a query that spans, or traverses, a multiplex graph. Entity anonymization refers to a query processing technique that, in general, reduces the likelihood that the query reveals that a given entity is the same across two or more graphs of a multiplex graph. In accordance with example systems and techniques that are discussed herein, an initial multiplex graph to be spanned by the query is first controllably distorted so that the query actually traverses the distorted multiplex graph instead of the initial multiplex graph. Referring to Fig. 1, in accordance with an example implementation, a query processing system 100 may include a query engine 150 that provides a query result 160 in response to a query 105.

[0018] For the example that is depicted in Fig. 1, the query 105 spans an example multiplex graph 120 in that processing the query generally involves traversing the multiplex graph 120. As illustrated in Fig. 1, the multiplex graph 120 contains multiple graphs 124, which may be, in general, interlinked by multiplex edges (not shown) that define relationships between nodes 126 of the graphs 124.

[0019] For purposes of entity anonymizing the query result 160, the query processing system 100 may replace the initial multiplex graph 120 with a transformed multiplex graph 130 for processing by the query engine 150. The query processing system 100 may include an anonymization engine 110 that controllably distorts the multiplex graph 120 to produce the transformed multiplex graph 130 for purposes of performing entity anonymization.

[0020] Similar to the multiplex graph 120, the transformed multiplex graph 130 may contain a set of interlinked graphs 134; and as described herein, the anonymization engine 110, in the controllable distortion, may preserve the topology of the initial multiplex graph 120, such as the motifs (subgraph patterns, for example), numbers of outgoing edges, clustering coefficients, and/or other topological properties of the multiplex graph 120. Although the topological properties may be preserved, the anonymization engine 110, in the controllable distortion, may change graph properties in a manner that prohibits, or at least reduces, the likelihood that the query result 160 reveals that a given entity exists across multiple graphs 124.

[0021] Fig. 2 depicts a technique to perform entity anonymization in accordance with example implementations. Referring to Fig. 2, the technique 200 includes receiving (block 204) data representing a given multiplex graph, such as receiving data that represents the multiplex graph 120 of Fig. 1, for example. The technique 200 includes entity anonymizing (block 208) the result for a query directed to the given multiplex graph to reduce the likelihood that the result of the query reveals an entity that is common to multiple graphs of the given multiplex graph, such as, for example, through the use of an entity anonymization engine (the anonymization engine 110 of Fig. 1, for example). This entity anonymization includes processing the data to controllably distort the given multiplex graph to produce another multiplex graph (such as multiplex graph 130 of Fig. 1, for example) to be processed for the query in place of the given multiplex graph.

[0022] The controllable distortion of the multiplex graph may be performed in many different ways, depending on the particular implementation. In accordance with example implementations, the controllable distortion may be performed by one or more of the following: node replacement, multiplex edge strength modulation and multiplex edge addition. Fig. 3 depicts a technique 300 for controllably distorting a given multiplex graph for purposes of entity anonymization, according to example implementations.

[0023] Pursuant to the technique 300, nodes of the given multiplex graph are replaced (block 304) with nodes that are determined to be statistically and/or topologically similar to the replaced nodes, using any of a number of techniques, as can be appreciated by one of ordinary skill in the art.

[0024] In some examples, according to the technique 300, the strength(s) of one or multiple multiplex edges are modulated, pursuant to block 308. Depending on the implementation, the multiplex edge(s) selected for modulation may be all multiplex edge(s) between nodes representing common entities; may be a predefined number of such multiplex edge(s); or may be selected based on other criteria. The strength of a multiplex edge refers to the degree to which the nodes connected by the edge are related and may be, as examples, a binary indication that denotes the existence/non-existence of a relationship or a probability that represents the likelihood that a relationship exists. The modulation of the edge strength refers to changing the edge strength, such as through removing the edge (assigning a strength of zero), increasing the edge strength to represent a stronger relationship or decreasing the edge strength to represent a weaker relationship.
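The edge strength modulation described above can be sketched as follows. This is a minimal illustration, assuming that strengths are probabilities in the range [0, 1] and that modulation is a multiplicative scaling; the function name `modulate_strength`, the `factor` parameter, and the string node labels are hypothetical.

```python
def modulate_strength(multiplex_edges, edge, factor):
    """Return a copy of the edge map with one strength scaled and clamped to [0, 1].

    A factor of 0 removes the relationship (strength zero), a factor below 1
    weakens it, and a factor above 1 strengthens it.
    """
    updated = dict(multiplex_edges)
    updated[edge] = min(1.0, max(0.0, updated[edge] * factor))
    return updated


edge = ("employee:A", "casino:A'")
edges = {edge: 0.9}
weakened = modulate_strength(edges, edge, 0.5)
removed = modulate_strength(edges, edge, 0.0)
strengthened = modulate_strength(edges, edge, 2.0)
```

Returning a copy leaves the initial edge map untouched, which matches the idea that the distorted graph is processed in place of the initial graph rather than overwriting it.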

[0025] Also, in accordance with example implementations, one or multiple multiplex edges may be added, pursuant to block 312, for purposes of creating noise to further obscure identification of entities that are common to multiple graphs of the multiplex graph. In accordance with example implementations, multiplex edge(s) may be created if similar nodes do not exist due to the size of the multiplex graph or the sizes of the graphs forming the multiplex graph (based on size thresholds represented by data (administrator data, for example) stored in the query processing system, for example).

[0026] The examples of Fig. 3 are not limiting. One or more techniques described in Fig. 3 may be used to controllably distort a multiplex graph for entity anonymization. Further, other and/or different ways may be used to controllably distort a multiplex for entity anonymization, such as ways that are not in Fig. 3. Thus, many implementations are contemplated, which are within the scope of the appended claims.

[0027] Fig. 4 depicts an example illustration 400 of node replacement for purposes of controllably distorting a multiplex graph. For this example, an employee profile graph 404 contains various nodes pertaining to employees of a given organization. For this example, the employee profile graph 404 includes an example node A 408. The employee profile graph 404 may represent various attributes of employees of the business organization, such as employee salaries, social security numbers, residential addresses, job titles, and so forth.

[0028] The employees may have relationships that link the employees to other graphs. For the example illustration 400, node A 408 represents an employee of the organization that is also a member of a casino and as such, appears as node A' 424 of a casino membership graph 420. Thus, the two graphs 404 and 420 for illustration 400 form a multiplex graph, in that a multiplex edge 416 links the two nodes A 408 and A' 424 together. For the casino membership graph 420, node A' 424, although associated with the same entity, has different attributes than node A 408, such as points accumulated at the casino, length of casino membership, observed betting limits, and so forth.

[0029] Continuing the example, the anonymization engine 110 (Fig. 1) applies a transformation 430 (i.e., a controlled distortion) to change the multiplex graph containing the employee profile graph 404 and casino membership graph 420 into a transformed multiplex graph that contains respective employee profile 440 and casino membership 460 graphs. As illustrated in Fig. 4, the employee profile graph 440 contains a replacement node A_R 444, which is a replacement of the node A 408, and the casino membership graph 460 contains a replacement node A_R' 464, which is a replacement of the node A' 424. As also depicted in Fig. 4, a multiplex edge 450 exists between the replacement nodes 444 and 464. Due to the transformation 430, entity identification among the original and transformed multiplex graphs is obscured, as the replacement nodes 444 and 464 have different associated identities.

[0030] Fig. 5 depicts a technique 500 to perform node replacement according to example implementations. Referring to Fig. 5, for the technique 500, the anonymization engine 110 may consult an entity directory for purposes of identifying node replacements. The entity directory may be, for example, a table that is previously constructed by the entity anonymization engine 110, and the table may be indexed by the entities of the multiplex graph. In this manner, for an individual entity index, the table may contain a set of similar entities that may be used in replacement of the entity that is represented by the index.
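The entity directory described above can be pictured as a mapping from each entity to a set of similar entities. The following minimal sketch assumes a pairwise similarity function and a fixed threshold; `build_entity_directory`, `degree_similarity`, the threshold value, and the sample out-degree data are hypothetical choices made for illustration only.

```python
def build_entity_directory(entities, similarity, threshold):
    """Map each entity to the set of other entities at least `threshold` similar."""
    directory = {e: set() for e in entities}
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if similarity(a, b) >= threshold:
                directory[a].add(b)
                directory[b].add(a)
    return directory


# Hypothetical feature: number of outgoing edges per entity.
out_degree = {"A": 3, "B": 3, "C": 9, "D": 4}


def degree_similarity(a, b):
    """Similarity in (0, 1]; equal out-degrees give 1.0."""
    return 1.0 / (1.0 + abs(out_degree[a] - out_degree[b]))


directory = build_entity_directory(list(out_degree), degree_similarity, 0.5)
candidates_for_a = directory["A"]
```

Because the table is built ahead of time, a lookup such as `directory["A"]` is a constant-time operation when a replacement is needed.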

[0031 ] .

[0032] In accordance with example implementations, the entity anonymization engine 110 may construct the entity directory by identifying similar entities using data related to the topology of the multiplex graph, such as, for example, the number of outgoing edges, the local motifs, clustering coefficients, and so forth.
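A minimal sketch of two of the topological measures named above follows. It assumes an undirected graph stored as an adjacency dictionary, so the node degree stands in for the number of outgoing edges; the function names and the sample graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for u, v in pairs if v in adj[u])
    return linked / len(pairs)


def topo_signature(adj, node):
    """(degree, clustering coefficient) pair used to compare nodes."""
    return (len(adj[node]), clustering_coefficient(adj, node))


adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
signatures = {node: topo_signature(adj, node) for node in adj}
```

Here nodes A and B receive identical signatures, so under this sketch either could serve as a topologically similar candidate for the other.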

[0033] The entity anonymization engine 110 may identify similar nodes using a spectral analysis of the neighborhood of the nodes of the multiplex graph. In this regard, as an example, a Laplacian of the first or second relationships may be determined in this spectral analysis.
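One possible reading of the spectral analysis is sketched below, under several assumptions that are not stated in the disclosure: the neighborhood is taken to be the one-hop neighborhood of a node, the graph is undirected without self-loops, the Laplacian is L = D - A, and the largest eigenvalue (estimated by power iteration) is used as the signature. All function names are hypothetical.

```python
def laplacian(adj, nodes):
    """Graph Laplacian L = D - A of the subgraph induced by `nodes`."""
    idx = {n: i for i, n in enumerate(nodes)}
    lap = [[0.0] * len(nodes) for _ in nodes]
    for n in nodes:
        for m in adj[n]:
            if m in idx:
                lap[idx[n]][idx[m]] -= 1.0
                lap[idx[n]][idx[n]] += 1.0
    return lap


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration on a symmetric positive semidefinite matrix.

    Assumes the start vector is not orthogonal to the dominant eigenvector.
    """
    vec = [float(i + 1) for i in range(len(matrix))]
    value = 0.0
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in nxt]
        # Rayleigh quotient of the normalized vector.
        value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value


def neighborhood_signature(adj, node):
    """Largest Laplacian eigenvalue of the node's one-hop neighborhood."""
    nodes = sorted({node} | adj[node])
    return largest_eigenvalue(laplacian(adj, nodes))


adj = {
    "H": {"X", "Y", "Z"}, "X": {"H"}, "Y": {"H"}, "Z": {"H"},
    "P": {"Q"}, "Q": {"P"},
}
hub = neighborhood_signature(adj, "H")
leaf = neighborhood_signature(adj, "X")
pair = neighborhood_signature(adj, "P")
```

In this example the leaf X and the node P have the same neighborhood signature, while the hub H differs, so X and P would be grouped as spectrally similar. A fuller analysis would likely compare more of the spectrum than a single eigenvalue.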

[0034] Depending on the particular implementation, the entity anonymization engine 110 may identify entity replacements for a given entity using topological analysis, spectral analysis, a combination of topological and spectral analyses and/or other analyses.

[0035] In some examples, a synthesized node may end up replacing a node in the multiplex graph. In some examples, there may be a correlation between a number of graphs and a number of available nodes for replacement. For example, with a higher number of graphs in the multiplex, a higher number of nodes may be available for replacement. In general, a replacing node may be derived from a similar node in other graphs, may be derived from a node in the same graph but using other graphs in the multiplex and/or may be otherwise derived.

[0036] In accordance with further example implementations, other similarity techniques that do not take the connectedness properties of the nodes of the graph into account may be used to identify entity replacements, such as a technique that computes an edit, Hamming, Jaccard or cosine distance on the attributes of the nodes/edges or a technique that performs locality-sensitive hashing followed by clustering of similar items.
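As one concrete instance of an attribute-based measure, the sketch below computes a Jaccard distance over sets of node attributes and picks the closest other node. The attribute encoding as strings and the sample profiles are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over two attribute sets; 0 means identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


profiles = {
    "n1": {"engineer", "site:austin", "grade:5"},
    "n2": {"engineer", "site:austin", "grade:6"},
    "n3": {"sales", "site:boston", "grade:2"},
}

# Closest node to n1 by attribute overlap, ignoring graph connectivity.
closest = min(
    (k for k in profiles if k != "n1"),
    key=lambda k: jaccard_distance(profiles["n1"], profiles[k]),
)
```

This measure looks only at attributes, so it can be applied even when topology offers too little signal to distinguish nodes.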

[0037] According to the technique 500, the entity directory may be used (block 504) to retrieve candidate similar nodes for the next node to be replaced. As examples, the anonymization engine 110 may replace all of the nodes of the multiplex graph, nodes whose represented entities are present in more than one graph of the multiplex graph, or nodes selected according to other criteria. Although Fig. 5 logically depicts the node replacement occurring in a serial sequence, multiple nodes may be replaced in parallel, in accordance with further example implementations.

[0038] In accordance with example implementations, the anonymization engine 110 may apply several preferences (configurable preferences selected as configurable options by an administrator, preferences that are always applied by default by the engine 110, a combination of default and configuration preferences, and so forth) for purposes of selecting a particular candidate node for the replacement. For example, the technique 500 may include applying (block 508) a preference to identify candidate similar nodes that have relationships with respective graphs in which the nodes will be used.

[0039] Referring back to Fig. 4, in this manner, in determining the replacement node A_R 444, candidate nodes of the graph 404 may be preferred; and when determining the replacement node A_R' 464, candidate nodes of the graph 420 may be preferred. The technique 500 may further apply (block 512) a preference to identify candidate similar nodes that do not involve nodes being replaced. In other words, preference is given to nodes that are not connected to the nodes being replaced, in accordance with example implementations.

[0040] Based at least in part on such preferences, the technique 500 includes selecting (block 514) the candidate similar node and then synthesizing (block 516) the replacement node based on the selected similar node. In accordance with example implementations, synthesis of the replacement node includes modifying the selected similar node to have similar non-identifying attributes as the original node that is being replaced. According to example implementations, pursuant to technique 500, responsive to another node being replaced (decision block 520), control returns back to block 504. In accordance with further example implementations, the replacement node may be synthesized based at least in part on one identified candidate node or based on multiple identified candidate nodes.
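The selection and synthesis steps of blocks 508 through 516 can be sketched as follows. This is an illustration under assumptions: the two preferences are combined as a lexicographic ranking, and synthesis copies the selected similar node and then carries over the original node's non-identifying attributes. The function names, the `identifying` parameter, and the sample data are hypothetical.

```python
def select_candidate(target, candidates, graph_of, neighbors):
    """Rank candidates by the two preferences, then take the best one.

    Preference 1 (block 508): the candidate belongs to the graph where it will be used.
    Preference 2 (block 512): the candidate is not connected to the node being replaced.
    """
    def score(c):
        same_graph = graph_of[c] == graph_of[target]
        unconnected = c not in neighbors.get(target, set())
        return (same_graph, unconnected)

    # Sorting first makes ties resolve deterministically.
    return max(sorted(candidates), key=score)


def synthesize(original, similar, identifying):
    """Copy the similar node, then apply the original's non-identifying attributes."""
    node = dict(similar)
    for key, value in original.items():
        if key not in identifying:
            node[key] = value
    return node


graph_of = {"A": "employees", "B": "employees", "C": "employees", "X": "casino"}
neighbors = {"A": {"B"}}
attrs = {
    "A": {"employee_id": "E-100", "title": "engineer"},
    "C": {"employee_id": "E-300", "title": "analyst"},
}

chosen = select_candidate("A", ["B", "C", "X"], graph_of, neighbors)
replacement = synthesize(attrs["A"], attrs[chosen], identifying={"employee_id"})
```

In this sketch node C is chosen because it is in the same graph as A and is not connected to A, and the synthesized node keeps C's identifier while taking A's job title.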

[0041] It is noted that Fig. 5 depicts sequential node replacement, although node replacement may occur in parallel and in many different forms, depending on the particular implementation. Thus, many implementations are contemplated, which are within the scope of the appended claims.

[0042] Referring back to Fig. 3 in conjunction with Fig. 1, in accordance with example implementations, the entity anonymization engine 110 may modulate the strength(s) of multiplex edge(s) (i.e., pursuant to block 308 of Fig. 3) as follows. The entity anonymization engine 110 may selectively remove a given multiplex edge based on an indicated desire of the underlying databases to be connected to other databases. As an example, a given database may store metadata that represents a number or another value, which indicates a degree of anonymity for the database. For example, for Fig. 4, in associated metadata, the casino membership graph 460 may indicate a relatively high preference for social networks and thus, express a relatively low desire for anonymity; but the employee profile graph 440 may express a relatively low desire for social networks and as such, may have a relatively high desire for confidentiality. For this specific example, the resulting logical combination of the two desires may result in the anonymization engine 110 removing example multiplex edge 416 (as well as possibly other multiplex edges connecting graphs 404 and 420) in the transformed multiplex graph.

[0043] In accordance with further example implementations, the anonymization engine 110 may not entirely remove a given edge. In this manner, in accordance with example implementations, the anonymization engine 110 may modulate multiplex edge strength by modulating an edge probability, as the modulation also reduces the probability of the entity showing up in ranked results.

[0044] For purposes of the anonymization engine 110 adding one or multiple multiplex edges (i.e., pursuant to block 312 of Fig. 3) to the transformed multiplex graph, the anonymization engine 110 may create new multiplex edges based on an analysis of local properties of the nodes that are involved if the multiplex edges are formed. The anonymization engine 110 may identify involved nodes from the above-described entity directory, and new multiplex edges may be created across nodes spanning similar edges. This feature adds noise to the query result, which preserves the privacy of some graph entities.
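A minimal sketch of noise edge addition follows. It assumes that candidate pairs are drawn from an entity directory of similar nodes, that only pairs spanning two different graphs qualify, and that a seeded random choice decides which candidates become new multiplex edges. The function name, the `count` and `seed` parameters, and the sample data are hypothetical.

```python
import random


def add_noise_edges(multiplex_edges, directory, graph_of, count, seed=0):
    """Add up to `count` new multiplex edges between similar nodes of different graphs."""
    rng = random.Random(seed)
    candidates = sorted(
        (a, b)
        for a, similar in directory.items()
        for b in similar
        if a < b
        and graph_of[a] != graph_of[b]
        and (a, b) not in multiplex_edges
        and (b, a) not in multiplex_edges
    )
    rng.shuffle(candidates)
    noisy = dict(multiplex_edges)
    for pair in candidates[:count]:
        noisy[pair] = 1.0
    return noisy


graph_of = {"e1": "employees", "e2": "employees", "c1": "casino", "c2": "casino"}
directory = {"e1": {"c2", "e2"}, "e2": {"e1", "c1"}, "c1": {"e2"}, "c2": {"e1"}}
edges = {("e1", "c1"): 1.0}

noisy = add_noise_edges(edges, directory, graph_of, count=2)
```

After the call, the genuine edge between `e1` and `c1` sits alongside two plausible but spurious cross-graph edges, so a query result no longer singles out the true pairing.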

[0045] Referring back to Fig. 1, depending on the particular implementation, the above-described policies may be applied in a supervised environment in that the anonymization engine 110 may provide suggestions (via a graphical user interface (GUI), for example) and may receive confirmation from a human that may validate the anonymization being performed on a per query basis. In further example implementations, the anonymization engine 110 may run with little or no human supervision and thus, may automatically transform the multiplex graph.

[0046] In accordance with example implementations, the changes that are made by the anonymization engine 110 may be performed at query time, in that the anonymization engine 110 may create the entity directory before query time but the anonymization itself occurs at query time. In further example implementations, the anonymization engine 110 may perform the anonymization before any query is issued. While performing the changes at query time accommodates dynamic changes in the multiplex graph, performing the anonymization before any query is issued may result in reduced latencies for users.

[0047] In accordance with example implementations, the replacement of the initial multiplex graph with the transformed graph is transparent to the query engine 150. In other words, after the anonymization engine 110 controllably distorts the multiplex graph, the query engine 150 processes the query 105 normally and, in general, is unaware of the fact that the multiplex graph has been altered.
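The transparency described above can be sketched by composing a distortion step with an unchanged query function. In this illustration the graph is reduced to a set of cross-graph links, the distortion is a simple node replacement, and the query lists the linked pairs; all names and data are hypothetical.

```python
def replace_nodes(graph, replacements):
    """Return a copy of the graph with node ids swapped; the link structure is unchanged."""
    def swap(node):
        return replacements.get(node, node)

    return {"links": {(swap(a), swap(b)) for a, b in graph["links"]}}


def linked_pairs(graph):
    """Stand-in query: list cross-graph pairs, sorted for stable output."""
    return sorted(graph["links"])


def answer(query, graph, distort):
    """Distort first, then run the query on the distorted graph only."""
    return query(distort(graph))


original = {"links": {("employee:A", "casino:A'")}}
replacements = {"employee:A": "employee:A_R", "casino:A'": "casino:A_R'"}

result = answer(linked_pairs, original, lambda g: replace_nodes(g, replacements))
```

The query function `linked_pairs` is identical whether it receives the original or the distorted graph, which is the sense in which the substitution is transparent to the query engine.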

[0048] Referring to Fig. 6, depending on the particular example implementation, the query processing system 100 may or may not be executed in the form of one or multiple guest virtual machines (VMs). Regardless of whether virtual machines are employed or not, the query processing system 100 executes on a physical machine platform, as represented by physical machine 600 in Fig. 6.

[0049] In this regard, the physical machine 600 is an actual machine that is made up of actual hardware 610 and actual machine executable instructions 660. As examples, the hardware 610 may include one or more central processing units (CPUs) 614, memory 616 (non-volatile memory and volatile memory, for example) and one or multiple network interfaces 620. The machine executable instructions 660 may include instructions that, when executed by one or more of the CPU(s) 614, may form the query engine 150, anonymization engine 110, an operating system 664, one or multiple device drivers 668, and so forth.

[0050] Although Fig. 6 depicts the physical machine 600 as being contained in a single box, or rack, it is noted that the physical machine 600 may be formed from multiple boxes or racks. Moreover, the physical machine 600 may be, in accordance with example implementations, a distributed processing system that is physically located at different geographical locations. Thus, many implementations are contemplated, which are within the scope of the appended claims.

[0051] While the present techniques have been described with respect to a number of embodiments, it will be appreciated that numerous modifications and variations may be applicable therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the scope of the present techniques.

Claims

What is claimed is:

1. A method comprising:

receiving data representing a first multiplex graph; and

entity anonymizing a result for a query directed to a given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph, wherein entity anonymizing the result comprises processing the data in a processor-based machine to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.

2. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises replacing at least one node of the given multiplex graph based at least in part on a node determined to be similar to the node being replaced.

3. The method of claim 2, further comprising identifying the at least one similar node based at least in part on a spectral analysis of the multiplex graph or an analysis that is based at least in part on a topology of the multiplex graph.

4. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises modulating at least one edge extending between graphs of the given multiplex graph.

5. The method of claim 1, wherein processing the data to controllably distort the given multiplex graph comprises selectively adding at least one edge between graphs of the first multiplex graph.

6. A system comprising:

a query engine to process a query directed to a given multiplex graph; and

an anonymization engine comprising a processor to controllably distort the given multiplex graph to provide another multiplex graph to be processed for the query by the query engine in place of the given multiplex graph.

7. The system of claim 6, wherein the anonymization engine replaces at least one node of the given multiplex graph based on a node identified by an entity directory as being similar to the node being replaced to controllably distort the given multiplex graph.

8. The system of claim 7, wherein the entity directory identifies multiple candidate similar nodes for the node being replaced, and the anonymization engine synthesizes the replacement node based at least in part on the identified candidate nodes.

9. The system of claim 6, wherein the anonymization engine modulates at least one multiplex edge of the given multiplex graph to controllably distort the first multiplex.

10. The system of claim 6, wherein the anonymization engine selectively adds at least one multiplex edge to the given multiplex graph to selectively distort the given multiplex graph.

11. A non-transitory computer readable storage medium to store instructions that when executed by a processor-based system cause the processor-based system to:

receive data representing a given multiplex graph; and

entity anonymize a result for a query directed to the given multiplex graph to reduce a likelihood that the result reveals an entity common to multiple graphs of the given multiplex graph, the instructions to cause the processor-based system to process the data to controllably distort the given multiplex graph to produce another multiplex graph to be processed for the query in place of the given multiplex graph.

12. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises replacing at least one node of the given multiplex graph based at least in part on a node determined to be similar to the node being replaced.

13. The medium of claim 12, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to identify the at least one similar node based at least in part on a spectral analysis or topological analysis of the multiplex graph.

14. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises modulating at least one edge extending between graphs of the given multiplex graph.

15. The medium of claim 11, the storage medium storing instructions that when executed by the processor-based system cause the processor-based system to process the data to controllably distort the given multiplex graph comprises selectively adding at least one edge between graphs of the given multiplex graph.