WO2015161899A1 - Determine relationships between entities in datasets - Google Patents

Determine relationships between entities in datasets

Info

Publication number
WO2015161899A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
dataset
probability
entities
controller
Prior art date
Application number
PCT/EP2014/058525
Other languages
French (fr)
Inventor
Luis Miguel Vaquero Gonzalez
Original Assignee
Hewlett Packard Development Company L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Company L.P.
Priority to PCT/EP2014/058525
Publication of WO2015161899A1

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems

Definitions

  • Datasets are arranged to store information in a structured manner.
  • a dataset for a social network may store various details concerning a user, such as, user name, user date of birth, user interests, user workplace and so on.
  • Datasets may, at least partially, store substantially the same information but using different wording.
  • a user's interests may include 'playing musical instruments'
  • the same user's interests may include 'musician'.
  • Datasets may also store different information for the same entity. For example, in a first dataset, there may be an entity "Father of two children", and in a second dataset, there may be an entity "computer software engineer" for the same person.
  • Fig. 1 illustrates a schematic diagram of an apparatus according to an example.
  • Fig. 2 illustrates a multiplex graph according to an example.
  • Fig. 3 illustrates a flow diagram of a method according to an example.
  • Fig. 4 illustrates a flow diagram of another method according to an example.
  • Fig. 5 illustrates a schematic diagram of a controller according to an example.
  • Fig. 1 illustrates a schematic diagram of an apparatus 10 including a controller 12, input apparatus 14, and output apparatus 16.
  • the apparatus 10 may also be referred to as a "computer apparatus", a "computer" or a "data storage apparatus".
  • the apparatus 10 may be a single machine where the input apparatus 14 and the output apparatus 16 are connected to the controller 12 via a wired or a wireless link, and the controller 12, the input apparatus 14 and the output apparatus 16 are located in close proximity to one another (for example, in the same room as one another).
  • the apparatus 10 may be a distributed apparatus where the input apparatus 14 and the output apparatus 16 are located remotely from the controller 12 (for example, in a different room, in a different building, in a different city, or in a different country).
  • the apparatus 10 may be a module.
  • 'module' refers to a unit or apparatus that excludes certain parts/components that would be added by an end manufacturer or a user.
  • the apparatus 10 may comprise the controller 12 and the remaining components (namely, the input apparatus 14 and the output apparatus 16) may be added by an end manufacturer.
  • the implementation of the controller 12 can be in hardware alone (for example, a circuit, a processor and so on), have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
  • the controller 12 may be implemented using instructions that enable hardware functionality, for example, by using executable computer program instructions 22 in a general-purpose or special-purpose processor 18 that may be stored on a computer readable storage medium 20 (disk, memory and so on) to be executed by such a processor 18.
  • the processor 18 is configured to read from and write to the memory 20.
  • the processor 18 may also comprise an output interface via which data and/or commands are output by the processor 18 and an input interface via which data and/or commands are input to the processor 18.
  • the memory 20 stores a computer program 22 comprising computer program instructions that control the operation of the apparatus 10 when loaded into the processor 18.
  • the computer program instructions 22 provide the logic and routines that enable the apparatus 10 to perform the methods illustrated in Fig. 3 and described in the following paragraphs.
  • the processor 18 by reading the memory 20 is able to load and execute the computer program 22.
  • the computer program 22 may arrive at the apparatus 10 via any suitable delivery mechanism 24.
  • the delivery mechanism 24 may be, for example, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a compact disc read-only memory (CD-ROM) or digital versatile disc (DVD), an article of manufacture that tangibly embodies the computer program 22.
  • the delivery mechanism 24 may be a signal configured to reliably transfer the computer program 22.
  • the apparatus 10 may propagate or transmit the computer program 22 as a computer data signal.
  • the memory 20 stores a plurality of datasets 26.
  • the plurality of datasets 26 may be formed from a plurality of separate databases (that is, the plurality of datasets 26 is provided by a plurality of separate database files). In other examples, the plurality of datasets 26 may be formed from separate data matrices within a single database (that is, the plurality of datasets 26 is provided by a single database file).
  • the memory 20 may receive new datasets which are then stored in the memory 20.
  • the input apparatus 14 may comprise any suitable apparatus for enabling a user to provide an input signal to the controller 12.
  • the input apparatus 14 may include at least one of a keyboard, a keypad, a computer mouse, and a touch screen display.
  • the controller 12 is arranged to receive input signals from the input apparatus 14.
  • the output apparatus 16 may comprise any suitable apparatus for providing information to a user.
  • the output apparatus 16 may include at least one display (such as a liquid crystal display or a light emitting diode display).
  • the controller 12 is arranged to control the output apparatus 16 to provide information to the user.
  • Fig. 2 illustrates a multiplex graph 28 for a plurality of datasets 26 according to an example.
  • a multiplex graph is a visual representation of a plurality of datasets whereby the nodes of the multiplex graph represent entities of a dataset, and connections between the nodes represent links between the entities.
  • the entities of a dataset are provided in a vertical column (representing the contents of the dataset) and the columns for the datasets are positioned adjacent one another.
  • the plurality of datasets 26 includes a first dataset 261 for 'IT Support', a second dataset 262 for 'Operations', a third dataset 263 for 'Service Management', and a fourth dataset 264 for 'Service Marketing'.
  • the first dataset 261 is represented by a first graph 30 and includes the entities: host 32 and log 34.
  • the log entities 34 are connected to the host entities 32 via non-probabilistic links (that is, a link having a probability of 100%, in other words, a certain relationship).
  • the non-probabilistic links are indicated by solid lines.
  • the second dataset 262 is represented by a second graph 36 and includes the entities: host 32, chassis 38, disk image 40 and host monitoring metrics 42.
  • the disk image entities 40 and the host monitoring metric entities 42 are connected to the host entities 32 via non-probabilistic links.
  • the chassis entity 38 interconnects the host entities 32 via non-probabilistic links.
  • the third dataset 263 is represented by a third graph 44 and includes the entities: host 32, service 46, service description 48, and service monitoring metrics 50.
  • the service entities 46 are connected to one another and to host entities 32, service description entities 48 and to service monitoring metric entities 50 via non-probabilistic links.
  • the fourth dataset 264 is represented by a fourth graph 52 and includes the entities: service 46, service description 48 and service SLA 54.
  • the service entities 46 are connected to one another, to the service description entities 48 and to the service SLA entities 54 via non-probabilistic links.
  • the host entities 32 of the first graph 30 are connected to the host entities 32 of the second graph 36 via probabilistic links (that is, a link having a probability greater than 0% and less than 100%, in other words, an uncertain relationship). As illustrated in Fig. 2, the probabilistic links are indicated by dashed lines.
  • the host entities 32 of the second graph 36 are connected to the host entities 32 of the third graph 44 via probabilistic links.
  • the service entities 46 of the third graph 44 are connected to the service entities 46 of the fourth graph 52 via probabilistic links.
  • the first, second, third and fourth graphs 30, 36, 44, and 52 share some entities.
  • the first, second and third graphs comprise host entities 32.
  • the shared entities may have at least one difference in the different graphs.
  • the host entities 32 may have a different entity name in each of the first, second and third graphs 30, 36 and 44.
  • the non-probabilistic links may be defined by the provider (or providers) of the datasets 261, 262, 263, 264.
  • the probabilistic links are defined in accordance with the methods illustrated in Fig. 3 and described in the following paragraphs.
  • Fig. 3 illustrates a flow diagram of a method to determine relationships between entities in a plurality of datasets according to an example. In the following paragraphs, the method is described with reference to the datasets illustrated in Fig. 2 as an example.
  • the controller 12 initiates the method to determine at least one relationship between entities in a plurality of datasets. The method may be initiated by a user. For example, a user may use the input apparatus 14 to provide a control signal to the controller 12 to initiate the method. The method may additionally or alternatively be initiated by the controller 12. For example, the controller 12 may determine that a new dataset has been stored in the memory 20 and may then initiate the method in response. At block 56, the controller 12 selects a first entity from a first dataset, where the first dataset comprises at least the first entity. For example, the controller 12 may select a host entity 32 in the IT Support dataset 261.
  • the controller 12 selects a second entity from a second dataset, where the second dataset comprises at least the second entity.
  • the controller 12 may select a host entity 32 in the operations dataset 262.
  • the controller 12 determines a probability that the first entity and the second entity are related by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity.
  • a characteristic of an entity is information that, at least in part, defines the entity and may enable differentiation between entities.
  • a characteristic of an entity may be information that (at least in part) identifies the entity.
  • a characteristic may identify the entity by including a name of the entity, or a brief description of the entity.
  • a characteristic of an entity may be information that provides an attribute of the entity.
  • such a characteristic may be a physical location or may be available memory at a node.
  • a characteristic of an entity may be information that provides the structure of the dataset connected to that entity. For example, such a characteristic may provide a list of entities that are connected to the entity via non-probabilistic links.
  • an entity has the characteristics: entity name (for example node name or internet protocol (IP)), entity text description (for example, a short plain text description of the entity), entity attributes (for example, available memory in a node or physical location of the entity), text description of an entity attribute (for example, a short text description of an attribute of the entity), entity connections (that is, dataset structure connected to the entity), and links to available online resources (for example, a link to an online encyclopaedia).
  • entity text description, the text description of an entity attribute and the links to available online resources may be metadata which may be provided by the owner or provider of the dataset.
  • an entity may have different characteristics to those described in this paragraph.
  • the probability may be determined from matches between the plurality of characteristics of the first entity and the plurality of characteristics of the second entity and a plurality of weighting factors for the plurality of characteristics.
  • the probability may be calculated as the sum of weighting factors multiplied by matching result of a characteristic between entities.
  • the matching result of a characteristic between entities may be 1 where there is a match, and may be 0 where there is no match.
  • the weighting factors may have values of less than 1 and may be dynamically altered by the controller 12 using feedback (as explained in greater detail in later paragraphs).
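  • the probability may be determined from the following equation: $P = \sum_{c} w_c \, s_c$, where the sum runs over the compared characteristics $c$, $w_c$ is the weighting factor for characteristic $c$, and $s_c$ is the result of the match for that characteristic.
  • the result of a match is a binary function, s, indicating whether there was a match or not.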
  • Entity name matching may be performed by case insensitive UTF-8 string matching.
  • a name match does not necessarily mean that the entities are the same (for example, 'machine 1' can be a name given to a machine by different providers, but each provider refers to its own 'machine 1').
  • Such chance matches may occur in relatively large datasets. However, most names will not coincide just by chance, so name matches will be infrequent and the weight of that feature may be gradually decreased (as described below with reference to block 66).
  • NER: Named Entity Extraction
  • POS: Part of Speech
  • NER/POS may also be applied to attribute descriptions. If there is a match between the attribute description and the name of any two attributes in the entity, their values are compared. If there is a match in the value (literal case insensitive character match for text and numeric value for numbers), then $s_{\text{attribute\_description}}$ is set to 1. If no attribute description was provided in the metadata, only the name of the attribute is used for deciding whether or not the value of the attribute should be compared.
  • One approach to matching dataset structural features comprises checking whether two entities in the two separate datasets are connected to the same entities. This approach may require a second pass over the whole dataset: once some potential multiplex links have been identified, the dataset structure is analysed to detect whether an entity has the same connections across datasets (for example, the service entity 46 is linked to the same service description entity 48 in the datasets 263 and 264).
  • the controller 12 may determine whether the probability (determined in block 60) exceeds a threshold probability value. In some examples, a user may operate the input apparatus 14 to set the threshold probability value. In other examples, the threshold probability value may be defined during manufacturing and stored in the memory 20. At block 64, the controller 12 may create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset using the probability determined in block 60. Where the method includes block 62, block 64 is performed when the determined probability is greater than the threshold probability value, and block 64 is not performed (that is, no probabilistic link is created) when the determined probability is less than the threshold probability value.
  • where a match is found between the entities for the characteristic 'link to available online resources', the controller 12 determines that there is a certain relationship between the entities and then creates a non-probabilistic link between the two entities. If no match is found between the entities for the characteristic 'link to available online resources', the controller 12 continues with block 60 by determining whether there are any matches for other characteristics. At block 66, the controller 12 may dynamically adjust a weighting factor associated with a first characteristic of the plurality of characteristics in dependence on the frequency of matching of the first characteristic.
  • the weighting factors for the characteristics may be the same.
  • the weighting factor (for example, $w_{\text{name}}$) for that characteristic is increased.
  • the controller 12 may receive a user input signal indicative of whether a user confirms or cancels a created probabilistic link between the first entity and the second entity. For example, the controller 12 may control the output apparatus 16 to display the multiplex graph 28 including the newly created probabilistic link. The user may operate the input apparatus 14 to either confirm the newly created probabilistic link or cancel the newly created probabilistic link.
  • the controller 12 adjusts at least one of the plurality of weighting factors indicating a match between the first entity and the second entity, using the user input signal received in block 68. For example, where the user confirms a probabilistic link between the host entities 32 in the first graph 30 and the host entities 32 in the second graph 36, and the characteristics entity name and entity attributes have been matched, the controller 12 increases the weighting factors for the characteristics: entity name and entity attributes. By way of another example, where the user cancels a probabilistic link between the host entities 32 in the first graph 30 and the host entities 32 in the second graph 36, and the characteristics entity name and entity attributes have been matched, the controller 12 decreases the weighting factors for the characteristics: entity name and entity attributes.
  • the probability of the remaining probabilistic links may be recalculated using the adjusted weighting factors.
  • the controller 12 automatically proceeds to create probabilistic links between the remaining entities. For example, if there are one million multiplex links between datasets, the user may confirm a subset of those links (ten for example) and the controller 12 then automatically creates the remaining probabilistic links, given a target probable error (which may be set by the user using the input apparatus 14 for the automated creation).
  • the methods illustrated in Fig. 3 and described in the preceding paragraphs may provide an advantage in that they enable the creation of multiplex graphs including non-probabilistic links and probabilistic links between entities in a plurality of datasets. Furthermore, the weighting of different characteristics may be dynamically adjusted in accordance with feedback (either from successful matches or from a user) to improve the accuracy of the probabilistic links, and to improve the decision as to whether a probabilistic link should be created (where probabilistic links are created when the determined probability is greater than the threshold probability value). Additionally, the methods illustrated in Fig. 3 may be performed automatically by the controller 12 and without user intervention.
  • Fig. 3 may represent steps in a method and/or sections of code in the computer program 22.
  • the illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied in some examples.
  • Fig. 4 illustrates a flow diagram of another method according to an example. The flow diagram includes blocks 56, 58, 60 and 64 illustrated in Fig. 3 and omits blocks 62, 66, 68 and 70 illustrated in Fig. 3.
  • Fig. 5 illustrates a schematic diagram of a controller 12 according to an example.
  • the controller 12 includes a probabilistic link creation module 72, a first dataset 74, a second dataset 76, and a user input module 78.
  • the probabilistic link creation module 72 is to perform blocks 56, 58, 60, 62, 64, 66 and 70.
  • the user input module 78 is to perform block 68.
  • the probabilistic link creation module 72 and the user input module 78 may be software modules that are executable by the controller 12.
  • references to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry.
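
The threshold check of block 62 and the user-feedback adjustment of blocks 68 and 70 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function names, the fixed step size of 0.05 and the clamping of weights to the range 0 to 1 are assumptions, since the text only states that weights are increased on confirmation and decreased on cancellation.

```python
from typing import Dict, Set


def exceeds_threshold(probability: float, threshold: float) -> bool:
    """Block 62: a probabilistic link is created only above the threshold.

    The text does not say what happens when the two values are equal;
    a strict comparison is assumed here.
    """
    return probability > threshold


def adjust_weights(
    weights: Dict[str, float],
    matched: Set[str],
    confirmed: bool,
    step: float = 0.05,  # hypothetical step size, not given in the text
) -> Dict[str, float]:
    """Blocks 68/70: raise the weights of matched characteristics when the
    user confirms a link, lower them when the user cancels it."""
    delta = step if confirmed else -step
    adjusted = dict(weights)  # leave the caller's weights untouched
    for characteristic in matched:
        # Clamping to [0, 1] is an assumption made for this sketch.
        adjusted[characteristic] = min(1.0, max(0.0, adjusted[characteristic] + delta))
    return adjusted


# Illustrative values only.
weights = {"name": 0.3, "attributes": 0.2, "structure": 0.2}
matched = {"name", "attributes"}

after_confirm = adjust_weights(weights, matched, confirmed=True)
after_cancel = adjust_weights(weights, matched, confirmed=False)
```

After an adjustment, the probabilities of the remaining probabilistic links would be recalculated with the new weights, as the text describes.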

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method to determine relationships between entities in a plurality of datasets in which a first entity is selected from a first dataset, the first dataset comprising at least the first entity. A second entity is selected from a second dataset, the second dataset comprising at least the second entity. A probability that the first entity and the second entity are related is determined by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity. In one example, a probabilistic link between the first entity of the first dataset and the second entity of the second dataset is created using the determined probability.

Description

[0001] TITLE
[0002] Determine relationships between entities in datasets
[0003] BACKGROUND
[0004] Datasets are arranged to store information in a structured manner. For example, a dataset for a social network may store various details concerning a user, such as, user name, user date of birth, user interests, user workplace and so on. Datasets may, at least partially, store substantially the same information but using different wording. For example, in a first dataset, a user's interests may include 'playing musical instruments', whereas, in a second dataset, the same user's interests may include 'musician'. Datasets may also store different information for the same entity. For example, in a first dataset, there may be an entity "Father of two children", and in a second dataset, there may be an entity "computer software engineer" for the same person.
[0005] BRIEF DESCRIPTION
[0006] Reference will now be made by way of example only to the accompanying drawings in which:
[0007] Fig. 1 illustrates a schematic diagram of an apparatus according to an example;
[0008] Fig. 2 illustrates a multiplex graph according to an example;
[0009] Fig. 3 illustrates a flow diagram of a method according to an example;
[0010] Fig. 4 illustrates a flow diagram of another method according to an example; and
[0011] Fig. 5 illustrates a schematic diagram of a controller according to an example.
[0012] DETAILED DESCRIPTION
[0013] A user may find datasets difficult to manage (for example, search, edit), particularly as the size of the datasets increases. To aid the user, a dataset may be visually represented by a graph where nodes represent entities in the dataset, and connections represent the links (also referred to as 'edges') between the nodes. However, the datasets may, at least partially, store substantially the same information but using different wording. This may be challenging for the user to identify.
[0014] As described in the following paragraphs, probabilistic links may be created between entities in different datasets to enable a user to determine relationships between the datasets and to handle the data in the datasets more efficiently.
[0015] Fig. 1 illustrates a schematic diagram of an apparatus 10 including a controller 12, input apparatus 14, and output apparatus 16. The apparatus 10 may also be referred to as a "computer apparatus", a "computer" or a "data storage apparatus". In some examples, the apparatus 10 may be a single machine where the input apparatus 14 and the output apparatus 16 are connected to the controller 12 via a wired or a wireless link, and the controller 12, the input apparatus 14 and the output apparatus 16 are located in close proximity to one another (for example, in the same room as one another). In other examples, the apparatus 10 may be a distributed apparatus where the input apparatus 14 and the output apparatus 16 are located remotely from the controller 12 (for example, in a different room, in a different building, in a different city, or in a different country).
[0016] In some examples, the apparatus 10 may be a module. As used herein, 'module' refers to a unit or apparatus that excludes certain parts/components that would be added by an end manufacturer or a user. For example, where the apparatus 10 is a module, the apparatus 10 may comprise the controller 12 and the remaining components (namely, the input apparatus 14 and the output apparatus 16) may be added by an end manufacturer.
[0017] The implementation of the controller 12 can be in hardware alone (for example, a circuit, a processor and so on), have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
[0018] The controller 12 may be implemented using instructions that enable hardware functionality, for example, by using executable computer program instructions 22 in a general-purpose or special-purpose processor 18 that may be stored on a computer readable storage medium 20 (disk, memory and so on) to be executed by such a processor 18.
[0019] The processor 18 is configured to read from and write to the memory 20. The processor 18 may also comprise an output interface via which data and/or commands are output by the processor 18 and an input interface via which data and/or commands are input to the processor 18.
[0020] The memory 20 stores a computer program 22 comprising computer program instructions that control the operation of the apparatus 10 when loaded into the processor 18. The computer program instructions 22 provide the logic and routines that enable the apparatus 10 to perform the methods illustrated in Fig. 3 and described in the following paragraphs. The processor 18 by reading the memory 20 is able to load and execute the computer program 22.
[0021] The computer program 22 may arrive at the apparatus 10 via any suitable delivery mechanism 24. The delivery mechanism 24 may be, for example, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a compact disc read-only memory (CD-ROM) or digital versatile disc (DVD), an article of manufacture that tangibly embodies the computer program 22. The delivery mechanism 24 may be a signal configured to reliably transfer the computer program 22. The apparatus 10 may propagate or transmit the computer program 22 as a computer data signal.
[0022] The memory 20 stores a plurality of datasets 26. The plurality of datasets 26 may be formed from a plurality of separate databases (that is, the plurality of datasets 26 is provided by a plurality of separate database files). In other examples, the plurality of datasets 26 may be formed from separate data matrices within a single database (that is, the plurality of datasets 26 is provided by a single database file). The memory 20 may receive new datasets which are then stored in the memory 20.
[0023] The input apparatus 14 may comprise any suitable apparatus for enabling a user to provide an input signal to the controller 12. For example, the input apparatus 14 may include at least one of a keyboard, a keypad, a computer mouse, and a touch screen display. The controller 12 is arranged to receive input signals from the input apparatus 14.
[0024] The output apparatus 16 may comprise any suitable apparatus for providing information to a user. For example, the output apparatus 16 may include at least one display (such as a liquid crystal display or a light emitting diode display). The controller 12 is arranged to control the output apparatus 16 to provide information to the user.
[0025] Fig. 2 illustrates a multiplex graph 28 for a plurality of datasets 26 according to an example. A multiplex graph is a visual representation of a plurality of datasets whereby the nodes of the multiplex graph represent entities of a dataset, and connections between the nodes represent links between the entities. The entities of a dataset are provided in a vertical column (representing the contents of the dataset) and the columns for the datasets are positioned adjacent one another. The plurality of datasets 26 includes a first dataset 261 for 'IT Support', a second dataset 262 for 'Operations', a third dataset 263 for 'Service Management', and a fourth dataset 264 for 'Service Marketing'.
[0026] The first dataset 261 is represented by a first graph 30 and includes the entities: host 32 and log 34. The log entities 34 are connected to the host entities 32 via non-probabilistic links (that is, a link having a probability of 100%, in other words, a certain relationship). As illustrated in Fig. 2, the non-probabilistic links are indicated by solid lines.
[0027] The second dataset 262 is represented by a second graph 36 and includes the entities: host 32, chassis 38, disk image 40 and host monitoring metrics 42. The disk image entities 40 and the host monitoring metric entities 42 are connected to the host entities 32 via non-probabilistic links. The chassis entity 38 interconnects the host entities 32 via non-probabilistic links.
[0028] The third dataset 263 is represented by a third graph 44 and includes the entities: host 32, service 46, service description 48, and service monitoring metrics 50. The service entities 46 are connected to one another and to host entities 32, service description entities 48 and to service monitoring metric entities 50 via non-probabilistic links.
[0029] The fourth dataset 264 is represented by a fourth graph 52 and includes the entities: service 46, service description 48 and service SLA 54. The service entities 46 are connected to one another, to the service description entities 48 and to the service SLA entities 54 via non-probabilistic links.
[0030] The host entities 32 of the first graph 30 are connected to the host entities 32 of the second graph 36 via probabilistic links (that is, a link having a probability greater than 0% and less than 100%, in other words, an uncertain relationship). As illustrated in Fig. 2, the probabilistic links are indicated by dashed lines. The host entities 32 of the second graph 36 are connected to the host entities 32 of the third graph 44 via probabilistic links. The service entities 46 of the third graph 44 are connected to the service entities 46 of the fourth graph 52 via probabilistic links. The service description entities 48 of the third graph 44 are connected to the service description entities 48 of the fourth graph 52 via probabilistic links.
[0031] The first, second, third and fourth graphs 30, 36, 44, and 52 share some entities. For example, the first, second and third graphs comprise host entities 32. However, the shared entities may have at least one difference in the different graphs. For example, the host entities 32 may have a different entity name in each of the first, second and third graphs 30, 36 and 44.
[0032] The non-probabilistic links may be defined by the provider (or providers) of the datasets 261, 262, 263, 264. The probabilistic links are defined in accordance with the methods illustrated in Fig. 3 and described in the following paragraphs.
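The distinction between the two kinds of link can be sketched as a small data structure. This is a minimal illustration only: the class names `Entity` and `MultiplexGraph`, the dataset labels and the entity names are assumptions made for the example and do not appear in the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    dataset: str   # hypothetical dataset label, e.g. "it_support"
    kind: str      # entity type, e.g. "host" or "log"
    name: str      # entity name as given by the dataset provider


@dataclass
class MultiplexGraph:
    # Maps an unordered pair of entities to the probability of their link.
    links: dict = field(default_factory=dict)

    def add_link(self, a: Entity, b: Entity, probability: float = 1.0) -> None:
        # A probability of exactly 1.0 models a non-probabilistic (certain) link.
        if not 0.0 < probability <= 1.0:
            raise ValueError("link probability must be in (0, 1]")
        self.links[frozenset((a, b))] = probability

    def is_probabilistic(self, a: Entity, b: Entity) -> bool:
        # A link is probabilistic when its probability is strictly below 1.
        return self.links[frozenset((a, b))] < 1.0


host_a = Entity("it_support", "host", "web-01")
log_a = Entity("it_support", "log", "web-01.log")
host_b = Entity("operations", "host", "WEB-01")

graph = MultiplexGraph()
graph.add_link(host_a, log_a)        # provider-defined, certain (solid line)
graph.add_link(host_a, host_b, 0.7)  # inferred, uncertain (dashed line)
```

Storing the probability on every link, with 1.0 reserved for provider-defined links, is one possible design; it lets both kinds of link live in the same graph.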
[0033] Fig. 3 illustrates a flow diagram of a method to determine relationships between entities in a plurality of datasets according to an example. In the following paragraphs, the method is described with reference to the datasets illustrated in Fig. 2 as an example.
[0034] At block 56, the controller 12 initiates the method to determine at least one relationship between entities in a plurality of datasets. The method may be initiated by a user. For example, a user may use the input apparatus 14 to provide a control signal to the controller 12 to initiate the method. The method may additionally or alternatively be initiated by the controller 12. For example, the controller 12 may determine that a new dataset has been stored in the memory 20 and may then initiate the method in response.
[0035] At block 56, the controller 12 selects a first entity from a first dataset, where the first dataset comprises at least the first entity. For example, the controller 12 may select a host entity 32 in the IT Support dataset 261.
[0036] At block 58, the controller 12 selects a second entity from a second dataset, where the second dataset comprises at least the second entity. For example, the controller 12 may select a host entity 32 in the operations dataset 262.
[0037] At block 60, the controller 12 determines a probability that the first entity and the second entity are related by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity.
[0038] A characteristic of an entity (which may also be referred to as a 'feature' of the entity) is information that, at least in part, defines the entity and may enable differentiation between entities. For example, a characteristic of an entity may be information that (at least in part) identifies the entity. For example, such a characteristic may identify the entity by including a name of the entity, or a brief description of the entity. By way of another example, a characteristic of an entity may be information that provides an attribute of the entity. For example, such a characteristic may be a physical location or may be available memory at a node. By way of a further example, a characteristic of an entity may be information that provides the structure of the dataset connected to that entity. For example, such a characteristic may provide a list of entities that are connected to the entity via non-probabilistic links.
[0039] In one example, an entity has the characteristics: entity name (for example node name or internet protocol (IP)), entity text description (for example, a short plain text description of the entity), entity attributes (for example, available memory in a node or physical location of the entity), text description of an entity attribute (for example, a short text description of an attribute of the entity), entity connections (that is, dataset structure connected to the entity), and links to available online resources (for example, a link to an online encyclopaedia). The entity text description, the text description of an entity attribute and the links to available online resources may be metadata which may be provided by the owner or provider of the dataset. In other examples, an entity may have different characteristics to those described in this paragraph.
[0040] The probability may be determined from matches between the plurality of characteristics of the first entity and the plurality of characteristics of the second entity and a plurality of weighting factors for the plurality of characteristics. In more detail, the probability may be calculated as the sum, over the characteristics, of each weighting factor multiplied by the matching result for that characteristic between the entities.
[0041] The matching result of a characteristic between entities may be 1 where there is a match, and may be 0 where there is no match. The weighting factors may have values of less than 1 and may be dynamically altered by the controller 12 using feedback (as explained in greater detail in later paragraphs).
[0042] For example, the probability may be determined from the following equation:
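[0043] $$p = w_{\text{name}} \cdot s_{\text{name}} + w_{\text{entity\_description}} \cdot s_{\text{entity\_description}} + w_{\text{attribute\_description}} \cdot s_{\text{attribute\_description}} + w_{\text{structure}} \cdot s_{\text{structure}}$$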
[0044] Where 'w' is the current weight for the characteristic and where 's' is the result of matching that characteristic between entities.
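The weighted sum of paragraph [0043] can be sketched in a few lines. The function name `link_probability`, the dictionary layout and the equal starting weights of 0.25 are assumptions chosen for illustration; the description only states that the weights may have values of less than 1.

```python
CHARACTERISTICS = ("name", "entity_description", "attribute_description", "structure")


def link_probability(weights: dict, matches: dict) -> float:
    """Return the sum of w * s over the four characteristics."""
    return sum(weights[c] * matches[c] for c in CHARACTERISTICS)


# Assumed equal starting weights; each match result s is 1 (match) or 0 (no match).
weights = {c: 0.25 for c in CHARACTERISTICS}
matches = {"name": 1, "entity_description": 0, "attribute_description": 1, "structure": 1}
p = link_probability(weights, matches)  # 0.25 + 0 + 0.25 + 0.25 = 0.75
```

With weights that sum to at most 1, the result stays within the range of a probability; that constraint is an assumption of this sketch rather than something the description requires.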
[0045] In some examples, the result of a match is a binary function, s, indicating whether there was a match or not. Entity name matching may be performed by case insensitive UTF-8 string matching. However, a name match does not necessarily mean that the entities are the same (e.g. 'machine 1' can be a name given to a machine by different providers, but each provider refers to its own 'machine 1'). Such random matches may occur in relatively large datasets, but most names will not be the same just by chance, and the weight of that feature may be gradually decreased (as described below with reference to block 66).
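As an illustration of the binary name match, the following sketch compares two names without regard to case. Using Python's `str.casefold` is an assumption about how case insensitive matching could be done; the description does not specify a particular comparison routine, and the function name `names_match` is hypothetical.

```python
def names_match(a: str, b: str) -> int:
    """Binary result s_name: 1 if the names are equal ignoring case, else 0."""
    # casefold() is a more aggressive form of lower() intended for caseless matching.
    return int(a.casefold() == b.casefold())
```

For example, `names_match("Machine 1", "machine 1")` gives 1, while `names_match("machine 1", "machine 2")` gives 0.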
[0046] The descriptions of the two entities may be analysed by a Named Entity Recognition (NER) and a Part of Speech (POS) parser. If the parser offers the same result for both entities, $s_{\text{entity\_description}}$ is set to 1. NER/POS may also be applied to attribute descriptions. If there is a match between the attribute description and the name of any two attributes in the entities, their values are compared. If there is a match in the value (literal case insensitive character match for text and numeric value for numbers), then $s_{\text{attribute\_description}}$ is set to 1. If no attribute description was provided in the metadata, only the name of the attribute is used for deciding whether or not the value of the attribute should be compared.
[0047] One approach to matching dataset structural features comprises checking whether two entities in the two separate datasets are connected to the same entities. This approach may require a second pass over the whole dataset: once some potential multiplex links have been identified, the dataset structure is analysed to detect if an entity has the same connections across datasets (for example, the service entity 46 is linked to the same service description entity 48 in the datasets 263 and 264).
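The attribute value comparison and the second-pass structural check can be sketched as follows. The function names, the `first_pass` mapping and the entity labels are hypothetical. The sketch also assumes that one shared connection is enough for a structural match, which the description leaves open.

```python
def values_match(x, y) -> bool:
    """Compare attribute values: caseless for text, by numeric value for numbers."""
    if isinstance(x, str) and isinstance(y, str):
        return x.casefold() == y.casefold()
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return x == y
    return False  # mixed or unsupported types are treated as no match


def structure_match(neighbours_a: set, neighbours_b: set, first_pass: dict) -> int:
    """Return 1 if the two entities share at least one connected entity.

    `first_pass` maps an entity of dataset A to the entity of dataset B that
    an earlier pass identified as its likely counterpart.
    """
    mapped = {first_pass[n] for n in neighbours_a if n in first_pass}
    return int(bool(mapped & neighbours_b))


# A service in one dataset and a service in another, each with its connections.
first_pass = {"desc-263": "desc-264"}
s_structure = structure_match({"desc-263", "metrics-1"}, {"desc-264", "sla-1"}, first_pass)
```

Here `s_structure` is 1 because both services are connected to service descriptions that the first pass already paired.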
[0048] At block 62, the controller 12 may determine whether the probability (determined in block 60) exceeds a threshold probability value. In some examples, a user may operate the input apparatus 14 to set the threshold probability value. In other examples, the threshold probability value may be defined during manufacturing and stored in the memory 20.
[0049] At block 64, the controller 12 may create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset using the probability determined in block 60. Where the method includes block 62, block 64 is performed when the determined probability is greater than the threshold probability value, and block 64 is not performed (that is, no probabilistic link is created) when the determined probability is less than the threshold probability value.
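Blocks 62 and 64 can be sketched as a single guarded step. The function name `maybe_create_link` and the dictionary used to hold links are assumptions. The description does not say what happens when the probability equals the threshold; this sketch assumes no link is created in that case, since the claims speak of the probability exceeding the threshold.

```python
from typing import Optional


def maybe_create_link(links: dict, first: str, second: str,
                      probability: float, threshold: Optional[float] = None) -> bool:
    """Create a probabilistic link, optionally gated by a threshold (block 62)."""
    # When block 62 is omitted (threshold is None) the link is always created.
    if threshold is not None and probability <= threshold:
        return False
    links[(first, second)] = probability
    return True


links = {}
created_high = maybe_create_link(links, "host-1", "HOST-1", 0.75, threshold=0.5)
created_low = maybe_create_link(links, "host-2", "node-9", 0.25, threshold=0.5)
```

After these two calls only the first pair is linked, with its probability of 0.75 stored on the link.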
[0050] In some examples, if a match is found between entities for the characteristic 'link to available online resources' (for example, the entities both have metadata that includes the same link to an online encyclopaedia), the controller 12 determines that there is a certain relationship between the entities and then creates a non-probabilistic link between the two entities. If no match is found between the entities for the characteristic 'link to available online resources', the controller 12 continues with block 60 by determining whether there are any matches for other characteristics.
[0051] At block 66, the controller 12 may dynamically adjust a weighting factor associated with a first characteristic of the plurality of characteristics in dependence on the frequency of matching of the first characteristic. For example, when relationships are determined for two datasets, the weighting factors for the characteristics may be the same. When there is a match in one of the characteristics (for example, entity name, $s_{\text{name}}$), the weighting factor (for example, $w_{\text{name}}$) for that characteristic is increased.
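The frequency-based adjustment of block 66 can be sketched as a small update applied after each comparison. The step size of 0.05, the cap of 0.99 (keeping weights below 1) and the function name `reinforce` are assumptions; the description states only that the weight of a matching characteristic is increased.

```python
def reinforce(weights: dict, matches: dict, step: float = 0.05, cap: float = 0.99) -> dict:
    """Return new weights with every matched characteristic nudged upwards."""
    return {
        c: (min(w + step, cap) if matches.get(c) else w)
        for c, w in weights.items()
    }


start = {"name": 0.25, "entity_description": 0.25,
         "attribute_description": 0.25, "structure": 0.25}
after = reinforce(start, {"name": 1, "structure": 0})
```

Calling `reinforce` once per comparison means that a characteristic that matches often accumulates a larger weight, which is one way to realise a dependence on matching frequency.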
[0052] At block 68, the controller 12 may receive a user input signal indicative of whether a user confirms or cancels a created probabilistic link between the first entity and the second entity. For example, the controller 12 may control the output apparatus 16 to display the multiplex graph 28 including the newly created probabilistic link. The user may operate the input apparatus 14 to either confirm the newly created probabilistic link or cancel the newly created probabilistic link.
[0053] At block 70, the controller 12 adjusts at least one of the plurality of weighting factors indicating a match between the first entity and the second entity, using the user input signal received in block 68. For example, where the user confirms a probabilistic link between the host entities 32 in the first graph 30 and the host entities 32 in the second graph 36, and the characteristics entity name and entity attributes have been matched, the controller 12 increases the weighting factors for the characteristics: entity name and entity attributes. By way of another example, where the user cancels a probabilistic link between the host entities 32 in the first graph 30 and the host entities 32 in the second graph 36, and the characteristics entity name and entity attributes have been matched, the controller 12 decreases the weighting factors for the characteristics: entity name and entity attributes. In some examples, the probability of the remaining probabilistic links may be recalculated using the adjusted weighting factors.
[0054] In some examples, where the user has confirmed a number of probabilistic relationships between the same type of entities, the controller 12 automatically proceeds to create probabilistic links between the remaining entities. For example, if there are one million multiplex links between datasets, the user may confirm a subset of those links (ten for example) and the controller 12 then automatically creates the remaining probabilistic links, given a target probable error (which may be set by the user using the input apparatus 14 for the automated creation).
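The feedback step of block 70, followed by recalculation of the remaining links, can be sketched as below. The step size, the bounds, the function names and the example entity labels are all assumptions made for illustration.

```python
def apply_feedback(weights, matches, confirmed, step=0.05, floor=0.0, cap=0.99):
    """Raise (confirm) or lower (cancel) the weights of the matched characteristics."""
    delta = step if confirmed else -step
    return {
        c: (min(max(w + delta, floor), cap) if matches.get(c) else w)
        for c, w in weights.items()
    }


def recalculate(weights, pending):
    """Recompute the probability of each remaining link with the adjusted weights."""
    return {
        pair: sum(weights[c] * s for c, s in result.items())
        for pair, result in pending.items()
    }


weights = {"name": 0.25, "attributes": 0.25, "structure": 0.25}
matched = {"name": 1, "attributes": 1, "structure": 0}

# The user cancels a link whose name and attributes had matched.
after_cancel = apply_feedback(weights, matched, confirmed=False)

# A remaining link that matched on name and structure is then re-scored.
pending = {("host-7", "HOST-7"): {"name": 1, "attributes": 0, "structure": 1}}
updated = recalculate(after_cancel, pending)
```

In this example the cancelled link lowers the name and attribute weights to 0.20, so the remaining link's probability falls from 0.50 to 0.45.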
[0055] The methods illustrated in Fig. 3 and described in the preceding paragraphs may provide an advantage in that they enable the creation of multiplex graphs including non-probabilistic links and probabilistic links between entities in a plurality of datasets. Furthermore, the weighting of different characteristics may be dynamically adjusted in accordance with feedback (either from successful matches or from a user) to improve the accuracy of the probabilistic links, and to improve the decision as to whether a probabilistic link should be created (where probabilistic links are created when the determined probability is greater than the threshold probability value). Additionally, the methods illustrated in Fig. 3 may be performed automatically by the controller 12 and without user intervention.
[0056] The blocks illustrated in Fig. 3 may represent steps in a method and/or sections of code in the computer program 22. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks may be varied in some examples. Furthermore, it may be possible for some blocks to be omitted in some examples. For example, Fig. 4 illustrates a flow diagram of another method according to an example. The flow diagram includes blocks 56, 58, 60 and 64 illustrated in Fig. 3 and omits blocks 62, 66, 68 and 70 illustrated in Fig. 3.
[0057] Although examples have been described in the preceding paragraphs, it should be appreciated that modifications to the examples given can be made without departing from the scope as claimed.
[0058] Although the processor 18 is illustrated as a single component it may be implemented as one or more separate components some or all of which may be integrated/removable and/or may provide permanent/semipermanent/dynamic/cached storage.
[0059] Although the memory 20 is illustrated as a single component it may be implemented as one or more separate components some or all of which may be integrated/removable and/or may provide permanent/semipermanent/dynamic/cached storage.
[0060] Fig. 5 illustrates a schematic diagram of a controller 12 according to an example. The controller 12 includes a probabilistic link creation module 72, a first dataset 74, a second dataset 76, and a user input module 78. The probabilistic link creation module 72 is to perform blocks 56, 58, 60, 62, 64, 66 and 70. The user input module 78 is to perform block 68. The probabilistic link creation module 72 and the user input module 78 may be software modules that are executable by the controller 12.
[0061] References to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
[0062] Features described in the preceding description may be used in combinations other than the combinations explicitly described.
[0063] Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
[0064] Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
[0065] Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of particular importance it should be understood that the Applicant claims protection in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not particular emphasis has been placed thereon.
[0066] What is claimed is:

Claims

1. A method to determine relationships between entities in a plurality of datasets, the method comprising:
selecting a first entity from a first dataset, the first dataset comprising at least the first entity;
selecting a second entity from a second dataset, the second dataset comprising at least the second entity;
determining a probability that the first entity and the second entity are related by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity; and
creating a probabilistic link between the first entity of the first dataset and the second entity of the second dataset using the determined probability.
2. The method as claimed in claim 1, further comprising: determining whether the probability exceeds a threshold probability value; and creating a probabilistic link between the first entity of the first dataset and the second entity of the second dataset where it is determined that the probability exceeds the threshold probability value.
3. The method as claimed in claim 2, wherein the threshold probability value is set by a user.
4. The method as claimed in claim 1, wherein the probability is determined from matches between the plurality of characteristics of the first entity and the plurality of characteristics of the second entity and a plurality of weighting factors for the plurality of characteristics.
5. The method as claimed in claim 4, further comprising dynamically adjusting a weighting factor associated with a first characteristic of the plurality of characteristics in dependence on the frequency of matching the first characteristic.
6. The method as claimed in claim 4, further comprising receiving a user input signal indicative of whether a user confirms or cancels a created probabilistic link between the first entity and the second entity; and adjusting at least one of the plurality of weighting factors indicating a match between the first entity and the second entity, using the received user input signal.
7. An apparatus to determine relationships between entities in a plurality of datasets, the apparatus comprising:
a controller to:
select a first entity from a first dataset, the first dataset comprising at least the first entity;
select a second entity from a second dataset, the second dataset comprising at least the second entity;
determine a probability that the first entity and the second entity are related by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity,
wherein the probability is determined from matches between the plurality of characteristics of the first entity and the plurality of characteristics of the second entity and a plurality of weighting factors for the plurality of characteristics.
8. The apparatus as claimed in claim 7, wherein the controller is to create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset.
9. The apparatus as claimed in claim 7, wherein the controller is to determine whether the probability exceeds a threshold probability value; and to create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset where it is determined that the probability exceeds the threshold probability value.
10. The apparatus as claimed in claim 9, wherein the threshold probability value is set by a user.
11. The apparatus as claimed in claim 7, wherein the controller is to dynamically adjust a weighting associated with a first characteristic of the plurality of characteristics in dependence on the frequency of matching the first characteristic.
12. The apparatus as claimed in claim 7, wherein the controller is to receive a user input signal indicative of whether a user confirms or cancels a created link between the first entity and the second entity; and adjusting at least one of the plurality of weighting factors indicating a match between the first entity and the second entity, using the received user input signal.
13. A non-transitory computer-readable storage medium encoded with instructions to determine relationships between entities in a plurality of datasets that, when performed by a processor, cause performance of:
select a first entity from a first dataset, the first dataset comprising at least the first entity;
select a second entity from a second dataset, the second dataset comprising at least the second entity;
determine a probability that the first entity and the second entity are related by comparing a plurality of characteristics of the first entity with a plurality of characteristics of the second entity; and
create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset using the determined probability.
14. The non-transitory computer-readable storage medium as claimed in claim 13 encoded with instructions that, when performed by a processor, cause performance of: determine whether the probability exceeds a threshold probability value; and create a probabilistic link between the first entity of the first dataset and the second entity of the second dataset where it is determined that the probability exceeds the threshold probability value.
15. The non-transitory computer-readable storage medium as claimed in claim 13, wherein the probability is determined from matches between the plurality of characteristics of the first entity and the plurality of characteristics of the second entity and a plurality of weighting factors for the plurality of characteristics.
PCT/EP2014/058525 2014-04-25 2014-04-25 Determine relationships between entities in datasets WO2015161899A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2014/058525 WO2015161899A1 (en) 2014-04-25 2014-04-25 Determine relationships between entities in datasets

Publications (1)

Publication Number Publication Date
WO2015161899A1 true WO2015161899A1 (en) 2015-10-29

Family

ID=50628806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/058525 WO2015161899A1 (en) 2014-04-25 2014-04-25 Determine relationships between entities in datasets

Country Status (1)

Country Link
WO (1) WO2015161899A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394896B2 (en) 2016-11-28 2019-08-27 International Business Machines Corporation Identifying relationships of interest of entities
US20210256075A1 (en) * 2017-10-06 2021-08-19 Realpage, Inc. Concept networks and systems and methods for the creation, update and use of same in artificial intelligence systems

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998055947A1 (en) * 1997-06-06 1998-12-10 Madison Information Technologies, Inc. System and method for indexing information about entities from different information sources
US20090271359A1 (en) * 2008-04-24 2009-10-29 Lexisnexis Risk & Information Analytics Group Inc. Statistical record linkage calibration for reflexive and symmetric distance measures at the field and field value levels without the need for human interaction
US7792864B1 (en) * 2006-06-14 2010-09-07 TransUnion Teledata, L.L.C. Entity identification and/or association using multiple data elements
US7912842B1 (en) * 2003-02-04 2011-03-22 Lexisnexis Risk Data Management Inc. Method and system for processing and linking data records

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394896B2 (en) 2016-11-28 2019-08-27 International Business Machines Corporation Identifying relationships of interest of entities
US10394895B2 (en) 2016-11-28 2019-08-27 International Business Machines Corporation Identifying relationships of interest of entities
US11074298B2 (en) 2016-11-28 2021-07-27 International Business Machines Corporation Identifying relationships of interest of entities
US11074299B2 (en) 2016-11-28 2021-07-27 International Business Machines Corporation Identifying relationships of interest of entities
US20210256075A1 (en) * 2017-10-06 2021-08-19 Realpage, Inc. Concept networks and systems and methods for the creation, update and use of same in artificial intelligence systems
US11768893B2 (en) * 2017-10-06 2023-09-26 Realpage, Inc. Concept networks and systems and methods for the creation, update and use of same in artificial intelligence systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14720576

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14720576

Country of ref document: EP

Kind code of ref document: A1