WO2021082100A1 - Entity alignment method, apparatus, device, and storage medium for knowledge graphs - Google Patents

Entity alignment method, apparatus, device, and storage medium for knowledge graphs

Info

Publication number
WO2021082100A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
pair
graph
fused
knowledge
Prior art date
Application number
PCT/CN2019/119307
Other languages
English (en)
French (fr)
Inventor
凌岚
刘嘉伟
于修铭
汪伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021082100A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Definitions

  • This application relates to the field of knowledge relationship analysis, and in particular to methods, devices, equipment and storage media for entity alignment of knowledge graphs.
  • Current entity alignment methods for knowledge graphs focus on modeling entity pairs across different knowledge graphs and studying the probability distribution of entities, but this approach is limited to small-scale data and relies on prior distributions. Entity alignment based on a greedy algorithm has also been proposed, but it cannot distinguish entities that share a name yet differ in meaning. The inventors recognized that entity alignment is therefore inefficient, which hinders the fusion of multiple knowledge graphs.
  • This application provides an entity alignment method for knowledge graphs in which the similarity value between entities is calculated by the PageRank (PR) algorithm and entity alignment is performed based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • A first aspect of the embodiments of the present application provides an entity alignment method for knowledge graphs, including: acquiring the entities of two knowledge graphs to be fused; generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused; generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned; calculating, based on the pairing graph, the similarity value of each entity pair to be aligned by the PR algorithm; and performing entity alignment according to the similarity values of the entity pairs to be aligned.
  • A second aspect of the embodiments of the present application provides an entity alignment apparatus for knowledge graphs, including: an acquiring unit, configured to acquire the entities of two knowledge graphs to be fused; a first generating unit, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused; a second generating unit, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned; a calculation unit, configured to calculate, based on the pairing graph, the similarity value of each entity pair to be aligned by the PR algorithm; and an alignment unit, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
  • A third aspect of the embodiments of the present application provides an entity alignment device for knowledge graphs, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the entity alignment method for knowledge graphs described in any of the foregoing embodiments.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the entity alignment method for knowledge graphs described in any of the foregoing embodiments.
  • In the embodiments of this application, the entities of two knowledge graphs to be fused are acquired; entity pairs to be aligned are generated from those entities; a pairing graph of the two knowledge graphs to be fused is generated based on the entity pairs to be aligned; the similarity value of each entity pair to be aligned is calculated by the PR algorithm based on the pairing graph; and entity alignment is performed according to the similarity values. Calculating similarity values with the PR algorithm and aligning entities on that basis improves both the efficiency and the accuracy of entity alignment.
  • FIG. 1 is a schematic diagram of an embodiment of the entity alignment method for knowledge graphs in an embodiment of this application;
  • FIG. 2 is a schematic diagram of another embodiment of the entity alignment method for knowledge graphs in an embodiment of this application;
  • FIG. 3 is a schematic diagram of an embodiment of the entity alignment apparatus for knowledge graphs in an embodiment of this application;
  • FIG. 4 is a schematic diagram of another embodiment of the entity alignment apparatus for knowledge graphs in an embodiment of this application;
  • FIG. 5 is a schematic diagram of an embodiment of the entity alignment device for knowledge graphs in an embodiment of this application.
  • The embodiments of the present application provide an entity alignment method, apparatus, device, and storage medium for knowledge graphs, in which the similarity value between entities is calculated by the PR algorithm and entity alignment is performed based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • Referring to FIG. 1, a flowchart of an entity alignment method for knowledge graphs provided by an embodiment of the present application, the method specifically includes:
  • The server obtains the entities of two knowledge graphs to be fused; these may be any two knowledge graphs that are to be merged.
  • Building a complete knowledge graph requires merging multiple knowledge graphs and grouping together the entities that have the same meaning.
  • Across knowledge graphs, the same entity name may carry different meanings: the entity "apple", for example, denotes a fruit in knowledge graph A and an electronic product in knowledge graph B. Conversely, different entity names may carry the same meaning: the entity "Elvis" in knowledge graph C and the entity "Elvis Presley" in knowledge graph D refer to the same thing.
  • The server generates entity pairs to be aligned from the entities of the two knowledge graphs to be fused. After converting and preprocessing the entities of the two knowledge graphs, the server calculates the initial similarity of all entity pairs and sets each pair whose initial similarity exceeds a preset threshold as an entity pair to be aligned.
  • Specifically, the entity pairs to be aligned are generated as follows: the server converts the entities of the two knowledge graphs to be fused into entity vectors using a bag-of-words model; the server preprocesses the entity vectors; the server calculates the initial similarity of each entity vector pair, where an entity vector pair is any two entity vectors; and the server sets each entity vector pair whose initial similarity exceeds a preset threshold as an entity pair to be aligned.
  • The server preprocesses the entity vectors by retaining the word vectors with high inverse document frequency (IDF) values and deleting those with low IDF values. Specifically, the server calculates the IDF value of every word vector in the two knowledge graphs to be fused and deletes from each entity vector the word vectors whose IDF value is lower than a preset IDF value.
  • The IDF value measures how common, and therefore how informative, a word is. A low IDF value means that the word appears frequently.
  • Within an entity vector, such words usually do not help to distinguish one entity from another; the word vectors of stop words are typical examples.
  • Deleting the word vectors with low IDF values helps improve the efficiency of entity vector matching.
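The IDF-based preprocessing can be sketched as follows. This is a minimal illustration, assuming each entity is represented by the words of its name and using the common definition idf(w) = log(N / df(w)); the function names `idf_table` and `drop_low_idf`, the sample names, and the cutoff `min_idf` are hypothetical and are not taken from the application.

```python
import math
from collections import Counter


def idf_table(entity_names):
    """Return idf(w) = log(N / df(w)) for every word in the entity names."""
    n_docs = len(entity_names)
    doc_freq = Counter()
    for name in entity_names:
        # Count each word once per entity name (document frequency).
        doc_freq.update(set(name.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def drop_low_idf(entity_names, min_idf):
    """Delete the words whose IDF value is lower than the preset value."""
    idf = idf_table(entity_names)
    return [
        [word for word in name.lower().split() if idf[word] >= min_idf]
        for name in entity_names
    ]


# Hypothetical entity names; "the" occurs in three of four names.
names = ["the king of rock", "the apple company", "the tang monk", "elvis presley"]
filtered = drop_low_idf(names, min_idf=0.5)
print(filtered)
```

With these sample names, "the" has an IDF of log(4/3) and falls below the assumed cutoff of 0.5, so it is removed, while words that occur in a single name are kept.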
  • After preprocessing, the server calculates the initial similarity of each entity vector pair, where an entity vector pair is any two entity vectors. Specifically, after determining the word vectors shared by the target entity vector pair, the server calculates the initial similarity of the target entity vector pair using a preset initial similarity calculation formula.
  • The preset initial similarity calculation formula is defined in terms of the following quantities: $e_1$ and $e_2$ denote the target entity vector pair, $w$ denotes a word vector shared by the target entity vector pair, $\mathrm{idf}_1(w)$ denotes the IDF value of $w$ in $e_1$, and $\mathrm{idf}_2(w)$ denotes the IDF value of $w$ in $e_2$.
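Candidate generation by thresholding the initial similarity can be sketched as follows. The formula itself is not reproduced in this text, so the scoring function below is an assumption: an IDF-weighted cosine over the words the two entities share. The names `initial_similarity` and `candidate_pairs`, the sample entities, the IDF values, and the threshold are all hypothetical.

```python
import math


def initial_similarity(words1, words2, idf1, idf2):
    """Assumed score: IDF-weighted cosine over the shared words."""
    shared = set(words1) & set(words2)
    dot = sum(idf1[w] * idf2[w] for w in shared)
    norm1 = math.sqrt(sum(idf1[w] ** 2 for w in set(words1)))
    norm2 = math.sqrt(sum(idf2[w] ** 2 for w in set(words2)))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def candidate_pairs(entities1, entities2, idf1, idf2, threshold):
    """Keep every cross-graph pair whose initial similarity exceeds the threshold."""
    pairs = []
    for name1, words1 in entities1.items():
        for name2, words2 in entities2.items():
            score = initial_similarity(words1, words2, idf1, idf2)
            if score > threshold:
                pairs.append((name1, name2, score))
    return pairs


# Hypothetical bag-of-words entities from two knowledge graphs.
entities1 = {"Elvis": ["elvis"], "Apple": ["apple"]}
entities2 = {"Elvis Presley": ["elvis", "presley"], "Apple Inc": ["apple", "inc"]}
idf1 = {"elvis": 1.0, "apple": 1.0}
idf2 = {"elvis": 1.0, "presley": 1.0, "apple": 1.0, "inc": 1.0}

pairs = candidate_pairs(entities1, entities2, idf1, idf2, threshold=0.5)
print(pairs)
```

Any symmetric score built from the shared words and their IDF values could be substituted for `initial_similarity`; only the thresholding step is taken directly from the description.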
  • The server generates a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned.
  • The pairing graph contains the nodes of the entity pairs to be aligned, further nodes generated from the triple data in the two knowledge graphs to be fused, and the undirected edges that connect the nodes.
  • That is, the nodes of the pairing graph are the entity-pair nodes to be aligned together with the entity-pair nodes related to them.
  • For example, suppose "Tang Monk" and "Tang San Zang" form an entity pair to be aligned, where "Tang Monk", with entity vector a, is an entity in the first knowledge graph to be fused and "Tang San Zang", with entity vector b, is an entity in the second knowledge graph to be fused. The first knowledge graph to be fused contains the triple (Tang Monk, is, Monk) and the second contains the triple (Tang San Zang, is, Master); the entity vector of "Monk" is c and the entity vector of "Master" is d.
  • Based on the pairing graph, the server calculates the similarity value of each entity pair to be aligned using the PageRank (PR) algorithm.
  • The PR algorithm computes the stationary distribution over nodes based on random walks: an initial similarity value is calculated for each node in the pairing graph and then propagated along the undirected edges to obtain the similarity value between two nodes.
  • For a source node u and a target node v, the PR algorithm yields the personalized PageRank (PPR) value PPR(v ← u) from the source node u to the target node v.
  • The PPR value is the stationary probability that a random walk starting from one node of a graph arrives at another node. From the PPR value and the initial similarity value, the similarity value between the source node u and the target node v can be obtained.
  • The greater the PPR value, the greater the probability of walking from the source node u to the target node v.
  • Because the PR algorithm takes into account the neighbor entity pairs of the entity pair to be aligned when judging whether the entities are similar, it improves the accuracy with which the server determines similarity.
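A PPR value on an undirected pairing graph can be approximated by power iteration, as in the following sketch. It is illustrative only: the damping factor of 0.85, the iteration count, the function name `personalized_pagerank`, and the three-node sample graph are assumptions, and the application does not state how the PPR values are actually computed.

```python
def personalized_pagerank(adjacency, source, damping=0.85, iterations=100):
    """Approximate the PPR distribution of a random walk restarting at `source`.

    `adjacency` maps each node to the list of its neighbors; every undirected
    edge must be listed in both directions.
    """
    nodes = list(adjacency)
    rank = {node: (1.0 if node == source else 0.0) for node in nodes}
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                # Isolated node: send its mass back to the source.
                nxt[source] += damping * rank[node]
                continue
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        # With probability 1 - damping the walk restarts at the source.
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


# Hypothetical pairing graph: (a, b) is linked to (c, d) and to (e, f).
ab, cd, ef = ("a", "b"), ("c", "d"), ("e", "f")
graph = {ab: [cd, ef], cd: [ab], ef: [ab]}

rank = personalized_pagerank(graph, source=cd)
print(rank[ab], rank[cd], rank[ef])
```

Here `rank[ab]` plays the role of the PPR value from the source node (c, d) to the node (a, b); the values over all nodes sum to 1.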
  • The server performs entity alignment according to the similarity values of the entity pairs to be aligned. Specifically, the server determines whether the similarity value of each entity pair to be aligned exceeds a preset similarity value. If the similarity value of a target entity pair to be aligned, which may be any of the entity pairs to be aligned, exceeds the preset similarity value, the server marks the target entity pair as a pair of entities that express the same meaning; if it does not, the server determines that the entities of the target pair have different meanings.
  • When the server determines that the two entities of an entity pair to be aligned have the same meaning, it marks or associates the entity vectors of the two entities so that they are identified as entities with the same meaning in the fused knowledge graph.
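The decision step reduces to a comparison against the preset similarity value, as in this small sketch. The function name `align_entities`, the sample similarity values, and the cutoff of 0.5 are hypothetical.

```python
def align_entities(similarity, preset_value):
    """Split entity pairs into aligned pairs and pairs judged to differ."""
    aligned, different = [], []
    for pair, value in similarity.items():
        if value > preset_value:
            # Similarity exceeds the preset value: same meaning.
            aligned.append(pair)
        else:
            different.append(pair)
    return aligned, different


# Hypothetical similarity values for two entity pairs to be aligned.
similarity = {("Elvis", "Elvis Presley"): 0.82, ("apple", "Apple"): 0.10}
aligned, different = align_entities(similarity, preset_value=0.5)
print(aligned, different)
```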
  • In this embodiment of the application, the similarity value between entities is calculated by the PR algorithm and entity alignment is performed based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • Referring to FIG. 2, another embodiment of the entity alignment method for knowledge graphs in the embodiments of the present application includes:
  • The server obtains the entities of two knowledge graphs to be fused; these may be any two knowledge graphs that are to be merged.
  • Building a complete knowledge graph requires merging multiple knowledge graphs and grouping together the entities that have the same meaning.
  • Across knowledge graphs, the same entity name may carry different meanings: the entity "apple", for example, denotes a fruit in knowledge graph A and an electronic product in knowledge graph B. Conversely, different entity names may carry the same meaning: the entity "Elvis" in knowledge graph C and the entity "Elvis Presley" in knowledge graph D refer to the same thing.
  • The server generates entity pairs to be aligned from the entities of the two knowledge graphs to be fused. After converting and preprocessing the entities of the two knowledge graphs, the server calculates the initial similarity of all entity pairs and sets each pair whose initial similarity exceeds a preset threshold as an entity pair to be aligned.
  • Specifically, the entity pairs to be aligned are generated as follows: the server converts the entities of the two knowledge graphs to be fused into entity vectors using a bag-of-words model; the server preprocesses the entity vectors; the server calculates the initial similarity of each entity vector pair, where an entity vector pair is any two entity vectors; and the server sets each entity vector pair whose initial similarity exceeds a preset threshold as an entity pair to be aligned.
  • The server preprocesses the entity vectors by retaining the word vectors with high inverse document frequency (IDF) values and deleting those with low IDF values. Specifically, the server calculates the IDF value of every word vector in the two knowledge graphs to be fused and deletes from each entity vector the word vectors whose IDF value is lower than a preset IDF value.
  • The IDF value measures how common, and therefore how informative, a word is. A low IDF value means that the word appears frequently.
  • Within an entity vector, such words usually do not help to distinguish one entity from another; the word vectors of stop words are typical examples.
  • Deleting the word vectors with low IDF values helps improve the efficiency of entity vector matching.
  • After preprocessing, the server calculates the initial similarity of each entity vector pair, where an entity vector pair is any two entity vectors. Specifically, after determining the word vectors shared by the target entity vector pair, the server calculates the initial similarity of the target entity vector pair using a preset initial similarity calculation formula.
  • The preset initial similarity calculation formula is defined in terms of the following quantities: $e_1$ and $e_2$ denote the target entity vector pair, $w$ denotes a word vector shared by the target entity vector pair, $\mathrm{idf}_1(w)$ denotes the IDF value of $w$ in $e_1$, and $\mathrm{idf}_2(w)$ denotes the IDF value of $w$ in $e_2$.
  • The server obtains the graphs $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ of the two knowledge graphs to be fused, where $V_1$ denotes all entities in the first knowledge graph to be fused, $E_1$ denotes all triples of the first knowledge graph to be fused, $V_2$ denotes all entities in the second knowledge graph to be fused, and $E_2$ denotes all triples of the second knowledge graph to be fused.
  • The server creates a pairing graph $G_3$ of the two knowledge graphs to be fused and adds to it an initial node $(a_s, b_s)$ representing a target entity pair to be aligned, where the target entity pair $a_s$ and $b_s$ is any one of the entity pairs to be aligned.
  • For example, suppose "Tang Monk" and "Tang San Zang" form an entity pair to be aligned, where "Tang Monk", with entity vector a, is an entity in the first knowledge graph to be fused and "Tang San Zang", with entity vector b, is an entity in the second knowledge graph to be fused. Then, in the pairing graph $G_3$, the entity pair to be aligned is represented by the node (a, b), and (a, b) is the initial node.
  • Based on the initial node $(a_s, b_s)$, the graph $G_1(V_1, E_1)$, and the graph $G_2(V_2, E_2)$, the server adds further nodes to the pairing graph $G_3$, along with the undirected edges connecting the nodes.
  • The server adds the other nodes and the undirected edges as follows. When a triple $(a_s, p, a_n) \in E_1$ and a triple $(b_s, p, b_n) \in E_2$ exist with $a_n \in V_1$, $b_n \in V_2$, and $a_n$ and $b_n$ of the same entity type, the server adds the node $(a_n, b_n)$ to the pairing graph $G_3$ together with an undirected edge connecting $(a_s, b_s)$ and $(a_n, b_n)$; $(a_n, b_n)$ is then a neighbor entity-pair node of $(a_s, b_s)$. Likewise, if a triple $(a_n, p, c_n) \in E_1$ and a triple $(b_n, p, d_n) \in E_2$ exist with $c_n \in V_1$, $d_n \in V_2$, and $c_n$ and $d_n$ of the same entity type, the server adds the node $(c_n, d_n)$ to $G_3$ together with an undirected edge connecting $(a_n, b_n)$ and $(c_n, d_n)$. Nodes and edges are added repeatedly in this way until all triples in $E_1$ and $E_2$ have been traversed.
  • The relation p in the triples of this example may be any relation in the first knowledge graph to be fused and/or the second knowledge graph to be fused.
  • For example, suppose "Tang Monk" and "Tang San Zang" form an entity pair to be aligned, where "Tang Monk", with entity vector a, is an entity in the first knowledge graph to be fused and "Tang San Zang", with entity vector b, is an entity in the second knowledge graph to be fused. The first knowledge graph to be fused contains the triple (Tang Monk, is, Monk) and the second contains the triple (Tang San Zang, is, Master); the entity vector of "Monk" is c and the entity vector of "Master" is d. Because "Monk" and "Master" have the same entity type, both denoting an identity, the pairing graph $G_3$ contains the initial node (a, b) representing the entity pair to be aligned and its neighbor entity-pair node (c, d), and the node (a, b) and the node (c, d) are connected by an undirected edge.
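The construction of the pairing graph can be sketched as a breadth-first expansion from the initial node. This is a minimal illustration, assuming triples are (head, relation, tail) tuples and that a dictionary `entity_type` gives the type of every tail entity; the function name `build_pairing_graph` and the data layout are hypothetical.

```python
from collections import deque


def build_pairing_graph(initial_pair, triples1, triples2, entity_type):
    """Expand the pairing graph from the initial node (a_s, b_s)."""
    # Index the outgoing (relation, tail) pairs of each head entity.
    out1, out2 = {}, {}
    for head, relation, tail in triples1:
        out1.setdefault(head, []).append((relation, tail))
    for head, relation, tail in triples2:
        out2.setdefault(head, []).append((relation, tail))

    nodes = {initial_pair}
    edges = set()  # each undirected edge is a frozenset of two nodes
    queue = deque([initial_pair])
    while queue:
        a, b = queue.popleft()
        for relation1, a_n in out1.get(a, []):
            for relation2, b_n in out2.get(b, []):
                # Same relation p and same entity type are both required.
                if relation1 != relation2:
                    continue
                if entity_type[a_n] != entity_type[b_n]:
                    continue
                neighbor = (a_n, b_n)
                if neighbor == (a, b):
                    continue
                edges.add(frozenset({(a, b), neighbor}))
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    queue.append(neighbor)
    return nodes, edges


triples1 = [("Tang Monk", "is", "Monk")]
triples2 = [("Tang San Zang", "is", "Master")]
entity_type = {"Monk": "identity", "Master": "identity"}

nodes, edges = build_pairing_graph(
    ("Tang Monk", "Tang San Zang"), triples1, triples2, entity_type
)
print(nodes, edges)
```

For the Tang Monk example, the result has two nodes, the initial node and the neighbor entity-pair node (Monk, Master), joined by one undirected edge.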
  • Conflicting neighbor entity-pair nodes are multiple neighbor entity-pair nodes that contain the same entity. For example, if entity p1 appears in both of the neighbor entity-pair nodes (p1, p3) and (p1, p4), these two nodes are regarded as conflicting neighbor entity-pair nodes. Among conflicting neighbor entity-pair nodes, the one with the largest initial similarity is retained.
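Conflict resolution can be sketched as a greedy pass that keeps, for each entity, only the neighbor entity-pair node with the largest initial similarity. The greedy one-to-one strategy, the function name `resolve_conflicts`, and the sample scores are assumptions made for illustration.

```python
def resolve_conflicts(neighbor_pairs, initial_similarity):
    """Keep the highest-scoring pair for each entity, dropping conflicting pairs."""
    kept = []
    used_left, used_right = set(), set()
    # Visit pairs from the largest initial similarity to the smallest.
    ordered = sorted(neighbor_pairs, key=lambda p: initial_similarity[p], reverse=True)
    for left, right in ordered:
        if left in used_left or right in used_right:
            continue  # conflicts with a pair that has already been retained
        kept.append((left, right))
        used_left.add(left)
        used_right.add(right)
    return kept


# Hypothetical neighbor entity-pair nodes; p1 and p4 each appear twice.
pairs = [("p1", "p3"), ("p1", "p4"), ("p2", "p4")]
scores = {("p1", "p3"): 0.9, ("p1", "p4"): 0.4, ("p2", "p4"): 0.7}

kept = resolve_conflicts(pairs, scores)
print(kept)
```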
  • The server calculates the similarity value of each entity pair to be aligned using the PR algorithm.
  • The PR algorithm computes the stationary distribution over nodes based on random walks: an initial similarity value is calculated for each node in the pairing graph and then propagated along the undirected edges to obtain the similarity value between two nodes.
  • For a source node u and a target node v, the PR algorithm yields the personalized PageRank (PPR) value PPR(v ← u) from the source node u to the target node v.
  • The PPR value is the stationary probability that a random walk starting from one node of a graph arrives at another node. From the PPR value and the initial similarity value, the similarity value between the source node u and the target node v can be obtained.
  • The greater the PPR value, the greater the probability of walking from the source node u to the target node v.
  • Because the PR algorithm takes into account the neighbor entity pairs of the entity pair to be aligned when judging whether the entities are similar, it improves the accuracy with which the server determines similarity.
  • Specifically, the server calculates the similarity value as follows. The server first determines the shared original words $w_n$ of the two knowledge graphs to be fused, a shared original word being a word that has already been determined to have the same meaning in both knowledge graphs, and then determines the source nodes associated with each shared original word $w_n$. Calculating the similarity value of the entity pair to be aligned starting from these source nodes helps improve its accuracy. The server then calculates the similarity value of the entity pair to be aligned using a preset similarity value calculation formula.
  • The preset similarity value calculation formula is defined in terms of the following quantities: $e_3$ and $e_4$ denote an entity pair to be aligned; $\deg(w_n)$ denotes the degree of the shared original word $w_n$, that is, the number of neighbor entity-pair nodes that share $w_n$; and $\mathrm{PPR}(W_n^i \to \langle e_3, e_4 \rangle)$ denotes the PPR value from the node $W_n^i$ to the node $\langle e_3, e_4 \rangle$ in the pairing graph $G_3$.
  • The server takes the sum of the PPR values from each source node in the pairing graph to the node of the entity pair to be aligned as the similarity value of that entity pair. This incorporates the triple data related to the entity pair, which further improves the credibility of the similarity value and therefore the accuracy of entity alignment.
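The summation over source nodes can be sketched as follows, assuming the PPR values have already been computed and stored in a nested dictionary `ppr[source][target]`. The formula is not reproduced in this text, so any weighting by deg(w_n) is left out; the function name `similarity_from_sources` and the sample values are hypothetical.

```python
def similarity_from_sources(ppr, source_nodes, target):
    """Sum the PPR values from every source node to the target node."""
    return sum(ppr[source].get(target, 0.0) for source in source_nodes)


# Hypothetical precomputed PPR values towards the node <e3, e4>.
target = ("e3", "e4")
ppr = {
    "s1": {target: 0.25},
    "s2": {target: 0.5},
    "s3": {},  # this source node never reaches the target
}

value = similarity_from_sources(ppr, ["s1", "s2", "s3"], target)
print(value)
```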
  • The server performs entity alignment according to the similarity values of the entity pairs to be aligned. Specifically, the server determines whether the similarity value of each entity pair to be aligned exceeds a preset similarity value. If the similarity value of a target entity pair to be aligned, which may be any of the entity pairs to be aligned, exceeds the preset similarity value, the server marks the target entity pair as a pair of entities that express the same meaning; if it does not, the server determines that the entities of the target pair have different meanings.
  • When the server determines that the two entities of an entity pair to be aligned have the same meaning, it marks or associates the entity vectors of the two entities so that they are identified as entities with the same meaning in the fused knowledge graph.
  • In this embodiment of the application, the similarity value between entities is calculated by the PR algorithm and entity alignment is performed based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • The entity alignment method for knowledge graphs in the embodiments of this application is described above; the entity alignment apparatus for knowledge graphs is described below. Referring to FIG. 3, an embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of this application includes:
  • an acquiring unit 301, configured to acquire the entities of two knowledge graphs to be fused;
  • a first generating unit 302, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
  • a second generating unit 303, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
  • a calculation unit 304, configured to calculate the similarity value of each entity pair to be aligned using the PageRank (PR) algorithm;
  • an alignment unit 305, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
  • This embodiment of the application calculates the similarity value between entities by the PR algorithm and performs entity alignment based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • Referring to FIG. 4, another embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of the present application includes:
  • an acquiring unit 301, configured to acquire the entities of two knowledge graphs to be fused;
  • a first generating unit 302, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
  • a second generating unit 303, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
  • a calculation unit 304, configured to calculate the similarity value of each entity pair to be aligned using the PageRank (PR) algorithm;
  • an alignment unit 305, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
  • The second generating unit 303 specifically includes:
  • an obtaining module 3031, configured to obtain the graphs $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ of the two knowledge graphs to be fused, where $V_1$ denotes all entities in the first knowledge graph to be fused, $E_1$ denotes all triples of the first knowledge graph to be fused, $V_2$ denotes all entities in the second knowledge graph to be fused, and $E_2$ denotes all triples of the second knowledge graph to be fused;
  • a creation module 3032, configured to create a pairing graph $G_3$ of the two knowledge graphs to be fused and to add to it an initial node $(a_s, b_s)$ representing a target entity pair to be aligned, where the target entity pair $a_s$ and $b_s$ is any one of the entity pairs to be aligned;
  • an adding module 3033, configured to add further nodes to the pairing graph $G_3$, along with the undirected edges connecting the nodes, based on the initial node $(a_s, b_s)$, the graph $G_1(V_1, E_1)$, and the graph $G_2(V_2, E_2)$.
  • The adding module 3033 is specifically configured to:
  • when the entity types of $i_n$ and $m_n$ are the same, repeatedly add the node $(i_n, m_n)$ to the pairing graph $G_3$ together with an undirected edge connecting $(h_n, l_n)$ and $(i_n, m_n)$, until all triples in $E_1$ and $E_2$ have been traversed; determine the conflicting neighbor entity-pair nodes of the initial node $(a_s, b_s)$, conflicting neighbor entity-pair nodes being multiple neighbor entity-pair nodes that contain one and the same entity; and, among the conflicting neighbor entity-pair nodes, retain the one with the largest initial similarity.
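  • The calculation unit 304 is specifically configured to calculate the similarity value of the entity pair to be aligned using the preset similarity value calculation formula, in which $e_3$ and $e_4$ denote an entity pair to be aligned; $\deg(w_n)$ denotes the degree of the shared original word $w_n$, that is, the number of neighbor entity-pair nodes that share $w_n$; and $\mathrm{PPR}(W_n^i \to \langle e_3, e_4 \rangle)$ denotes the PPR value from the node $W_n^i$ to the node $\langle e_3, e_4 \rangle$ in the pairing graph $G_3$.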
  • The first generating unit 302 specifically includes:
  • a conversion module 3021, configured to convert the entities of the two knowledge graphs to be fused into entity vectors using a bag-of-words model;
  • a processing module 3022, configured to preprocess the entity vectors;
  • a calculation module 3023, configured to calculate the initial similarity of each entity vector pair, where an entity vector pair is any two entity vectors;
  • a setting module 3024, configured to set each entity vector pair whose initial similarity exceeds a preset threshold as an entity pair to be aligned.
  • The processing module 3022 is specifically configured to calculate the IDF value of every word vector in the two knowledge graphs to be fused and to delete from each entity vector the word vectors whose IDF value is lower than a preset IDF value.
  • The calculation module 3023 is specifically configured to determine the word vectors shared by the target entity vector pair and to calculate the initial similarity of the target entity vector pair using the preset initial similarity calculation formula, in which $e_1$ and $e_2$ denote the target entity vector pair, $w$ denotes a word vector shared by the target entity vector pair, $\mathrm{idf}_1(w)$ denotes the IDF value of $w$ in $e_1$, and $\mathrm{idf}_2(w)$ denotes the IDF value of $w$ in $e_2$.
  • In this embodiment of the application, the similarity value between entities is calculated by the PR algorithm and entity alignment is performed based on that similarity value, which improves both the efficiency and the accuracy of entity alignment.
  • FIGS. 3 and 4 above describe the entity alignment apparatus for knowledge graphs in the embodiments of the present application in detail from the perspective of modular functional entities; the following describes the entity alignment device for knowledge graphs in detail from the perspective of hardware processing.
  • FIG. 5 is a schematic structural diagram of an entity alignment device for a knowledge graph provided by an embodiment of the present application.
  • The entity alignment device 500 for knowledge graphs may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 501 (for example, one or more processors), a memory 509, and one or more storage media 508 (for example, one or more mass storage devices) storing application programs 507 or data 506.
  • The memory 509 and the storage medium 508 may provide short-term or persistent storage.
  • The program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the entity alignment device for knowledge graphs.
  • The processor 501 may be configured to communicate with the storage medium 508 and to execute, on the entity alignment device 500, the series of instruction operations stored in the storage medium 508.
  • The entity alignment device 500 for knowledge graphs may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, for example Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
  • The processor 501 can perform the functions of the acquiring unit 301, the first generating unit 302, the second generating unit 303, the calculation unit 304, and the alignment unit 305 in the foregoing embodiments.
  • The processor 501 is the control center of the entity alignment device for knowledge graphs and can perform processing according to the entity alignment method for knowledge graphs that has been set.
  • The processor 501 connects the various parts of the entire entity alignment device through various interfaces and lines, and performs the various functions of the device and processes data by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, thereby realizing entity alignment of knowledge graphs.
  • The storage medium 508 and the memory 509 are both carriers for storing data.
  • The storage medium 508 may refer to internal storage with a small capacity but a fast speed, and the memory 509 may be external storage with a large capacity but a slow speed.
  • the memory 509 may be used to store software programs and modules.
  • the processor 501 executes various functional applications and data processing of the entity alignment device 500 of the knowledge graph by running the software programs and modules stored in the memory 509.
  • the memory 509 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (for example, an application program for calculating similarity values), etc.; the storage data area may store data created through use of the entity alignment device of the knowledge graph (such as the initial node), etc.
  • the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or twisted pair) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

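This application relates to the field of big data technology and discloses an entity alignment method, apparatus, device, and storage medium for knowledge graphs. Similarity values between entities are computed with the PageRank (PR) algorithm, and entity alignment is performed on the basis of those values, which improves both the efficiency and the accuracy of entity alignment. The method comprises: obtaining the entities of two knowledge graphs to be fused; generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused; generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned; computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PR algorithm; and performing entity alignment according to the similarity values of the entity pairs to be aligned.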

Description

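Entity alignment method, apparatus, device, and storage medium for knowledge graphs
This application claims priority to Chinese patent application No. 201911043138.9, entitled "Entity alignment method, apparatus, device, and storage medium for knowledge graphs", filed with the Chinese Patent Office on October 30, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of knowledge relationship analysis, and in particular to an entity alignment method, apparatus, device, and storage medium for knowledge graphs.
Background
Knowledge graph technology is increasingly becoming a foundation of artificial intelligence; it is an important way for machines to understand natural language and build knowledge networks. To build a complete legal knowledge system, knowledge from many fields must be fused, and entity alignment is an important step when fusing multiple knowledge graphs. The entity alignment problem arises because the same entity can appear in different knowledge graphs under different names, for example "唐僧" (Tang Seng) and "唐三藏" (Tang Sanzang), or because identically named entities in different knowledge graphs can mean different things. For example, the sentences "I like eating apples" and "Apple's founder is Steve Jobs" both contain "apple", but with different meanings.
Current entity alignment methods for knowledge graphs focus on modeling entity pairs across knowledge graphs and studying the probability distribution of entities, but they are limited to small-scale data and rely on prior distributions. An entity alignment method using a greedy algorithm has also been proposed, but it cannot distinguish entities that share a name yet differ in meaning. The inventors recognize that entity alignment is therefore inefficient, which hinders the fusion of multiple knowledge graphs.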
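Summary
This application provides an entity alignment method for knowledge graphs that computes similarity values between entities with the PR (PageRank) algorithm and performs entity alignment based on those values, improving both the efficiency and the accuracy of entity alignment.
A first aspect of the embodiments of this application provides an entity alignment method for knowledge graphs, comprising: obtaining the entities of two knowledge graphs to be fused; generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused; generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned; computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PR algorithm; and performing entity alignment according to the similarity values of the entity pairs to be aligned.
A second aspect of the embodiments of this application provides an entity alignment apparatus for knowledge graphs, comprising: an obtaining unit, configured to obtain the entities of two knowledge graphs to be fused; a first generating unit, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused; a second generating unit, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned; a computing unit, configured to compute, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PR algorithm; and an aligning unit, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
A third aspect of the embodiments of this application provides an entity alignment device for knowledge graphs, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the entity alignment method for knowledge graphs according to any of the above implementations.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the steps of the entity alignment method for knowledge graphs according to any of the above implementations.
In the technical solution provided by the embodiments of this application, the entities of two knowledge graphs to be fused are obtained; entity pairs to be aligned are generated from the entities of the two knowledge graphs to be fused; a pairing graph of the two knowledge graphs to be fused is generated based on the entity pairs to be aligned; the similarity values of the entity pairs to be aligned are computed with the PageRank (PR) algorithm based on the pairing graph; and entity alignment is performed according to those similarity values. By computing similarity values between entities with the PR algorithm and aligning entities based on those values, the embodiments of this application improve both the efficiency and the accuracy of entity alignment.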
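Brief Description of the Drawings
FIG. 1 is a schematic diagram of one embodiment of the entity alignment method for knowledge graphs in the embodiments of this application;
FIG. 2 is a schematic diagram of another embodiment of the entity alignment method for knowledge graphs in the embodiments of this application;
FIG. 3 is a schematic diagram of one embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of this application;
FIG. 4 is a schematic diagram of another embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of this application;
FIG. 5 is a schematic diagram of one embodiment of the entity alignment device for knowledge graphs in the embodiments of this application.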
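Detailed Description
The embodiments of this application provide an entity alignment method, apparatus, device, and storage medium for knowledge graphs. Similarity values between entities are computed with the PR algorithm, and entity alignment is performed based on those values, which improves both the efficiency and the accuracy of entity alignment.
To help those skilled in the art better understand the solution of this application, the embodiments of this application are described below with reference to the accompanying drawings.
The terms "first", "second", "third", "fourth", and the like (if any) in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data so designated may be interchanged where appropriate, so that the embodiments described here can be implemented in an order other than the one illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Referring to FIG. 1, a flowchart of an entity alignment method for knowledge graphs provided by an embodiment of this application, the method specifically includes: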
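101. Obtain the entities of two knowledge graphs to be fused.
The server obtains the entities of two knowledge graphs to be fused; the two knowledge graphs to be fused are any two knowledge graphs that are to be merged.
It should be noted that a complete knowledge graph is built by fusing multiple knowledge graphs and grouping entities that have the same meaning. The same entity name can have different meanings in different knowledge graphs: for example, the entity "apple" denotes a fruit in knowledge graph A and an electronics product in knowledge graph B. Conversely, different entity names in different knowledge graphs can have the same meaning: for example, the entity "猫王" (the King) in knowledge graph C and the entity "Elvis Presley" in knowledge graph D denote the same thing.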
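102. Generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused.
The server generates entity pairs to be aligned from the entities of the two knowledge graphs to be fused. After converting and preprocessing the entities of the two knowledge graphs to be fused, the server computes the initial similarity of every entity pair and sets the entity pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
Specifically, the entity pairs to be aligned are generated as follows: the server converts the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors; the server preprocesses the entity vectors; the server computes the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors; and the server sets the entity vector pairs whose initial similarity exceeds the preset threshold as entity pairs to be aligned.
In preprocessing, the server keeps the word vectors with high inverse document frequency (IDF) values in all entity vectors and deletes the words with low IDF values. Specifically, the server computes the IDF value of every word vector in the two knowledge graphs to be fused, and deletes from the entity vectors the word vectors whose IDF value is below a preset IDF value. The IDF value measures how common, and therefore how informative, a word is. A low IDF value means that the word occurs frequently; in an entity vector such a word usually does not help distinguish entities, as with the vectors of stop words such as "的" and "了". Deleting word vectors with low IDF values makes entity vector matching more efficient.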
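The IDF filtering step can be illustrated with a minimal sketch. It assumes entities are already tokenized into word lists and uses the common definition IDF(w) = log(N / df(w)); the source does not state which IDF variant is used, and the function names `idf_table` and `drop_low_idf`, the sample entities, and the cutoff `min_idf` are hypothetical.

```python
import math
from collections import Counter

def idf_table(entities):
    """IDF of each word over a list of tokenized entity names.

    Assumption: IDF(w) = log(N / df(w)), where N is the number of
    entities and df(w) the number of entities containing w.
    """
    n = len(entities)
    df = Counter(word for tokens in entities for word in set(tokens))
    return {word: math.log(n / count) for word, count in df.items()}

def drop_low_idf(entities, idf, min_idf):
    """Remove words whose IDF is below the preset value min_idf."""
    return [[w for w in tokens if idf[w] >= min_idf] for tokens in entities]

# Hypothetical toy data: "the" occurs in every entity, so its IDF is 0.
entities = [["the", "tang", "monk"], ["the", "monkey", "king"], ["the", "white", "horse"]]
idf = idf_table(entities)
filtered = drop_low_idf(entities, idf, min_idf=0.5)
print(filtered)
```

Words that appear in every entity carry no distinguishing information, which is why they are the first to fall below the cutoff.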
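Based on the IDF values, the server computes the initial similarity of each pair of preprocessed entity vectors, an entity pair being any two entities. Specifically, after determining the word vectors that the two entity vectors of a target pair have in common, the server computes the initial similarity of the target entity vector pair with a preset initial similarity formula:
Figure PCTCN2019119307-appb-000001
where e_1 and e_2 denote the target entity vector pair, ω denotes a word vector common to both entity vectors, idf_1(ω) denotes the IDF value of ω in e_1, and idf_2(ω) denotes the IDF value of ω in e_2.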
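A rough sketch of candidate generation from shared words follows. The formula itself appears only as an image in the source, so the rule that combines idf_1(ω) and idf_2(ω) is not recoverable here; the `combine` parameter (a product by default) is a placeholder assumption, not the patent's formula. The names `initial_similarity` and `candidate_pairs` and all sample values are likewise hypothetical.

```python
def initial_similarity(e1, e2, idf1, idf2, combine=lambda a, b: a * b):
    """Score two tokenized entities from the words they share.

    `combine` stands in for the unknown way the source formula merges
    idf1(w) and idf2(w); the default product is only an assumption.
    """
    shared = set(e1) & set(e2)
    return sum(combine(idf1[w], idf2[w]) for w in shared)

def candidate_pairs(ents1, ents2, idf1, idf2, threshold):
    """Keep the cross-graph pairs whose initial similarity exceeds the threshold."""
    pairs = []
    for name1, tokens1 in ents1.items():
        for name2, tokens2 in ents2.items():
            score = initial_similarity(tokens1, tokens2, idf1, idf2)
            if score > threshold:
                pairs.append((name1, name2, score))
    return pairs

# Hypothetical toy data for two graphs.
ents1 = {"A": ["tang", "monk"]}
ents2 = {"B": ["tang", "sanzang"], "C": ["white", "horse"]}
idf1 = {"tang": 2.0, "monk": 1.0}
idf2 = {"tang": 1.5, "sanzang": 3.0, "white": 1.0, "horse": 1.0}
pairs = candidate_pairs(ents1, ents2, idf1, idf2, threshold=1.0)
print(pairs)
```

Only pairs that share at least one informative word can pass the threshold, which keeps the set of pairs to be aligned far smaller than the full cross product.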
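103. Generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned.
The server generates the pairing graph of the two knowledge graphs to be fused from the entity pairs to be aligned. The pairing graph contains the nodes of the entity pairs to be aligned, together with nodes generated from the triple data of the two knowledge graphs and the undirected edges connecting the nodes. The nodes of the pairing graph are thus the entity pair nodes to be aligned and the entity pair nodes related to them.
For example, "唐僧" (Tang Seng) and "唐三藏" (Tang Sanzang) form an entity pair to be aligned. The entity vector of "Tang Seng" is a and that of "Tang Sanzang" is b; "Tang Seng" is an entity in the first knowledge graph to be fused and "Tang Sanzang" an entity in the second. The first knowledge graph contains the triple (Tang Seng, is, monk) and the second contains the triple (Tang Sanzang, is, master); the entity vector of "monk" is c and that of "master" is d. The pairing graph then contains the entity pair node to be aligned (a, b) and its related entity pair node (c, d). Note that the entity pair nodes related to (c, d) are also entity pair nodes related to the initial node (a, b).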
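104. Compute the similarity values of the entity pairs to be aligned with the PR algorithm, based on the pairing graph.
Based on the pairing graph, the server computes the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm. The PR algorithm computes the stationary distribution of nodes under a random walk: an initial similarity value is computed for each node in the pairing graph and then propagated along the undirected edges to obtain the similarity between two nodes. For any two nodes in the pairing graph, a source node u and a landing node v, the PR algorithm yields the PPR (personalized PageRank) value from u to v, PPR(u→v). The PPR value is the stationary probability of walking from one node to another in a graph. From the PPR value and the initial similarity value, the similarity between the source node u and the landing node v can be obtained. The larger the PPR value, the higher the probability of walking from the source node u to the landing node v. The PR algorithm lets the server take the neighbor entity pairs of an entity pair to be aligned into account when deciding whether the entities are similar, which makes that decision more accurate.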
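To make the PPR value concrete, here is a minimal power-iteration sketch of personalized PageRank on an undirected graph. It is a generic textbook formulation, not necessarily the variant used in the source; the restart probability `alpha`, the iteration count, the function name, and the toy graph are all assumptions.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Approximate the PPR vector of `source` by power iteration.

    adj:   undirected graph as {node: [neighbors]}
    alpha: restart probability (assumed value; not given in the source)
    Returns {node: probability of the restarting walk being at node}.
    """
    nodes = list(adj)
    rank = {v: 0.0 for v in nodes}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            if not adj[v]:
                # A node without neighbors sends its walk mass back to the source.
                nxt[source] += (1 - alpha) * rank[v]
                continue
            share = (1 - alpha) * rank[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        nxt[source] += alpha  # restart mass always returns to the source
        rank = nxt
    return rank

# Hypothetical path graph s - a - b.
adj = {"s": ["a"], "a": ["s", "b"], "b": ["a"]}
ppr = personalized_pagerank(adj, "s")
print(ppr)
```

Nodes close to the source receive more probability mass than distant ones, which is what lets a trusted pair node lend support to the pairs around it.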
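105. Perform entity alignment according to the similarity values of the entity pairs to be aligned.
The server performs entity alignment according to the similarity values of the entity pairs to be aligned. Specifically, the server determines whether the similarity value of a target entity pair to be aligned, which may be any one of the entity pairs to be aligned, exceeds a preset similarity threshold. If it does, the server marks the target pair as a pair of entities with the same meaning; if it does not, the server determines that the two entities of the target pair have different meanings.
It can be understood that, when the server determines that the two entities of a pair to be aligned have the same meaning, it marks or associates the entity vectors of the two entities so that they can be identified as entities with the same meaning in the fused knowledge graph.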
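The threshold decision can be sketched in a few lines. The function name `align_pairs`, the similarity scores, and the threshold value are illustrative assumptions, not values from the source.

```python
def align_pairs(similarity, threshold):
    """Split candidate pairs into aligned and rejected by a preset threshold."""
    aligned = [pair for pair, score in similarity.items() if score > threshold]
    rejected = [pair for pair, score in similarity.items() if score <= threshold]
    return aligned, rejected

# Hypothetical similarity values for two candidate pairs.
similarity = {("Tang Seng", "Tang Sanzang"): 0.82, ("apple", "Apple"): 0.10}
aligned, rejected = align_pairs(similarity, threshold=0.5)
print(aligned, rejected)
```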
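In this embodiment of the application, similarity values between entities are computed with the PR algorithm and entity alignment is performed based on those values, which improves both the efficiency and the accuracy of entity alignment.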
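Referring to FIG. 2, another embodiment of the entity alignment method for knowledge graphs in the embodiments of this application includes:
201. Obtain the entities of two knowledge graphs to be fused.
The server obtains the entities of two knowledge graphs to be fused; the two knowledge graphs to be fused are any two knowledge graphs that are to be merged.
It should be noted that a complete knowledge graph is built by fusing multiple knowledge graphs and grouping entities that have the same meaning. The same entity name can have different meanings in different knowledge graphs: for example, the entity "apple" denotes a fruit in knowledge graph A and an electronics product in knowledge graph B. Conversely, different entity names in different knowledge graphs can have the same meaning: for example, the entity "猫王" (the King) in knowledge graph C and the entity "Elvis Presley" in knowledge graph D denote the same thing.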
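202. Generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused.
The server generates entity pairs to be aligned from the entities of the two knowledge graphs to be fused. After converting and preprocessing the entities of the two knowledge graphs to be fused, the server computes the initial similarity of every entity pair and sets the entity pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
Specifically, the entity pairs to be aligned are generated as follows: the server converts the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors; the server preprocesses the entity vectors; the server computes the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors; and the server sets the entity vector pairs whose initial similarity exceeds the preset threshold as entity pairs to be aligned.
In preprocessing, the server keeps the word vectors with high inverse document frequency (IDF) values in all entity vectors and deletes the words with low IDF values. Specifically, the server computes the IDF value of every word vector in the two knowledge graphs to be fused, and deletes from the entity vectors the word vectors whose IDF value is below a preset IDF value. The IDF value measures how common, and therefore how informative, a word is. A low IDF value means that the word occurs frequently; in an entity vector such a word usually does not help distinguish entities, as with the vectors of stop words such as "的" and "了". Deleting word vectors with low IDF values makes entity vector matching more efficient.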
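Based on the IDF values, the server computes the initial similarity of each pair of preprocessed entity vectors, an entity pair being any two entities. Specifically, after determining the word vectors that the two entity vectors of a target pair have in common, the server computes the initial similarity of the target entity vector pair with a preset initial similarity formula:
Figure PCTCN2019119307-appb-000002
where e_1 and e_2 denote the target entity vector pair, ω denotes a word vector common to both entity vectors, idf_1(ω) denotes the IDF value of ω in e_1, and idf_2(ω) denotes the IDF value of ω in e_2.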
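203. Obtain the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, where V_1 denotes all entities in the first knowledge graph to be fused, E_1 denotes all triples of the first knowledge graph to be fused, V_2 denotes all entities in the second knowledge graph to be fused, and E_2 denotes all triples of the second knowledge graph to be fused.
The server obtains the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, with V_1, E_1, V_2, and E_2 as defined above.
204. Create the pairing graph G_3 of the two knowledge graphs to be fused, and add to G_3 an initial node (a_s, b_s) representing a target entity pair to be aligned, where the target entity pair a_s and b_s is any one of the entity pairs to be aligned.
The server creates the pairing graph G_3 of the two knowledge graphs to be fused and adds to G_3 the initial node (a_s, b_s) representing the target entity pair to be aligned; the target entity pair a_s and b_s is any one of the entity pairs to be aligned.
For example, "唐僧" (Tang Seng) and "唐三藏" (Tang Sanzang) form an entity pair to be aligned. The entity vector of "Tang Seng" is a and that of "Tang Sanzang" is b; "Tang Seng" is an entity in the first knowledge graph to be fused and "Tang Sanzang" an entity in the second. The pairing graph G_3 then contains the entity pair node to be aligned (a, b), and (a, b) is the initial node.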
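205. Add the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes, based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2).
The server adds the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2). Apart from the initial node, the other nodes and undirected edges are added as follows. When the triple (a_s, p, a_n) ∈ E_1, the triple (b_s, p, b_n) ∈ E_2, a_n ∈ V_1, b_n ∈ V_2, and the entities a_n and b_n have the same type, the node (a_n, b_n) and an undirected edge connecting (a_s, b_s) and (a_n, b_n) are added to G_3; (a_n, b_n) is a neighbor entity pair node of (a_s, b_s). When the triple (a_n, p, c_n) ∈ E_1, the triple (b_n, p, d_n) ∈ E_2, c_n ∈ V_1, d_n ∈ V_2, and the entities c_n and d_n have the same type, the node (c_n, d_n) and an undirected edge connecting (a_n, b_n) and (c_n, d_n) are added to G_3. In general, when the triple (h_n, p, i_n) ∈ E_1, the triple (l_n, p, m_n) ∈ E_2, the node (h_n, l_n) belongs to G_3, and the entities i_n and m_n have the same type, the node (i_n, m_n) and an undirected edge connecting (h_n, l_n) and (i_n, m_n) are added to G_3; this is repeated until all triples in E_1 and E_2 have been traversed.
It should be noted that the relation p in the triples of this example is any relation in the first and/or second knowledge graph to be fused. The relation denoted by p in the triples (a_s, p, a_n) ∈ E_1, (b_s, p, b_n) ∈ E_2, (a_n, p, c_n) ∈ E_1, (b_n, p, d_n) ∈ E_2, (h_n, p, i_n) ∈ E_1, and (l_n, p, m_n) ∈ E_2 appearing in this application may be the same or different.
For example, "唐僧" (Tang Seng) and "唐三藏" (Tang Sanzang) form an entity pair to be aligned. The entity vector of "Tang Seng" is a and that of "Tang Sanzang" is b; "Tang Seng" is an entity in the first knowledge graph to be fused and "Tang Sanzang" an entity in the second. The first knowledge graph contains the triple (Tang Seng, is, monk) and the second contains the triple (Tang Sanzang, is, master); the entity vector of "monk" is c and that of "master" is d. "Monk" and "master" have the same entity type, since both denote an identity. The pairing graph G_3 therefore contains the initial node (a, b) representing the entity pair to be aligned and its neighbor entity pair node (c, d), and node (a, b) and node (c, d) are connected by an undirected edge.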
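The expansion rule can be sketched as a breadth-first traversal from the initial node. This is only one reading of the text: it assumes that the two triples of a matching pair must carry the same relation p and that entity types are available in a lookup table. The function name `build_pairing_graph` and the `entity_type` mapping are hypothetical.

```python
from collections import deque

def build_pairing_graph(seed, triples1, triples2, entity_type):
    """Grow a pairing graph from the initial pair node `seed`.

    triples1 / triples2: (head, relation, tail) triples of the two graphs.
    entity_type: {entity: type}; a pair is added only when the types match.
    Assumption: both triples of a match share the same relation.
    Returns (nodes, edges) with edges stored as frozensets of two pair nodes.
    """
    out1, out2 = {}, {}
    for h, p, t in triples1:
        out1.setdefault(h, []).append((p, t))
    for h, p, t in triples2:
        out2.setdefault(h, []).append((p, t))

    nodes, edges = {seed}, set()
    queue = deque([seed])
    while queue:
        a, b = queue.popleft()
        for p1, a_n in out1.get(a, []):
            for p2, b_n in out2.get(b, []):
                if p1 != p2 or entity_type[a_n] != entity_type[b_n]:
                    continue
                neighbor = (a_n, b_n)
                edges.add(frozenset([(a, b), neighbor]))  # undirected edge
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    queue.append(neighbor)
    return nodes, edges

# The Tang Seng example from the text.
seed = ("Tang Seng", "Tang Sanzang")
triples1 = [("Tang Seng", "is", "monk")]
triples2 = [("Tang Sanzang", "is", "master")]
entity_type = {"monk": "identity", "master": "identity"}
nodes, edges = build_pairing_graph(seed, triples1, triples2, entity_type)
print(nodes, edges)
```

Expanding only from nodes that are already in the graph keeps the pairing graph limited to pairs reachable from a candidate pair, instead of the full cross product of the two entity sets.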
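It should be noted that, to reduce the complexity of the pairing graph G_3, among conflicting neighbor entity pair nodes the server keeps the neighbor entity pair with the highest initial similarity and deletes the remaining conflicting neighbor entity pair nodes together with their undirected edges. Conflicting neighbor entity pair nodes are multiple neighbor entity pair nodes that contain the same entity: for example, if the entity p1 appears in both neighbor entity pair nodes (p1, p3) and (p1, p4), these two nodes are treated as conflicting neighbor entity pair nodes.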
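One way to resolve such conflicts is a greedy pass in descending order of initial similarity, sketched below. The source only says that the pair with the highest initial similarity is kept, so the greedy ordering, the function name `prune_conflicts`, and the sample scores are assumptions.

```python
def prune_conflicts(neighbors, initial_sim):
    """Keep, for each entity, only its best-scoring neighbor pair node.

    neighbors:   list of (x, y) pair nodes, x from graph 1 and y from graph 2
    initial_sim: {(x, y): initial similarity}
    Greedy assumption: visit pairs from high to low similarity and drop
    any pair that reuses an entity already kept.
    """
    kept, used_left, used_right = [], set(), set()
    for pair in sorted(neighbors, key=lambda p: initial_sim[p], reverse=True):
        x, y = pair
        if x in used_left or y in used_right:
            continue  # conflicts with a higher-scoring pair
        kept.append(pair)
        used_left.add(x)
        used_right.add(y)
    return kept

# Hypothetical scores: (p1, p3) and (p1, p4) conflict on p1.
neighbors = [("p1", "p3"), ("p1", "p4"), ("p2", "p4")]
initial_sim = {("p1", "p3"): 0.9, ("p1", "p4"): 0.6, ("p2", "p4"): 0.4}
kept = prune_conflicts(neighbors, initial_sim)
print(kept)
```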
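206. Compute the similarity values of the entity pairs to be aligned with the PR algorithm, based on the pairing graph G_3.
Based on the pairing graph G_3, the server computes the similarity values of the entity pairs to be aligned with the PR algorithm. The PR algorithm computes the stationary distribution of nodes under a random walk: an initial similarity value is computed for each node in the pairing graph and then propagated along the undirected edges to obtain the similarity between two nodes. For any two nodes in the pairing graph, a source node u and a landing node v, the PR algorithm yields the PPR (personalized PageRank) value from u to v, PPR(u→v). The PPR value is the stationary probability of walking from one node to the other nodes of a graph. From the PPR value and the initial similarity value, the similarity between the source node u and the landing node v can be obtained. The larger the PPR value, the higher the probability of walking from the source node u to the landing node v. The PR algorithm lets the server take the neighbor entity pairs of an entity pair to be aligned into account when deciding whether the entities are similar, which makes that decision more accurate.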
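The server computes the similarity value as follows. The server determines the shared original words w_n of the two knowledge graphs to be fused, that is, words already known to have the same meaning in both graphs. It then determines the neighbor entity pair nodes W_i^n of each shared original word w_n in the pairing graph G_3 and counts the number of those nodes, degree(w_n). For example, the shared original word "film" has the same meaning in both knowledge graphs to be fused, so the node (film, film) exists in G_3. Such a node is called a source node, and its similarity value is 1. Computing the similarity values of the entity pairs to be aligned starting from the source nodes helps improve their accuracy. The server then computes the similarity value of each entity pair to be aligned with a preset similarity formula:
Figure PCTCN2019119307-appb-000003
where e_3 and e_4 denote an entity pair to be aligned, deg(w_n) denotes degree(w_n), the number of neighbor entity pair nodes of the shared original word w_n, and PPR(W_i^n→<e_3,e_4>) denotes the PPR value from node W_i^n to node <e_3,e_4> in the pairing graph G_3.
By taking the sum of the PPR values from every source node in the pairing graph to the node to be aligned as the similarity value of the entity pair to be aligned, the server brings in the triple data related to that pair, which makes the similarity value more reliable and entity alignment more accurate.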
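The aggregation over source nodes can be sketched as follows. Because the formula is an image in the source, the exact role of deg(w_n) is not recoverable; dividing each word's contribution by deg(w_n) is an assumption, as are the function name `pair_similarity`, the precomputed `ppr` table, and the sample numbers.

```python
def pair_similarity(target, seed_neighbors, ppr):
    """Sum PPR contributions reaching `target` from the nodes of shared words.

    seed_neighbors: {shared word w_n: [neighbor pair nodes W_i^n]}
    ppr:            {(source node, target node): PPR value}, precomputed
    Assumption: each word's contribution is divided by deg(w_n).
    """
    total = 0.0
    for word, neighbor_nodes in seed_neighbors.items():
        deg = len(neighbor_nodes)
        for node in neighbor_nodes:
            total += ppr.get((node, target), 0.0) / deg
    return total

# Hypothetical values: two shared words with one and two neighbor nodes.
target = ("e3", "e4")
seed_neighbors = {
    "film": [("film", "film")],
    "city": [("c1", "c2"), ("c3", "c4")],
}
ppr = {
    (("film", "film"), target): 0.2,
    (("c1", "c2"), target): 0.1,
    (("c3", "c4"), target): 0.3,
}
score = pair_similarity(target, seed_neighbors, ppr)
print(score)
```

Under this reading, normalizing by the degree prevents a shared word with many neighbor nodes from dominating the score.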
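207. Perform entity alignment according to the similarity values of the entity pairs to be aligned.
The server performs entity alignment according to the similarity values of the entity pairs to be aligned. Specifically, the server determines whether the similarity value of a target entity pair to be aligned, which may be any one of the entity pairs to be aligned, exceeds a preset similarity threshold. If it does, the server marks the target pair as a pair of entities with the same meaning; if it does not, the server determines that the two entities of the target pair have different meanings.
It can be understood that, when the server determines that the two entities of a pair to be aligned have the same meaning, it marks or associates the entity vectors of the two entities so that it can identify them as entities with the same meaning in the fused knowledge graph.
In this embodiment of the application, similarity values between entities are computed with the PR algorithm and entity alignment is performed based on those values, which improves both the efficiency and the accuracy of entity alignment.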
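The entity alignment method for knowledge graphs in the embodiments of this application has been described above; the entity alignment apparatus for knowledge graphs is described below. Referring to FIG. 3, one embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of this application includes:
an obtaining unit 301, configured to obtain the entities of two knowledge graphs to be fused;
a first generating unit 302, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
a second generating unit 303, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
a computing unit 304, configured to compute, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
an aligning unit 305, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
In this embodiment of the application, similarity values between entities are computed with the PR algorithm and entity alignment is performed based on those values, which improves both the efficiency and the accuracy of entity alignment.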
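Referring to FIG. 4, another embodiment of the entity alignment apparatus for knowledge graphs in the embodiments of this application includes:
an obtaining unit 301, configured to obtain the entities of two knowledge graphs to be fused;
a first generating unit 302, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
a second generating unit 303, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
a computing unit 304, configured to compute, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
an aligning unit 305, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
Optionally, the second generating unit 303 specifically includes:
an obtaining module 3031, configured to obtain the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, where V_1 denotes all entities in the first knowledge graph to be fused, E_1 denotes all triples of the first knowledge graph to be fused, V_2 denotes all entities in the second knowledge graph to be fused, and E_2 denotes all triples of the second knowledge graph to be fused;
a creating module 3032, configured to create the pairing graph G_3 of the two knowledge graphs to be fused and add to G_3 an initial node (a_s, b_s) representing a target entity pair to be aligned, the target entity pair a_s and b_s being any one of the entity pairs to be aligned;
an adding module 3033, configured to add the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2).
Optionally, the adding module 3033 is specifically configured to:
when the triple (a_s, p, a_n) ∈ E_1, the triple (b_s, p, b_n) ∈ E_2, a_n ∈ V_1, b_n ∈ V_2, p is a relation in E_1 and/or E_2, and the entities a_n and b_n have the same type, add to the pairing graph G_3 the node (a_n, b_n) and an undirected edge connecting (a_s, b_s) and (a_n, b_n), (a_n, b_n) being a neighbor entity pair node of (a_s, b_s); when the triple (a_n, p, c_n) ∈ E_1, the triple (b_n, p, d_n) ∈ E_2, c_n ∈ V_1, d_n ∈ V_2, and the entities c_n and d_n have the same type, add to G_3 the node (c_n, d_n) and an undirected edge connecting (a_n, b_n) and (c_n, d_n); when the triple (h_n, p, i_n) ∈ E_1, the triple (l_n, p, m_n) ∈ E_2, the node (h_n, l_n) belongs to G_3, and the entities i_n and m_n have the same type, repeatedly add to G_3 the node (i_n, m_n) and an undirected edge connecting (h_n, l_n) and (i_n, m_n), until all triples in E_1 and E_2 have been traversed; determine the conflicting neighbor entity pair nodes of the initial node (a_s, b_s), conflicting neighbor entity pair nodes being multiple neighbor entity pair nodes that contain the same entity; and, among the conflicting neighbor entity pair nodes, keep the neighbor entity pair node with the highest initial similarity together with its undirected edge.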
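Optionally, the computing unit 304 is specifically configured to:
determine the shared original words w_n of the two knowledge graphs to be fused, the shared original words w_n being words already known to have the same meaning in both knowledge graphs; determine the neighbor entity pair nodes W_i^n of each shared original word w_n in the pairing graph G_3, and count the number of neighbor entity pair nodes of each w_n, degree(w_n); and compute the similarity values of the entity pairs to be aligned with a preset similarity formula:
Figure PCTCN2019119307-appb-000004
where e_3 and e_4 denote an entity pair to be aligned, deg(w_n) denotes degree(w_n), the number of neighbor entity pair nodes of the shared original word w_n, and PPR(W_i^n→<e_3,e_4>) denotes the PPR value from node W_i^n to node <e_3,e_4> in the pairing graph G_3.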
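Optionally, the first generating unit 302 specifically includes:
a converting module 3021, configured to convert the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors;
a processing module 3022, configured to preprocess the entity vectors;
a computing module 3023, configured to compute the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors;
a setting module 3024, configured to set the entity vector pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
Optionally, the processing module 3022 is specifically configured to:
compute the inverse document frequency (IDF) value of every word vector in the two knowledge graphs to be fused, each entity vector including at least one word vector; and delete from the entity vectors the word vectors whose IDF value is below a preset IDF value.
Optionally, the computing module 3023 is specifically configured to:
determine the word vectors that the two entity vectors of a target pair have in common; and compute the initial similarity of the target entity vector pair with a preset initial similarity formula:
Figure PCTCN2019119307-appb-000005
where e_1 and e_2 denote the target entity vector pair, ω denotes a word vector common to both entity vectors, idf_1(ω) denotes the IDF value of ω in e_1, and idf_2(ω) denotes the IDF value of ω in e_2.
In this embodiment of the application, similarity values between entities are computed with the PR algorithm and entity alignment is performed based on those values, which improves both the efficiency and the accuracy of entity alignment.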
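FIG. 3 and FIG. 4 above describe the entity alignment apparatus for knowledge graphs in detail from the perspective of modular functional entities; the entity alignment device for knowledge graphs is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of an entity alignment device for knowledge graphs provided by an embodiment of this application. The entity alignment device 500 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 501, a memory 509, and one or more storage media 508 (for example, one or more mass storage devices) storing application programs 507 or data 506. The memory 509 and the storage medium 508 may provide transient or persistent storage. The program stored in the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations for the entity alignment device. Further, the processor 501 may be configured to communicate with the storage medium 508 and to execute, on the entity alignment device 500, the series of instruction operations stored in the storage medium 508.
The entity alignment device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not limit the entity alignment device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. The processor 501 can perform the functions of the obtaining unit 301, the first generating unit 302, the second generating unit 303, the computing unit 304, and the aligning unit 305 of the above embodiments.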
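The components of the entity alignment device are described in detail below with reference to FIG. 5:
The processor 501 is the control center of the entity alignment device and can carry out processing according to the configured entity alignment method for knowledge graphs. The processor 501 connects the parts of the entity alignment device through various interfaces and lines, and performs the functions of the device and processes data by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, thereby realizing entity alignment of knowledge graphs. The storage medium 508 and the memory 509 are both carriers for storing data: the storage medium 508 may be an internal memory with a small capacity but a fast speed, and the memory 509 may be an external memory with a large capacity but a slow speed.
The memory 509 can be used to store software programs and modules; the processor 501 performs the functional applications and data processing of the entity alignment device 500 by running the software programs and modules stored in the memory 509. The memory 509 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (for example, an application program for computing similarity values); the data storage area may store data created through use of the entity alignment device (for example, the initial node). In addition, the memory 509 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. The program of the entity alignment method for knowledge graphs provided in the embodiments of this application and the received data stream are stored in the memory, and the processor 501 calls them from the memory 509 when needed.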
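This application also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions that, when run on a computer, cause the computer to execute the following steps:
obtaining the entities of two knowledge graphs to be fused;
generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
performing entity alignment according to the similarity values of the entity pairs to be aligned.
When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or twisted pair) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store to, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).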
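Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.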

Claims (20)

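  1. An entity alignment method for knowledge graphs, characterized by comprising:
    obtaining the entities of two knowledge graphs to be fused;
    generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
    generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
    computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
    performing entity alignment according to the similarity values of the entity pairs to be aligned.
  2. The entity alignment method for knowledge graphs according to claim 1, characterized in that generating the pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned comprises:
    obtaining the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, where V_1 denotes all entities in the first knowledge graph to be fused, E_1 denotes all triples of the first knowledge graph to be fused, V_2 denotes all entities in the second knowledge graph to be fused, and E_2 denotes all triples of the second knowledge graph to be fused;
    creating the pairing graph G_3 of the two knowledge graphs to be fused, and adding to the pairing graph G_3 an initial node (a_s, b_s) representing a target entity pair to be aligned, the target entity pair a_s and b_s being any one of the entity pairs to be aligned;
    adding the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2).
  3. The entity alignment method for knowledge graphs according to claim 2, characterized in that adding the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2) comprises:
    when the triple (a_s, p, a_n) ∈ E_1, the triple (b_s, p, b_n) ∈ E_2, p is a relation in E_1 and/or E_2, a_n ∈ V_1, b_n ∈ V_2, and the entities a_n and b_n have the same type, adding to the pairing graph G_3 the node (a_n, b_n) and an undirected edge connecting (a_s, b_s) and (a_n, b_n), (a_n, b_n) being a neighbor entity pair node of (a_s, b_s);
    when the triple (a_n, p, c_n) ∈ E_1, the triple (b_n, p, d_n) ∈ E_2, c_n ∈ V_1, d_n ∈ V_2, and the entities c_n and d_n have the same type, adding to the pairing graph G_3 the node (c_n, d_n) and an undirected edge connecting (a_n, b_n) and (c_n, d_n);
    when the triple (h_n, p, i_n) ∈ E_1, the triple (l_n, p, m_n) ∈ E_2, the node (h_n, l_n) belongs to the pairing graph G_3, and the entities i_n and m_n have the same type, repeatedly adding to the pairing graph G_3 the node (i_n, m_n) and an undirected edge connecting (h_n, l_n) and (i_n, m_n), until all triples in E_1 and E_2 have been traversed;
    determining the conflicting neighbor entity pair nodes of the initial node (a_s, b_s), the conflicting neighbor entity pair nodes being multiple neighbor entity pair nodes that contain the same entity;
    among the conflicting neighbor entity pair nodes, keeping the neighbor entity pair node with the highest initial similarity together with its undirected edge.
  4. The entity alignment method for knowledge graphs according to claim 3, characterized in that computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PR algorithm comprises:
    determining the shared original words w_n of the two knowledge graphs to be fused, the shared original words w_n being words already known to have the same meaning in the two knowledge graphs to be fused;
    determining the neighbor entity pair nodes W_i^n of each shared original word w_n in the pairing graph G_3, and counting the number of neighbor entity pair nodes of each w_n, degree(w_n);
    computing the similarity values of the entity pairs to be aligned with a preset similarity formula, the preset similarity formula being:
    Figure PCTCN2019119307-appb-100001
    where e_3 and e_4 denote an entity pair to be aligned, deg(w_n) denotes degree(w_n), the number of neighbor entity pair nodes of the shared original word w_n, and PPR(W_i^n→<e_3,e_4>) denotes the PPR value from node W_i^n to node <e_3,e_4> in the pairing graph G_3.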
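  5. The entity alignment method for knowledge graphs according to any one of claims 1-4, characterized in that generating the entity pairs to be aligned from the entities of the two knowledge graphs to be fused comprises:
    converting the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors;
    preprocessing the entity vectors;
    computing the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors;
    setting the entity vector pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
  6. The entity alignment method for knowledge graphs according to claim 5, characterized in that preprocessing the entity vectors comprises:
    computing the inverse document frequency (IDF) value of every word vector in the two knowledge graphs to be fused, each entity vector including at least one word vector;
    deleting from the entity vectors the word vectors whose IDF value is below a preset IDF value.
  7. The entity alignment method for knowledge graphs according to claim 6, characterized in that computing the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors, comprises:
    determining the word vectors that the two entity vectors of a target entity vector pair have in common;
    computing the initial similarity of the target entity vector pair with a preset initial similarity formula, the preset initial similarity formula being:
    Figure PCTCN2019119307-appb-100002
    where e_1 and e_2 denote the target entity vector pair, ω denotes a word vector common to both entity vectors of the target pair, idf_1(ω) denotes the IDF value of ω in e_1, and idf_2(ω) denotes the IDF value of ω in e_2.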
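  8. An entity alignment apparatus for knowledge graphs, characterized by comprising:
    an obtaining unit, configured to obtain the entities of two knowledge graphs to be fused;
    a first generating unit, configured to generate entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
    a second generating unit, configured to generate a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
    a computing unit, configured to compute, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
    an aligning unit, configured to perform entity alignment according to the similarity values of the entity pairs to be aligned.
  9. The entity alignment apparatus for knowledge graphs according to claim 8, wherein the second generating unit comprises:
    an obtaining module, configured to obtain the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, where V_1 denotes all entities in the first knowledge graph to be fused, E_1 denotes all triples of the first knowledge graph to be fused, V_2 denotes all entities in the second knowledge graph to be fused, and E_2 denotes all triples of the second knowledge graph to be fused;
    a creating module, configured to create the pairing graph G_3 of the two knowledge graphs to be fused and add to the pairing graph G_3 an initial node (a_s, b_s) representing a target entity pair to be aligned, the target entity pair a_s and b_s being any one of the entity pairs to be aligned;
    an adding module, configured to add the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2).
  10. The entity alignment apparatus for knowledge graphs according to claim 9, wherein the adding module is specifically configured to:
    when the triple (a_s, p, a_n) ∈ E_1, the triple (b_s, p, b_n) ∈ E_2, p is a relation in E_1 and/or E_2, a_n ∈ V_1, b_n ∈ V_2, and the entities a_n and b_n have the same type, add to the pairing graph G_3 the node (a_n, b_n) and an undirected edge connecting (a_s, b_s) and (a_n, b_n), (a_n, b_n) being a neighbor entity pair node of (a_s, b_s);
    when the triple (a_n, p, c_n) ∈ E_1, the triple (b_n, p, d_n) ∈ E_2, c_n ∈ V_1, d_n ∈ V_2, and the entities c_n and d_n have the same type, add to the pairing graph G_3 the node (c_n, d_n) and an undirected edge connecting (a_n, b_n) and (c_n, d_n);
    when the triple (h_n, p, i_n) ∈ E_1, the triple (l_n, p, m_n) ∈ E_2, the node (h_n, l_n) belongs to the pairing graph G_3, and the entities i_n and m_n have the same type, repeatedly add to the pairing graph G_3 the node (i_n, m_n) and an undirected edge connecting (h_n, l_n) and (i_n, m_n), until all triples in E_1 and E_2 have been traversed;
    determine the conflicting neighbor entity pair nodes of the initial node (a_s, b_s), the conflicting neighbor entity pair nodes being multiple neighbor entity pair nodes that contain the same entity;
    among the conflicting neighbor entity pair nodes, keep the neighbor entity pair node with the highest initial similarity together with its undirected edge.
  11. The entity alignment apparatus for knowledge graphs according to claim 10, wherein the computing unit is specifically configured to:
    determine the shared original words w_n of the two knowledge graphs to be fused, the shared original words w_n being words already known to have the same meaning in the two knowledge graphs to be fused;
    determine the neighbor entity pair nodes W_i^n of each shared original word w_n in the pairing graph G_3, and count the number of neighbor entity pair nodes of each w_n, degree(w_n);
    compute the similarity values of the entity pairs to be aligned with a preset similarity formula, the preset similarity formula being:
    Figure PCTCN2019119307-appb-100003
    where e_3 and e_4 denote an entity pair to be aligned, deg(w_n) denotes degree(w_n), the number of neighbor entity pair nodes of the shared original word w_n, and PPR(W_i^n→<e_3,e_4>) denotes the PPR value from node W_i^n to node <e_3,e_4> in the pairing graph G_3.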
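  12. The entity alignment apparatus for knowledge graphs according to any one of claims 8-11, wherein the first generating unit is specifically configured to:
    convert the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors;
    preprocess the entity vectors;
    compute the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors;
    set the entity vector pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
  13. The entity alignment apparatus for knowledge graphs according to claim 12, wherein the processing module is specifically configured to:
    compute the inverse document frequency (IDF) value of every word vector in the two knowledge graphs to be fused, each entity vector including at least one word vector;
    delete from the entity vectors the word vectors whose IDF value is below a preset IDF value.
  14. The entity alignment apparatus for knowledge graphs according to claim 13, wherein the computing module is specifically configured to:
    determine the word vectors that the two entity vectors of a target entity vector pair have in common;
    compute the initial similarity of the target entity vector pair with a preset initial similarity formula, the preset initial similarity formula being:
    Figure PCTCN2019119307-appb-100004
    where e_1 and e_2 denote the target entity vector pair, ω denotes a word vector common to both entity vectors of the target pair, idf_1(ω) denotes the IDF value of ω in e_1, and idf_2(ω) denotes the IDF value of ω in e_2.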
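  15. An entity alignment device for knowledge graphs, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    obtaining the entities of two knowledge graphs to be fused;
    generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
    generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
    computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
    performing entity alignment according to the similarity values of the entity pairs to be aligned.
  16. The entity alignment device for knowledge graphs according to claim 15, wherein, when the processor executes the computer program to generate the pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned, the following steps are included:
    obtaining the graphs G_1(V_1, E_1) and G_2(V_2, E_2) of the two knowledge graphs to be fused, where V_1 denotes all entities in the first knowledge graph to be fused, E_1 denotes all triples of the first knowledge graph to be fused, V_2 denotes all entities in the second knowledge graph to be fused, and E_2 denotes all triples of the second knowledge graph to be fused;
    creating the pairing graph G_3 of the two knowledge graphs to be fused, and adding to the pairing graph G_3 an initial node (a_s, b_s) representing a target entity pair to be aligned, the target entity pair a_s and b_s being any one of the entity pairs to be aligned;
    adding the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2).
  17. The entity alignment device for knowledge graphs according to claim 16, wherein, when the processor executes the computer program to add the other nodes of the pairing graph G_3 and the undirected edges connecting the nodes based on the initial node (a_s, b_s), the graph G_1(V_1, E_1), and the graph G_2(V_2, E_2), the following steps are included:
    when the triple (a_s, p, a_n) ∈ E_1, the triple (b_s, p, b_n) ∈ E_2, p is a relation in E_1 and/or E_2, a_n ∈ V_1, b_n ∈ V_2, and the entities a_n and b_n have the same type, adding to the pairing graph G_3 the node (a_n, b_n) and an undirected edge connecting (a_s, b_s) and (a_n, b_n), (a_n, b_n) being a neighbor entity pair node of (a_s, b_s);
    when the triple (a_n, p, c_n) ∈ E_1, the triple (b_n, p, d_n) ∈ E_2, c_n ∈ V_1, d_n ∈ V_2, and the entities c_n and d_n have the same type, adding to the pairing graph G_3 the node (c_n, d_n) and an undirected edge connecting (a_n, b_n) and (c_n, d_n);
    when the triple (h_n, p, i_n) ∈ E_1, the triple (l_n, p, m_n) ∈ E_2, the node (h_n, l_n) belongs to the pairing graph G_3, and the entities i_n and m_n have the same type, repeatedly adding to the pairing graph G_3 the node (i_n, m_n) and an undirected edge connecting (h_n, l_n) and (i_n, m_n), until all triples in E_1 and E_2 have been traversed;
    determining the conflicting neighbor entity pair nodes of the initial node (a_s, b_s), the conflicting neighbor entity pair nodes being multiple neighbor entity pair nodes that contain the same entity;
    among the conflicting neighbor entity pair nodes, keeping the neighbor entity pair node with the highest initial similarity together with its undirected edge.
  18. The entity alignment device for knowledge graphs according to claim 17, wherein, when the processor executes the computer program to compute, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PR algorithm, the following steps are included:
    determining the shared original words w_n of the two knowledge graphs to be fused, the shared original words w_n being words already known to have the same meaning in the two knowledge graphs to be fused;
    determining the neighbor entity pair nodes W_i^n of each shared original word w_n in the pairing graph G_3, and counting the number of neighbor entity pair nodes of each w_n, degree(w_n);
    computing the similarity values of the entity pairs to be aligned with a preset similarity formula, the preset similarity formula being:
    Figure PCTCN2019119307-appb-100005
    where e_3 and e_4 denote an entity pair to be aligned, deg(w_n) denotes degree(w_n), the number of neighbor entity pair nodes of the shared original word w_n, and PPR(W_i^n→<e_3,e_4>) denotes the PPR value from node W_i^n to node <e_3,e_4> in the pairing graph G_3.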
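  19. The entity alignment device for knowledge graphs according to any one of claims 15-18, wherein, when the processor executes the computer program to generate the entity pairs to be aligned from the entities of the two knowledge graphs to be fused, the following steps are included:
    converting the entities of the two knowledge graphs to be fused into vectors with a bag-of-words model, obtaining entity vectors;
    preprocessing the entity vectors;
    computing the initial similarity of each entity vector pair, an entity vector pair being any two entity vectors;
    setting the entity vector pairs whose initial similarity exceeds a preset threshold as entity pairs to be aligned.
  20. A computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to execute the following steps:
    obtaining the entities of two knowledge graphs to be fused;
    generating entity pairs to be aligned from the entities of the two knowledge graphs to be fused;
    generating a pairing graph of the two knowledge graphs to be fused based on the entity pairs to be aligned;
    computing, based on the pairing graph, the similarity values of the entity pairs to be aligned with the PageRank (PR) algorithm;
    performing entity alignment according to the similarity values of the entity pairs to be aligned.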
PCT/CN2019/119307 2019-10-30 2019-11-19 知识图谱的实体对齐方法、装置、设备及存储介质 WO2021082100A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911043138.9A CN111046186A (zh) 2019-10-30 2019-10-30 知识图谱的实体对齐方法、装置、设备及存储介质
CN201911043138.9 2019-10-30

Publications (1)

Publication Number Publication Date
WO2021082100A1 true WO2021082100A1 (zh) 2021-05-06

Family

ID=70231800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/119307 WO2021082100A1 (zh) 2019-10-30 2019-11-19 知识图谱的实体对齐方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111046186A (zh)
WO (1) WO2021082100A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761221A (zh) * 2021-06-30 2021-12-07 中国人民解放军32801部队 基于图神经网络的知识图谱实体对齐方法
CN114036307A (zh) * 2021-09-17 2022-02-11 清华大学 一种知识图谱实体对齐方法及装置

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046186A (zh) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 知识图谱的实体对齐方法、装置、设备及存储介质
CN111563192B (zh) * 2020-04-28 2023-05-30 腾讯科技(深圳)有限公司 实体对齐方法、装置、电子设备及存储介质
CN111914559B (zh) * 2020-07-31 2023-04-07 平安科技(深圳)有限公司 基于概率图模型的文本属性抽取方法、装置及计算机设备
CN112258339A (zh) * 2020-09-29 2021-01-22 广东电力通信科技有限公司 一种基于电网系统技术的数据处理储备方法及系统
CN112231416B (zh) * 2020-10-14 2024-06-14 中国平安人寿保险股份有限公司 知识图谱本体更新方法、装置、计算机设备及存储介质
CN112765362B (zh) * 2021-01-12 2023-06-13 中国人民解放军国防科技大学 基于改进自编码器的知识图谱实体对齐方法及相关设备
CN112765370B (zh) * 2021-03-29 2021-07-06 腾讯科技(深圳)有限公司 知识图谱的实体对齐方法、装置、计算机设备和存储介质
CN113220908B (zh) * 2021-07-08 2021-11-05 杭州智会学科技有限公司 知识图谱的匹配方法和装置
CN113536796A (zh) * 2021-07-15 2021-10-22 北京明略昭辉科技有限公司 一种实体对齐辅助方法、装置、设备及存储介质
CN113704495B (zh) * 2021-08-30 2024-05-28 合肥智能语音创新发展有限公司 实体对齐方法、装置、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202041A (zh) * 2016-07-01 2016-12-07 北京奇虎科技有限公司 一种解决知识图谱中的实体对齐问题的方法和装置
US20180307724A1 (en) * 2017-04-24 2018-10-25 Oath Inc. Reducing query ambiguity using graph matching
CN109739939A (zh) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 知识图谱的数据融合方法和装置
CN109960810A (zh) * 2019-03-28 2019-07-02 科大讯飞(苏州)科技有限公司 一种实体对齐方法及装置
CN111046186A (zh) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 知识图谱的实体对齐方法、装置、设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480191B (zh) * 2017-07-12 2020-08-21 清华大学 一种迭代的实体对齐模型
CN109597856B (zh) * 2018-12-05 2020-12-25 北京知道创宇信息技术股份有限公司 一种数据处理方法、装置、电子设备及存储介质
CN110188206A (zh) * 2019-05-08 2019-08-30 北京邮电大学 基于翻译模型的协同迭代联合实体对齐方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202041A (zh) * 2016-07-01 2016-12-07 北京奇虎科技有限公司 一种解决知识图谱中的实体对齐问题的方法和装置
US20180307724A1 (en) * 2017-04-24 2018-10-25 Oath Inc. Reducing query ambiguity using graph matching
CN109739939A (zh) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 知识图谱的数据融合方法和装置
CN109960810A (zh) * 2019-03-28 2019-07-02 科大讯飞(苏州)科技有限公司 一种实体对齐方法及装置
CN111046186A (zh) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 知识图谱的实体对齐方法、装置、设备及存储介质

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761221A (zh) * 2021-06-30 2021-12-07 中国人民解放军32801部队 基于图神经网络的知识图谱实体对齐方法
CN114036307A (zh) * 2021-09-17 2022-02-11 清华大学 一种知识图谱实体对齐方法及装置
CN114036307B (zh) * 2021-09-17 2022-09-13 清华大学 一种知识图谱实体对齐方法及装置

Also Published As

Publication number Publication date
CN111046186A (zh) 2020-04-21

Similar Documents

Publication Publication Date Title
WO2021082100A1 (zh) 知识图谱的实体对齐方法、装置、设备及存储介质
US10102227B2 (en) Image-based faceted system and method
Ren et al. Cluscite: Effective citation recommendation by information network-based clustering
KR101130374B1 (ko) 유형 내 및 유형 간 관계에 기초하여 객체들을 순위화하기위한 방법 및 시스템
KR101130533B1 (ko) 이종 관계에 기초하여 객체들의 유사성을 결정하기 위한방법 및 시스템
WO2017137000A1 (zh) 对描述同一实体的不同实例进行合并的方法、装置及设备
WO2019127299A1 (zh) 数据查询方法、电子设备及存储介质
CN111026865B (zh) 知识图谱的关系对齐方法、装置、设备及存储介质
WO2021027331A1 (zh) 基于图数据的全量关系计算方法、装置、设备及存储介质
JP7354014B2 (ja) 情報処理装置、情報処理方法及び情報処理プログラム
Zhang et al. Fast eclat algorithms based on minwise hashing for large scale transactions
Epasto et al. Massively parallel and dynamic algorithms for minimum size clustering
Liao et al. A scalable approach for content based image retrieval in cloud datacenter
WO2019127300A1 (zh) 数据存储方法、电子设备及存储介质
CN110019646B (zh) 一种建立索引的方法和装置
KR101780534B1 (ko) 이미지 기반 검색을 위한 맵리듀스 기반의 이미지 특징 추출 방법 및 시스템
CN113821657A (zh) 基于人工智能的图像处理模型训练方法及图像处理方法
CN112765490A (zh) 一种基于知识图谱和图卷积网络的信息推荐方法及系统
Yu et al. Co-simmate: Quick retrieving all pairwise co-simrank scores
WO2013097065A1 (zh) 一种索引数据处理方法及设备
US20160092595A1 (en) Systems And Methods For Processing Graphs
Ayat et al. Entity resolution for distributed probabilistic data
Gong et al. Cb-cloudle: A centroid-based cloud service search engine
Nithiya et al. Integration of Big Data and Cloud Computing To Detect Black Money Check Rotation with Range Aggregate Queries.
KR20200094674A (ko) 에지 프루닝을 이용한 그래프 희소화 방법 및 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19950987

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19950987

Country of ref document: EP

Kind code of ref document: A1