WO2020135048A1 - Data merging method and apparatus for knowledge graph

Info

Publication number: WO2020135048A1
Application number: PCT/CN2019/124552
Authority: WO
Inventors: 刘涛; 朱宏明; 顾江; 姜逸之; 王晓文; 周游
Original assignee: 颖投信息科技(上海)有限公司
Priority date: 2018-12-29
Filing date: 2019-12-11
Publication date: 2020-07-02
Also published as: CN109739939A

Abstract

A data merging method and apparatus for a knowledge graph. A system for implementing the method comprises a data platform configured with a unified access interface. The method comprises: processing data from different data sources and then converting same to a subject-property-object format, storing same in the data platform by means of the unified access interface, and receiving graph data index information returned by the data platform; according to the graph data index information, dividing subjects stored in the data platform into one or more sub-blocks according to the attribute; performing similarity calculation on candidate subjects classified into the same sub-block, and screening matching subject pairs that meet a preset similarity condition; and supplementing and/or replacing subject attribute values of the matching subject pairs to generate unified subject representation. By the abovementioned method, the data merging problem that existing data merging techniques cannot flexibly adapt to different knowledge graphs can be effectively solved.

Description

Data fusion method and device for knowledge graph

This application claims priority to Chinese invention patent application No. 201811635696.X, entitled "Data Fusion Method and Apparatus of Knowledge Graph" and filed on December 29, 2018, the entire content of which is hereby incorporated by reference.

Technical field

This application relates to the technical field of knowledge graphs, and in particular, to a data fusion method and device for knowledge graphs.

Background

A knowledge graph is a large semantic network that describes entities or concepts in the real world and the relationships among them. Its nodes represent entities or concepts, and its edges represent attributes or relationships. The term "knowledge graph" is now used broadly to refer to various large-scale knowledge bases. An entity is something distinguishable that exists independently, such as a country, a company, or a person. An attribute is an inherent characteristic of an entity. For example, countries have attributes such as "population" and "area" (as shown in Figure 4), and companies have attributes such as "name" and "legal representative". A relationship describes an association between one entity and another. For example, a company is registered in a country, and a person is employed by a company.

The nodes and edges of a knowledge graph are generally defined as triples (SPO, Subject-Property-Object), which take the form (entity 1, relation, entity 2) or (entity, attribute, attribute value). A knowledge graph can therefore be represented as a collection of triples. Its data model can be drawn as a graph (as shown in Figure 4), and a graph database is used for data storage and management.
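As a minimal sketch of the triple representation described above, a small knowledge graph can be held as a set of SPO tuples. The `Triple` type, the `attributes_of` helper, and all identifiers and values are illustrative assumptions and do not come from the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-property-object statement (hypothetical helper type)."""
    subject: str
    prop: str
    obj: str


# A tiny knowledge graph as a set of SPO triples; all values are made up.
graph = {
    Triple("company:1", "name", "Example Corp"),        # entity, attribute, attribute value
    Triple("company:1", "registered_in", "country:1"),  # entity 1, relation, entity 2
    Triple("country:1", "name", "Exampleland"),
}


def attributes_of(triples, subject):
    """Collect every property/object pair stated about one subject."""
    return {t.prop: t.obj for t in triples if t.subject == subject}
```

Calling `attributes_of(graph, "company:1")` gathers both the attribute and the relationship of that node, which is the view a graph database exposes for a single entity.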

Knowledge in the real world comes from many sources, which leads to problems such as uneven knowledge quality, duplicated knowledge across data sources, and a lack of hierarchy in the knowledge base. In addition, different data sources may represent the same entity differently. For example, a company entity in Baidu Encyclopedia has the name attribute 'Alibaba', while a company entity crawled from Google search has the name attribute 'alibaba'. These two entities may point to the same real-world entity, so their attributes and extended relationships need to be integrated to generate a unique entity node in the knowledge graph, eliminate ambiguity, and produce a high-quality knowledge base.

Existing data fusion solutions generally include the main steps of partition indexing, similarity calculation, and entity fusion. In a specific implementation, however, the partitioning algorithm, similarity matching algorithm, and entity alignment algorithm are selected according to the characteristics of the data source and knowledge base, and are then integrated into one complete system. When the scope of the data source or knowledge base changes, the data fusion system has to be rebuilt to meet the new requirements.

Summary of the invention

The present application provides a data fusion method and device for knowledge graph, which is used to solve the problem that the existing data fusion technology cannot flexibly adapt to the data fusion of different knowledge bases.

The present application discloses a data fusion method for a knowledge graph. The system that performs the method includes a data platform configured with a unified access interface. The method includes: processing data from different data sources, converting it into a triple format, storing it on the data platform through the unified access interface, and receiving graph data index information returned by the data platform; dividing, according to the graph data index information, the entities stored in the data platform into one or more sub-partitions by attribute; performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting matching entity pairs that meet a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.

Preferably, before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further includes: aligning, according to the actual meaning of their attributes, the entities that are stored in the data platform after data from multiple data sources has been converted into the triple format.

Preferably, the sub-partitions are divided either by equal-value partitioning based on a globally unique partition key generated from entity attributes, or based on a preset clustering model.

Preferably, performing similarity calculation on the candidate entity pairs that fall into the same sub-partition and selecting the matching entity pairs that meet the preset similarity condition specifically includes: setting different weights for the attributes of the entity itself and the attributes of other entities related to the entity, and calculating the overall similarity of each candidate entity pair as a weighted sum; and, if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, treating the candidate entity pair as a matching entity pair.

Preferably, the method of supplementing the missing entity attribute value is to obtain it from the network through a crawler or perform manual filling.

Preferably, the graph data index information is a storage address of the graph data in the triplet format on the data platform and its metadata.

The present application also discloses a data fusion device for a knowledge graph, which includes a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, wherein: the data platform is configured with a unified access interface; the data preprocessing module is configured to process data from different data sources, convert it into a triple format, store it on the data platform through the unified access interface, and receive graph data index information returned by the data platform; the entity partitioning module is configured to divide the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module; the entity matching module is configured to perform similarity calculation on the candidate entity pairs that the entity partitioning module has placed in the same sub-partition, and to select matching entity pairs that meet a preset similarity condition; and the entity fusion module is configured to supplement and/or replace the entity attribute values of the matching entity pairs selected by the entity matching module to generate a unified entity representation.

Preferably, the entity partitioning module includes an equal-value partitioning submodule and/or a clustering partitioning submodule; the equal-value partitioning submodule is configured to partition the entities stored in the data platform by equal value of a globally unique partition key generated from entity attributes; the clustering partitioning submodule is configured to partition the entities stored in the data platform based on a preset clustering model.

Preferably, the entity matching module specifically includes a similarity calculation submodule and a comparison submodule; the similarity calculation submodule is configured to set different weights for the attributes of the entity itself and the attributes of other entities related to the entity, and to calculate the overall similarity of each candidate entity pair as a weighted sum; the comparison submodule is configured to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to treat the candidate entity pair as a matching entity pair.

Preferably, the device further includes a data processing module and/or an attribute alignment module; the data processing module is configured to process node entity data and edge entity data in the data platform through the unified access interface and to pass the data processing result to the next module; the attribute alignment module is configured to align, according to the actual meaning of their attributes, the entities that are stored in the data platform after data from multiple data sources has been processed by the data preprocessing module.

The present application also discloses a storage medium on which a program configured to execute the above method is recorded.

Compared with the prior art, this application has the following advantages:

The stages in the preferred embodiment of the present application have upstream and downstream dependencies within the pipeline, but they constrain one another only through the data format and are decoupled through the unified interface provided by the data platform, so each stage can be developed independently and its algorithm can be flexibly replaced. By implementing a custom stage, a new processing stage can be inserted between existing stages, so that custom requirements can be freely orchestrated. In addition, this application places no restrictions on the architecture of the data platform. For example, a Hadoop distributed file system or a cloud computing architecture may be used so that computing and storage resources can be expanded when the amount of data increases.

Brief description of the drawings

The drawings are only for the purpose of showing the preferred embodiments, and are not considered to limit the present application. Furthermore, throughout the drawings, the same reference symbols are used to denote the same components. In the drawings:

FIG. 1 is a schematic flowchart of a first embodiment of the data fusion method for a knowledge graph of the application;

FIG. 2 is a schematic flowchart of a second embodiment of the data fusion method for a knowledge graph of the application;

FIG. 3 is a schematic structural diagram of an embodiment of the data fusion device for a knowledge graph of the application;

FIG. 4 is a schematic diagram of the graph data model of a knowledge graph.

Detailed description

In order to make the above objects, features and advantages of the present application more obvious and understandable, the present application will be described in further detail below with reference to the accompanying drawings and specific embodiments.

In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined as "first" and "second" may explicitly or implicitly include one or more of those features. "Plurality" means two or more, unless specifically defined otherwise. The terms "including", "comprising" and similar terms should be understood as open terms, i.e. "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment". Definitions of other terms are given in the description below.

Referring to FIG. 1, the flow of the first embodiment of the data fusion method for a knowledge graph of the present application is shown. The system implementing this method embodiment is provided with a data platform that supplies the operating environment and computing resources for each stage, and the stages interact through the data platform's unified access interface. In a specific implementation, the data platform can be built on the Hadoop distributed file system, a cloud computing architecture (such as Amazon AWS EMR), or another architecture; this application imposes no limitation. The method embodiment includes the following stages:

1. Data preprocessing stage (InputStage): data from multiple homogeneous or heterogeneous data sources (such as structured data A and unstructured data B) is processed into the same entity and attribute format (SPO format) and serves as the input to subsequent stages.

By configuring different data source information and data models, data is extracted, cleaned, and transformed from the data source and stored on the data platform in a unified data format. For example, for a relational database data source, by configuring connection information, entity types and entity tables, relationship types and relationship tables, you can extract the required SPO data. For graph databases, nodes (entity-attribute-attribute values) and edges (entity-relationship-entity) are natural SPO structures.

Some configuration parameters in the data preprocessing stage are shown in the following table.

In specific implementation, you can use a custom (CustomInputStage) method to preprocess different data sources, the interface form is as follows:

By reading the configuration defined in the above table, the data can be parsed and stored after reading the remote data. For example, for unstructured data sources, you can call machine learning interfaces, network interfaces, etc. to complete knowledge extraction, save it as triple information, and return the address and metadata information of the saved data.
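The extraction described above for a relational data source can be sketched roughly as follows. This is not the interface referred to in the text; the function name `rows_to_triples`, its parameters, and the sample rows are assumptions made purely for illustration.

```python
def rows_to_triples(rows, entity_type, id_column):
    """Turn relational rows (dicts) into (subject, property, object) triples.

    Hypothetical helper: each row becomes one entity whose subject id is
    built from the entity type and the row's primary-key column.
    """
    triples = []
    for row in rows:
        subject = f"{entity_type}:{row[id_column]}"
        for column, value in row.items():
            # Skip the key column itself and missing values.
            if column == id_column or value is None:
                continue
            triples.append((subject, column, value))
    return triples


# Made-up rows standing in for an "entity table" of a relational source.
company_rows = [
    {"id": 1, "name": "Alibaba", "country": "CN"},
    {"id": 2, "name": "Example Corp", "country": None},
]
company_triples = rows_to_triples(company_rows, "company", "id")
```

Missing values are simply dropped here; a later fusion step is then free to supplement them from another source.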

2. Entity partitioning stage (BlockingStage): Entities from multiple data sources are divided into different sub-partitions (Blocks) according to their attributes to reduce the data size of candidate matching pairs.

For data sources S and T that need to be matched, suppose the entity data of S has size m and the entity data of T has size n; the number of pairs that must be checked for a match is then m*n. In a big data scenario, a comparison at this scale is practically infeasible, so the number of pairs to be matched must be reduced.

During specific implementation, entity pairs that are unlikely to match in the two data sources can be divided into different data partitions in advance, so that the data size in each data partition is greatly reduced, and multiple data partitions can be calculated in parallel.
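The saving from partitioning can be made explicit with a short calculation. The symbols $k$, $m_i$ and $n_i$ are introduced here for illustration only: suppose the entities are split into $k$ partitions, with $m_i$ entities of S and $n_i$ entities of T in partition $i$.

```latex
\[
\text{pairs without partitioning} = m \cdot n,
\qquad
\text{pairs with partitioning} = \sum_{i=1}^{k} m_i\, n_i
\;\le\; \Bigl(\sum_{i=1}^{k} m_i\Bigr)\Bigl(\sum_{i=1}^{k} n_i\Bigr) = m \cdot n .
\]
\[
\text{If all partitions have equal size } \bigl(m_i = m/k,\; n_i = n/k\bigr):
\qquad
\sum_{i=1}^{k} \frac{m}{k}\cdot\frac{n}{k} = \frac{m\,n}{k}.
\]
```

Under the equal-size assumption the number of comparisons therefore shrinks by a factor of $k$, before any gain from processing partitions in parallel.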

For example, for company entities in S and T that need to be matched, entities registered in different countries are generally unlikely to be the same real-world company, so the data can be divided into more than 220 partitions (one per country or region) according to each company's country attribute. Each partition can be further divided into sub-partitions by identical or similar attributes. For example, companies in the 'US' partition can be assigned to new partitions according to their 'state' attribute. In the end, the number of pairs to be matched equals the sum of the pair counts of all data partitions. In subsequent calculations, all data partitions can be processed in parallel, which greatly reduces the overall matching time.
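The country and state example above can be sketched as follows. The helper names `block_by` and `candidate_pairs`, the field names, and the sample records are hypothetical; entities with a missing attribute are grouped under `None` in this sketch, which is only one possible policy.

```python
from collections import defaultdict
from itertools import product


def block_by(entities, keys):
    """Group entities by the tuple of values they hold for `keys`."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[tuple(entity.get(k) for k in keys)].append(entity)
    return blocks


def candidate_pairs(source_s, source_t, keys):
    """Yield only the cross-source pairs that share the same block key."""
    blocks_s = block_by(source_s, keys)
    blocks_t = block_by(source_t, keys)
    for key in blocks_s.keys() & blocks_t.keys():
        yield from product(blocks_s[key], blocks_t[key])


# Made-up company records from two sources.
S = [
    {"id": "s1", "country": "US", "state": "CA"},
    {"id": "s2", "country": "US", "state": "NY"},
    {"id": "s3", "country": "CN", "state": None},
]
T = [
    {"id": "t1", "country": "US", "state": "CA"},
    {"id": "t2", "country": "CN", "state": None},
    {"id": "t3", "country": "DE", "state": None},
]

by_country = list(candidate_pairs(S, T, ["country"]))
by_country_and_state = list(candidate_pairs(S, T, ["country", "state"]))
```

With these six records the unpartitioned comparison has 9 pairs, partitioning by country leaves 3, and adding the state level leaves 2. Each block can be handed to a separate worker because no pair spans two blocks.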

Some configuration parameters of the entity partitioning stage are shown in the following table.

In addition, the partitioning method of the entity partitioning stage (BlockingStage) can be extended through a custom partitioning algorithm, for example, through the following interface form:

A globally unique block key can be generated from the attributes of the entity's current partition and the attribute used for the next level of partitioning, and the data is assigned to the next-level partition by this key. A partition is no longer divided once its number of possible matching entity pairs falls to a minimum value or the total number of partitions reaches a maximum value.
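One way to picture the block key and the stopping rule is the sketch below. It is an assumption-laden illustration, not the interface referred to in the text: the key is a SHA-1 digest of the parent key plus the next attribute (treated as unique on the assumption that hash collisions are negligible), and block size is used as a simple stand-in for the number of possible matching pairs. The names `block_key`, `refine`, `min_members`, and `max_blocks` are hypothetical.

```python
import hashlib


def block_key(parent_key, attribute, value):
    """Derive a child block key from the parent key and the next attribute."""
    raw = f"{parent_key}|{attribute}={value}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def refine(blocks, attribute, min_members, max_blocks):
    """Split each block by `attribute` unless a stopping condition holds.

    A block is left as it is when it already has at most `min_members`
    entities, or when splitting it would push the total number of blocks
    above `max_blocks`.
    """
    result = dict(blocks)
    for key, members in blocks.items():
        if len(members) <= min_members:
            continue  # small enough, stop dividing this block
        children = {}
        for entity in members:
            child = block_key(key, attribute, entity.get(attribute))
            children.setdefault(child, []).append(entity)
        if len(result) - 1 + len(children) > max_blocks:
            continue  # block budget would be exceeded, keep the parent
        del result[key]
        result.update(children)
    return result


# Made-up first-level blocks keyed by country.
country_blocks = {
    "US": [{"state": "CA"}, {"state": "CA"}, {"state": "NY"}],
    "CN": [{"state": None}],
}
refined = refine(country_blocks, "state", min_members=1, max_blocks=10)
capped = refine(country_blocks, "state", min_members=1, max_blocks=2)
```

Because the child key is derived from the full path of attribute values, two entities land in the same sub-partition only if they agree on every level of the partitioning.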

The clustering-based partitioning algorithm can be implemented using the already trained clustering model, and the interface form is as follows:

The clustering model directly predicts the class to which the current entity belongs. In this case, the number of partitions equals the number of classes in the clustering model. The resulting partitions can, of course, be divided further on top of the clustering.
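The clustering-based partitioning can be illustrated with a toy stand-in for a trained model. The `NearestCentroidModel` class below is a hypothetical placeholder that assigns an entity to the closest of a few fixed centroids; a real deployment would load an actual trained clustering model instead. The `featurize` function and the sample data are also assumptions.

```python
class NearestCentroidModel:
    """Toy stand-in for an already trained clustering model."""

    def __init__(self, centroids):
        self.centroids = centroids

    def predict(self, vector):
        """Return the index of the centroid closest to `vector`."""
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(vector, centroid))
        return min(range(len(self.centroids)),
                   key=lambda i: squared_distance(self.centroids[i]))


def cluster_blocks(entities, model, featurize):
    """Place each entity in the block of its predicted cluster."""
    blocks = {}
    for entity in entities:
        blocks.setdefault(model.predict(featurize(entity)), []).append(entity)
    return blocks


# Made-up data: cluster companies by head count around two fixed centroids.
model = NearestCentroidModel([(0.0,), (100.0,)])
companies = [{"id": "a", "employees": 3},
             {"id": "b", "employees": 5},
             {"id": "c", "employees": 120}]
clustered = cluster_blocks(companies, model, lambda e: (float(e["employees"]),))
```

The number of blocks produced is bounded by the number of classes in the model, which matches the statement above.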

3. Entity matching stage (MatchStage): for candidate entity pairs in the same partition, different weights can be set for the attributes of the entity itself and the attributes of the entities associated with it, and the overall similarity of each candidate entity pair is calculated as a weighted sum; candidate entity pairs whose similarity exceeds a given threshold are selected as matching entity pairs.

It should be noted that this process design allows rules based on strong associations to be inserted so that matching is completed directly. For example, if company records in the two data sources are both listed companies and their stock codes are exactly the same, they are matched directly and the similarity calculation is skipped, which reduces the computational complexity of the matching stage.

When a verification data set is provided, the results generated by the matching algorithm can be compared with it to verify the accuracy of the matching algorithm. By adjusting the attributes, the weight parameters, and the similarity threshold and comparing the results over multiple runs, accuracy can be improved continuously. For example, two company entities are compared by the weighted sum of the similarity of their names and of their stock symbols. If the name is expressed in different languages in different data sources, the name similarity will be low, so its weight needs to be lowered, while the weight of the stock symbol similarity should be set relatively higher.

The entity matching algorithm of this application can adjust the parameters for multiple iterations to improve the accuracy of the matching results.
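The weighted-sum matching and the strong-association shortcut described above might look like the following sketch. The string similarity uses Python's standard `difflib.SequenceMatcher`, which is a choice made here and not something the application specifies; the field names, weights, and threshold are likewise illustrative assumptions.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_similarity(e1, e2, weights):
    """Weighted sum of per-attribute similarities, normalised by total weight."""
    total = sum(weights.values())
    score = sum(w * text_similarity(e1.get(attr), e2.get(attr))
                for attr, w in weights.items())
    return score / total


def is_match(e1, e2, weights, threshold):
    """Apply the strong rule first, then fall back to the weighted score."""
    code1, code2 = e1.get("stock_code"), e2.get("stock_code")
    if code1 and code1 == code2:
        return True  # identical stock codes: match directly, skip scoring
    return overall_similarity(e1, e2, weights) > threshold


# Made-up weights and records.
weights = {"name": 0.7, "city": 0.3}
listed_a = {"name": "Alibaba Group", "stock_code": "BABA"}
listed_b = {"name": "alibaba", "stock_code": "BABA"}
plain_a = {"name": "Alibaba", "city": "Hangzhou"}
plain_b = {"name": "alibaba", "city": "Hangzhou"}
other = {"name": "Zeta", "city": "Paris"}
```

Tuning then amounts to changing `weights` and `threshold` and re-running the comparison against a verification set, as the surrounding text describes. Attributes of related entities could be added as further weighted keys.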

Some configuration parameters of the entity matching stage (MatchStage) are shown in the following table.

Through a custom entity matching algorithm, you can compare whether two entities point to the same knowledge representation. The interface is as follows:

In the above example, a pre-trained machine learning binary classification model takes the attribute similarity vector of the two entities as input and infers the probability that they can be classified as the same entity (1 meaning yes).

Finally, the matched entity pairs are output to the result set.
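To make the classifier-based variant concrete, the sketch below scores an attribute similarity vector with a logistic function. It is only an illustration: the coefficients and intercept are hand-picked placeholders rather than the output of any training, and the names `similarity_vector` and `match_probability` are hypothetical.

```python
import math


def similarity_vector(e1, e2, attributes):
    """1.0 where both entities hold the same non-missing value, else 0.0."""
    return [
        1.0 if e1.get(a) is not None and e1.get(a) == e2.get(a) else 0.0
        for a in attributes
    ]


def match_probability(vector, coefficients, intercept):
    """Logistic score: probability that the pair is the same entity."""
    z = intercept + sum(c * x for c, x in zip(coefficients, vector))
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder parameters; a real system would load a trained model instead.
ATTRIBUTES = ["country", "stock_code"]
COEFFICIENTS = [2.0, 3.0]
INTERCEPT = -2.5

same = match_probability(
    similarity_vector({"country": "CN", "stock_code": "BABA"},
                      {"country": "CN", "stock_code": "BABA"}, ATTRIBUTES),
    COEFFICIENTS, INTERCEPT)
different = match_probability(
    similarity_vector({"country": "CN", "stock_code": "BABA"},
                      {"country": "US", "stock_code": None}, ATTRIBUTES),
    COEFFICIENTS, INTERCEPT)
```

A pair would be written to the result set when its probability is above a chosen cut-off such as 0.5.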

4. Entity fusion stage (MergeStage): The data in different data sources that actually point to the same entity are supplemented, replaced, and normalized according to the fusion algorithm, and the unified entity representation is finally generated.

A custom fusion algorithm is generally required; the interface form is as follows:

Data fusion can be achieved by combining different business rules. For example, multiple aliases can be kept for a name, and standardized formats can be applied to e-mail addresses and postal addresses. Missing attribute data can be filled in by crawlers or manually to build high-quality data that supports search and analysis of the knowledge graph.
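A few such business rules can be combined as in the sketch below. The function `merge_entities`, the `aliases` field, and the e-mail normalisation rule are assumptions chosen for illustration; they are not the fusion interface referred to in the text.

```python
def merge_entities(primary, secondary, alias_fields=("name",)):
    """Merge two records that refer to the same entity.

    Rules in this sketch: fill values missing from `primary` using
    `secondary`, keep conflicting names as aliases, and normalise e-mail.
    """
    merged = dict(primary)
    aliases = set(primary.get("aliases", []))
    for field, value in secondary.items():
        if field == "aliases":
            aliases.update(value)
        elif merged.get(field) is None:
            merged[field] = value          # supplement a missing value
        elif field in alias_fields and value != merged[field]:
            aliases.add(value)             # keep the other spelling as an alias
    if merged.get("email"):
        merged["email"] = merged["email"].strip().lower()
    merged["aliases"] = sorted(aliases)
    return merged


# Made-up matched pair.
unified = merge_entities(
    {"name": "Alibaba", "email": " Info@Example.COM ", "address": None},
    {"name": "alibaba", "address": "Hangzhou"},
)
```

Values still missing after the merge are the ones that would be handed to a crawler or to manual completion.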

In a further embodiment, in addition to the stages defined above, stages of different functions (e.g., data processing stage) can also be arranged. The following forms of interface can be used:

The data to be processed is passed in through the input configuration parameters; after processing is completed, the result is written to the output and passed to the next stage, which extends the functionality of the system.

Through the above means, this application realizes a general pipeline (Pipeline) for entity fusion in a big data scenario. The pipeline is composed of multiple stages (Stage); each stage can be flexibly extended through configuration, and custom stages (CustomStage) can be added to the pipeline to adapt to different application scenarios. Except for the data preprocessing stage (InputStage), which has only an output, every stage has an input configuration. The input configuration can specify the entity list, relationship list, data addresses, and related metadata (schema information such as table names and column names) from the different data sources that the stage needs at run time. Each stage reads its input data, runs its algorithm, writes the result to the data platform, and emits all data addresses and metadata through its output. The stages can therefore be run in series by chaining outputs to inputs, or run individually by specifying input parameters.
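The pipeline idea above can be sketched with an in-memory stand-in for the data platform. Everything here is a hypothetical illustration: the `DataPlatform`, `Stage`, `InputStage`, and `UppercaseObjects` classes, the `mem://` address scheme, and the shape of the index dictionary are assumptions, not the interfaces of the application.

```python
class DataPlatform:
    """In-memory stand-in for a data platform with one access interface."""

    def __init__(self):
        self._store = {}

    def write(self, name, data, schema):
        """Store data and return index information (address plus metadata)."""
        address = f"mem://{name}"
        self._store[address] = data
        return {"address": address, "schema": schema, "count": len(data)}

    def read(self, index):
        return self._store[index["address"]]


class Stage:
    name = "stage"

    def run(self, platform, inputs):
        raise NotImplementedError


class InputStage(Stage):
    """Has no input: it only writes triples and returns their index."""
    name = "input"

    def __init__(self, triples):
        self.triples = triples

    def run(self, platform, inputs):
        return platform.write(self.name, self.triples,
                              ["subject", "property", "object"])


class UppercaseObjects(Stage):
    """A custom stage: reads its input, transforms it, writes new output."""
    name = "uppercase"

    def run(self, platform, inputs):
        data = [(s, p, str(o).upper()) for s, p, o in platform.read(inputs)]
        return platform.write(self.name, data, inputs["schema"])


def run_pipeline(platform, stages):
    """Run stages in series, feeding each stage the previous stage's output."""
    index = None
    for stage in stages:
        index = stage.run(platform, index)
    return index


platform = DataPlatform()
final_index = run_pipeline(
    platform, [InputStage([("company:1", "name", "alibaba")]), UppercaseObjects()]
)
```

Because a stage only sees an index dictionary, it can also be run on its own by passing it the index of data already stored on the platform, which mirrors the decoupling described above.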

Referring to FIG. 2, the flow of the second embodiment of the data fusion method for a knowledge graph of the present application is shown. It differs from the first method embodiment in that an attribute alignment stage (Attribute Matching) is added between the data preprocessing stage and the entity partitioning stage. This stage aligns the preprocessed entities from multiple data sources stored in the data platform according to the actual meaning of their attributes. For example, the "Address" field of data source A is aligned with the "Address" field of data source B, and the aligned fields are treated as fields with the same meaning in the subsequent partitioning and matching stages.

In the specific implementation, the actual meaning of the entity attribute can be set manually, or can be realized by setting an attribute meaning comparison table in the system, which is not limited in this application.
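An attribute meaning comparison table of the kind mentioned above could be as simple as a per-source mapping from field names to shared names. The table contents, source names, and the `align` helper below are made up for illustration.

```python
# Hypothetical comparison table: source field name -> shared attribute name.
ALIGNMENT_TABLE = {
    "source_a": {"Address": "address", "Name": "name"},
    "source_b": {"addr": "address", "company_name": "name"},
}


def align(entity, source, table):
    """Rename an entity's fields to the shared names; keep unknown fields."""
    mapping = table.get(source, {})
    return {mapping.get(field, field): value for field, value in entity.items()}


aligned_a = align({"Name": "Alibaba", "Address": "Hangzhou"}, "source_a", ALIGNMENT_TABLE)
aligned_b = align({"company_name": "alibaba", "addr": "Hangzhou", "phone": "n/a"},
                  "source_b", ALIGNMENT_TABLE)
```

After alignment both records expose `name` and `address`, so the partitioning and matching stages can compare them field by field.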

The present application also discloses a storage medium on which the program for executing the above method is recorded. The storage medium includes any mechanism configured to store or transmit information in a form readable by a machine (for example, a computer). For example, storage media include read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals).

Referring to FIG. 3, a structural block diagram of an embodiment of a data fusion device for knowledge graphs of the present application is shown, including a data platform 10, a data preprocessing module 11, an entity partitioning module 12, an entity matching module 13, and an entity fusion module 14, wherein:

The data platform 10 is configured with a unified access interface to provide computing and storage services for other modules. This application has no restrictions on the architecture of the data platform. In order to facilitate the expansion of computing and storage resources when the amount of data increases, you can use the Hadoop distributed file system or cloud computing architecture.

The data pre-processing module 11 is used to process the data from different data sources and convert it into a triplet (SPO) format, store it to the data platform 10 through the unified access interface, and receive the graph data index information returned by the data platform 10 . The graph data index information may be the storage address of the graph data in the triplet format in the data platform 10 and its metadata.

The entity partitioning module 12 is used to divide, through the unified access interface and according to the graph data index information, the entities stored in the data platform 10 into one or more sub-partitions by attribute. In a specific implementation, the entity partitioning module 12 may include an equal-value partitioning sub-module that divides the entities stored in the data platform by a globally unique partition key generated from entity attributes, a clustering partitioning sub-module that divides the entities stored in the data platform based on a preset clustering model, and/or sub-modules for other partitioning methods.

The entity matching module 13 is configured to perform similarity calculation on candidate entity pairs divided into the same sub-partition, and filter out matching entity pairs that meet a preset similarity condition.

The entity fusion module 14 is used to supplement and/or replace entity attribute values of the matching entity pairs to generate a unified entity representation.

The functional modules of the device embodiment of the present application have upstream and downstream dependencies within the pipeline, but they constrain one another only through the data format and are decoupled through the unified interface provided by the data platform, so each module can be developed independently and its algorithm can be flexibly replaced. By implementing a custom stage, new modules can be inserted between existing modules, so that custom requirements can be freely orchestrated. For example, to improve adaptability to various data sources and the accuracy of subsequent entity partitioning, matching, and fusion, an attribute alignment module 15 may be inserted between the data preprocessing module 11 and the entity partitioning module 12. It aligns, according to the actual meaning of their attributes, the entities that are stored in the data platform 10 after data from multiple data sources has been processed by the data preprocessing module 11. For example, if the "Address" field of data source A is aligned with the "Address" field of data source B, the aligned fields are treated as fields with the same meaning in the subsequent partitioning and matching stages.

In a further preferred device embodiment, the entity matching module 13 may specifically include a similarity calculation sub-module and a comparison sub-module; the similarity calculation sub-module is used to set different weights for the attributes of the entity itself and the attributes of other entities related to the entity, and to calculate the overall similarity of each candidate entity pair as a weighted sum; the comparison sub-module is used to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds the preset similarity threshold and, if so, to treat the candidate entity pair as a matching entity pair.

In another preferred device embodiment, the device may further include a data processing module for processing node entity data and edge entity data in the data platform through the unified access interface and passing the data processing result to the next module.

The above data processing module can be implemented in the following forms:

Among them, the data to be processed is transmitted through the input configuration parameters, and after the data processing is completed, the result is written to the output and passed to the function module in the next stage to realize the expansion of the device function.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the device embodiments of the present application are basically similar to the method embodiments, they are described relatively briefly, and the relevant parts can be found in the description of the method embodiments. The device embodiments described above are only schematic: the modules described as separate components may or may not be physically separated, and may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

This document uses specific examples to explain the principle and implementation of this application. The descriptions of the above examples are only intended to help in understanding the method and core ideas of this application. At the same time, those of ordinary skill in the art may, following the ideas of this application, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be understood as a limitation on this application.

Claims

1. A data fusion method for a knowledge graph, wherein the knowledge graph system that executes the method includes a data platform configured with a unified access interface, the method comprising:

processing data from different data sources, converting it into a triple format, storing it on the data platform through the unified access interface, and receiving graph data index information returned by the data platform;

dividing, according to the graph data index information, the entities stored in the data platform into one or more sub-partitions by attribute;

performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting matching entity pairs that meet a preset similarity condition; and

supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.

2. The method according to claim 1, wherein before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further includes: aligning, according to the actual meaning of their attributes, the entities that are stored in the data platform after the data has been converted into the triple format.

3. The method according to claim 1, wherein the sub-partitions are divided either by equal-value partitioning based on a globally unique partition key generated from entity attributes, or based on a preset clustering model.

4. The method according to claim 1, wherein performing similarity calculation on the candidate entity pairs that fall into the same sub-partition and selecting the matching entity pairs that meet the preset similarity condition specifically includes:

setting different weights for the attributes of the entity itself and the attributes of other entities related to the entity, and calculating the overall similarity of each candidate entity pair as a weighted sum; and

if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, treating the candidate entity pair as a matching entity pair.

5. The method according to claim 1, wherein missing entity attribute values are supplemented by obtaining them from the network through a crawler or by manual filling.

6. The method according to claim 1, wherein the graph data index information is the storage address, on the data platform, of the graph data in the triple format, together with its metadata.

7. A data fusion device for a knowledge graph, including a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, wherein:

the data platform includes a unified access interface;

the data preprocessing module is configured to process data from different data sources, convert it into a triple format, store it on the data platform through the unified access interface, and receive graph data index information returned by the data platform;

the entity partitioning module is configured to divide the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module;

the entity matching module is configured to perform similarity calculation on the candidate entity pairs that the entity partitioning module has placed in the same sub-partition, and to select matching entity pairs that meet a preset similarity condition; and

the entity fusion module is configured to supplement and/or replace the entity attribute values of the matching entity pairs selected by the entity matching module to generate a unified entity representation.

8. The device according to claim 7, wherein:

the entity partitioning module includes an equal-value partitioning submodule and/or a clustering partitioning submodule;

the equal-value partitioning submodule is configured to partition the entities stored in the data platform by equal value of a globally unique partition key generated from entity attributes;

the clustering partitioning submodule is configured to partition the entities stored in the data platform based on a preset clustering model;

the entity matching module specifically includes a similarity calculation submodule and a comparison submodule;

the similarity calculation submodule is configured to set different weights for the attributes of the entity itself and the attributes of other entities related to the entity, and to calculate the overall similarity of each candidate entity pair as a weighted sum; and

the comparison submodule is configured to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to treat the candidate entity pair as a matching entity pair.

9. The device according to claim 7, wherein the device further includes a data processing module and/or an attribute alignment module;

the data processing module is configured to process node entity data and edge entity data in the data platform through the unified access interface and to pass the data processing result to the next module; and

the attribute alignment module is configured to align, according to the actual meaning of their attributes, the entities that are stored in the data platform after data from multiple data sources has been processed by the data preprocessing module.

10. A storage medium storing a program configured to perform the method according to any one of claims 1 to 6.