WO2020135048A1 - Data merging method and apparatus for knowledge graph - Google Patents

Publication number
WO2020135048A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
entity
module
matching
attributes
Prior art date
Application number
PCT/CN2019/124552
Other languages
French (fr)
Chinese (zh)
Inventor
刘涛
朱宏明
顾江
姜逸之
王晓文
周游
Original Assignee
颖投信息科技(上海)有限公司
Application filed by 颖投信息科技(上海)有限公司
Publication of WO2020135048A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models

Definitions

  • This application relates to the technical field of knowledge graphs, and in particular, to a data fusion method and device for knowledge graphs.
  • A knowledge graph is a huge semantic network graph that describes the various entities or concepts in the real world and the relationships between them. Its nodes represent entities or concepts, and its edges are formed by attributes or relationships.
  • An entity is something distinguishable that exists independently, such as a country, a company, or a person. An attribute is an inherent characteristic of an entity. For example, countries have attributes such as "population" and "area" (as shown in Figure 4), and companies have attributes such as "name" and "legal representative".
  • A relationship is an association between one entity and another. For example, a company is registered in a country, and a person is employed by a company.
  • The nodes and edges of a knowledge graph are generally defined as triples (SPO, Subject-Property-Object), in forms such as (entity 1-relation-entity 2) and (entity-attribute-attribute value). A knowledge graph can be represented as a collection of triples, its data model can take the form of a graph (as shown in Figure 4), and a graph database is used to store and manage the data.
  • Existing data fusion solutions generally include the main steps of partition indexing, similarity calculation, and entity fusion. In a specific implementation, however, the partitioning algorithm, similarity matching algorithm, and entity alignment algorithm are selected according to the characteristics of the data source and the knowledge base, and these choices are integrated into one complete system. When the scope of the data source or knowledge base changes, the data fusion system has to be rebuilt to meet the new requirements.
  • The present application provides a data fusion method and device for knowledge graphs, to solve the problem that existing data fusion technology cannot flexibly adapt to the data fusion of different knowledge bases.
  • In the data fusion method for knowledge graphs disclosed in the present application, the system that performs the method includes a data platform configured with a unified access interface.
  • The method includes: processing data from different data sources and converting it into triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; dividing the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs divided into the same sub-partition, and selecting matching entity pairs that meet a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.
  • Before the partitioning step, the method further includes: aligning, according to the actual meaning of their attributes, the entities stored in the data platform after data from multiple data sources has been converted into triple format.
  • The sub-partitions are divided either by equality on a globally unique partition key generated from entity attributes, or by a preset clustering model.
  • Performing similarity calculation on the candidate entity pairs divided into the same sub-partition and selecting the matching entity pairs that meet the preset similarity condition specifically means: different weights are set for the attributes of the entity itself and for the attributes of other entities related to that entity, and the overall similarity of a candidate entity pair is computed as the weighted sum; if the overall similarity of a candidate entity pair in the same sub-partition exceeds the preset similarity threshold, the candidate entity pair is taken as a matching entity pair.
  • Missing entity attribute values are supplemented by fetching them from the network with a crawler or by filling them in manually.
  • The graph data index information is the storage address of the triple-format graph data on the data platform, together with its metadata.
  • The data fusion device for knowledge graphs disclosed in this application includes a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, wherein: the data platform is configured with a unified access interface; the data preprocessing module is configured to process data from different data sources and convert it into triple format, store it on the data platform through the unified access interface, and receive the graph data index information returned by the data platform; the entity partitioning module is configured to divide the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module; the entity matching module is configured to perform similarity calculation on the candidate entity pairs that the entity partitioning module has divided into the same sub-partition, and to select matching entity pairs that meet a preset similarity condition; the entity fusion module is configured to supplement and/or replace the entity attribute values of the matching entity pairs selected by the entity matching module, to generate a unified entity representation.
  • The entity partitioning module includes an equivalent partitioning submodule and/or a clustering partitioning submodule; the equivalent partitioning submodule is configured to divide the entities stored in the data platform by equality on a globally unique partition key generated from entity attributes; the clustering partitioning submodule is configured to divide the entities stored in the data platform based on a preset clustering model.
  • The entity matching module specifically includes a similarity calculation submodule and a comparison submodule; the similarity calculation submodule is configured to set different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and to compute the overall similarity of a candidate entity pair as the weighted sum; the comparison submodule is configured to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, and if so, to take the candidate entity pair as a matching entity pair.
  • The device further includes a data processing module and/or an attribute alignment module.
  • The data processing module is configured to process node entity data and edge entity data in the data platform through the unified access interface, and to pass the data processing result to the next module.
  • The attribute alignment module is configured to align, according to the actual meaning of their attributes, the entities stored in the data platform after data from multiple data sources has been processed by the data preprocessing module.
  • the present application also discloses a storage medium on which a program configured to execute the above method is recorded.
  • The stages in the preferred embodiment of the present application have upstream and downstream dependencies within the pipeline, but different stages are constrained only by data format and are decoupled from each other through the unified interface provided by the data platform, so they can be developed independently.
  • The algorithm of each stage can be flexibly replaced.
  • New processing stages can be inserted between existing stages to freely orchestrate custom requirements.
  • This application places no restrictions on the architecture of the data platform. For example, a Hadoop distributed file system or a cloud computing architecture may be used, which makes it easier to expand computing and storage resources as the amount of data grows.
  • FIG. 1 is a schematic flowchart of a first embodiment of a data fusion method for knowledge graph of the application
  • FIG. 2 is a schematic flowchart of a second embodiment of a data fusion method for knowledge graph of the application
  • FIG. 3 is a schematic structural diagram of an embodiment of a data fusion device for knowledge graphs of the application
  • Figure 4 is a schematic diagram of the graph data model of the knowledge graph.
  • first and second are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features.
  • the features defined as “first” and “second” may explicitly or implicitly include one or more of the features.
  • the meaning of “plurality” is two or more, unless specifically defined otherwise.
  • The terms “including”, “comprising” and similar terms should be understood as open terms, i.e. “including but not limited to”.
  • The term “based on” means “based at least in part on.”
  • The term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one other embodiment”.
  • Related definitions of other terms will be given in the description below.
  • The system implementing this method embodiment is provided with a data platform that supplies an operating environment and computing resources for each stage, and the stages interact with one another through the data platform's unified access interface.
  • The data platform can be built on the Hadoop distributed file system, on a cloud computing architecture (such as Amazon AWS EMR), or on other architectures; this application places no limitation on this.
  • the method embodiment includes the following stages:
  • Data preprocessing stage: data from multiple homogeneous or heterogeneous data sources (such as structured data A and unstructured data B) is processed into the same entity and attribute format (SPO format) as input to the subsequent stages.
  • Data is extracted from the data source, cleaned, transformed, and stored on the data platform in a unified data format.
  • For a relational database data source, the required SPO data can be extracted by configuring the connection information, the entity types and entity tables, and the relationship types and relationship tables.
  • nodes (entity-attribute-attribute value)
  • edges (entity-relationship-entity)
  • the data can be parsed and stored after reading the remote data.
  • you can call machine learning interfaces, network interfaces, etc. to complete knowledge extraction, save it as triple information, and return the address and metadata information of the saved data.
  • Entity partitioning stage: entities from multiple data sources are divided into different sub-partitions (blocks) according to their attributes, to reduce the number of candidate matching pairs.
  • If data source S contains m entities and data source T contains n entities, then m*n entity pairs need to be checked for a match.
  • In a big data scenario, a comparison of this scale is practically infeasible, so the number of pairs that need to be matched must be reduced.
  • Entity pairs from the two data sources that are unlikely to match can be assigned to different data partitions in advance, so that the amount of data in each partition is greatly reduced and multiple partitions can be processed in parallel.
  • the partitioning method of the entity partitioning stage (BlockingStage) can be extended through a custom partitioning algorithm, for example, through the following interface form:
  • A globally unique block key can be generated from the attributes of the current entity's partition and of the next partition, and used to divide the data into the next partition. When the number of possible matching entity pairs in a partition reaches the minimum value, or the total number of partitions reaches the maximum value, the partition is not divided further.
  • the clustering-based partitioning algorithm can be implemented using the already trained clustering model, and the interface form is as follows:
  • the clustering model can directly predict the current entity and correspond to a certain class. At this time, the number of partitions is equal to the number of classes in the clustering model. Of course, you can continue to divide the partition on the basis of clustering.
  • Entity matching stage: for candidate entity pairs in the same partition, different weights can be set for the attributes of the entity itself and for the attributes of the entities associated with it, and the overall similarity of the candidate entity pair is calculated as the weighted sum; candidate entity pairs whose similarity exceeds a certain threshold are selected as matching entity pairs.
  • This process design also allows rules based on strong associations to be inserted so that a match is made directly. For example, for company data in two data sources, if both are listed companies and their stock codes are exactly the same, they are matched directly and the similarity calculation is skipped, which reduces the computational complexity of the matching stage.
  • the results generated by the matching algorithm can be compared with the verification data set to verify the accuracy of the matching algorithm.
  • The calculation results are compared over multiple rounds to continuously improve accuracy. For example, two company entities are compared by the weighted sum of the similarity of their names and of their stock symbols. If the name is expressed in different languages in different data sources, the name similarity will be low, so its weight needs to be lowered, while the weight of the stock symbol similarity should be set relatively higher.
  • The entity matching algorithm of this application can adjust its parameters over multiple iterations to improve the accuracy of the matching results.
  • Alternatively, a pre-trained machine learning binary classification model is used, with the attribute similarity vector of the two entities as input, to infer the probability that they can be classified as the same entity (1 if yes).
  • the last matched entity pair will be output to the result set.
  • Entity fusion stage (MergeStage): data in different data sources that actually refers to the same entity is supplemented, replaced, and normalized according to the fusion algorithm, and a unified entity representation is finally generated.
  • Data fusion can be achieved by combining different business rules. For example, multiple anonymous names can be set, and standardized formats can be used for mailboxes and addresses.
  • Missing attribute data can be filled in by crawlers or manually to build high-quality data, which makes the knowledge graph easier to search and analyze.
  • stages of different functions can also be arranged.
  • the following forms of interface can be used:
  • The data to be processed is passed in through the input configuration parameters; after processing is complete, the result is written to the output and passed to the next stage, which allows the functions of the system to be extended.
  • This application realizes a general pipeline (Pipeline) for entity fusion in a big data scenario through the above-mentioned means.
  • The pipeline is composed of multiple stages (Stage); each stage can be flexibly extended through configuration, and custom stages (CustomStage) can be arranged into the pipeline to suit different application scenarios.
  • The input configuration can specify the entity lists, relationship lists, data addresses, and related metadata (the schema, including table names, column names, etc.) from the different data sources that this stage needs to read while it runs.
  • The stage reads the input data, runs its algorithm, writes the result to the data platform, and reports all data addresses and metadata through its output. The stages can therefore be chained through their inputs and outputs, or run individually by specifying input parameters.
  • In the second embodiment, an attribute alignment stage (Attribute Matching) is added between the data preprocessing stage and the entity partitioning stage. It aligns the preprocessed entities from multiple data sources stored in the data platform according to the actual meaning of their attributes. For example, if the "Address" field of data source A is aligned with the "Address" field of data source B, the aligned fields are treated as fields with the same meaning in the subsequent partitioning and matching stages.
  • the actual meaning of the entity attribute can be set manually, or can be realized by setting an attribute meaning comparison table in the system, which is not limited in this application.
  • the present application also discloses a storage medium on which the program for executing the above method is recorded.
  • The storage medium includes any mechanism configured to store or transfer information in a form readable by a machine (for example, a computer).
  • For example, storage media include read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustic, or other forms of propagated signals (e.g. carrier waves, infrared signals, digital signals).
  • Referring to FIG. 3, a structural block diagram of an embodiment of the data fusion device for knowledge graphs of the present application is shown, including a data platform 10, a data preprocessing module 11, an entity partitioning module 12, an entity matching module 13, and an entity fusion module 14, wherein:
  • the data platform 10 is configured with a unified access interface to provide computing and storage services for other modules.
  • This application has no restrictions on the architecture of the data platform. In order to facilitate the expansion of computing and storage resources when the amount of data increases, you can use the Hadoop distributed file system or cloud computing architecture.
  • The data preprocessing module 11 is used to process data from different data sources and convert it into triple (SPO) format, store it on the data platform 10 through the unified access interface, and receive the graph data index information returned by the data platform 10.
  • The graph data index information may be the storage address of the triple-format graph data on the data platform 10, together with its metadata.
  • The entity partitioning module 12 is used to divide the entities stored in the data platform 10 into one or more sub-partitions by attribute, through the unified access interface and according to the graph data index information.
  • The entity partitioning module 12 may include an equivalent partitioning submodule that divides the entities stored in the data platform by equality on a globally unique partition key generated from entity attributes, a clustering partitioning submodule that divides the entities stored in the data platform based on a preset clustering model, and/or submodules for other partitioning methods.
  • the entity matching module 13 is configured to perform similarity calculation on candidate entity pairs divided into the same sub-partition, and filter out matching entity pairs that meet a preset similarity condition.
  • the entity fusion module 14 is used to supplement and/or replace entity attribute values of the matching entity pairs to generate a unified entity representation.
  • The functional modules of the device embodiment of the present application have upstream and downstream dependencies within the pipeline, but different modules are constrained only by data format and are decoupled from each other through the unified interface provided by the data platform, so they can be developed independently.
  • The algorithm of each module can be flexibly replaced.
  • New modules can be inserted between existing modules to freely orchestrate custom requirements.
  • For example, an attribute alignment module 15 may be inserted between the data preprocessing module 11 and the entity partitioning module 12, to align, according to the actual meaning of their attributes, the entities stored in the data platform 10 after data from multiple data sources has been processed by the data preprocessing module 11. For example, if the "Address" field of data source A is aligned with the "Address" field of data source B, the aligned fields are treated as fields with the same meaning in the subsequent partitioning and matching stages.
  • The entity matching module 13 may specifically include a similarity calculation submodule and a comparison submodule; the similarity calculation submodule is used to set different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and to compute the overall similarity of a candidate entity pair as the weighted sum; the comparison submodule is used to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds the preset similarity threshold, and if so, to take the candidate entity pair as a matching entity pair.
  • The device may further include a data processing module for processing node entity data and edge entity data in the data platform through the unified access interface, and passing the data processing result to the next module.
  • the above data processing module can be implemented in the following forms:
  • the data to be processed is transmitted through the input configuration parameters, and after the data processing is completed, the result is written to the output and passed to the function module in the next stage to realize the expansion of the device function.
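The interface listings referred to in the excerpts above did not survive in this copy, so the following is only a minimal sketch of how stages that exchange data addresses and metadata through an input and an output might be chained. The names `StageIO`, `run_pipeline`, and `tagging_stage`, and the address strings, are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StageIO:
    """Addresses and metadata handed from one stage to the next (hypothetical shape)."""
    entity_addresses: List[str] = field(default_factory=list)
    relation_addresses: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


# A stage reads its input description and returns the description of what it wrote.
Stage = Callable[[StageIO], StageIO]


def run_pipeline(stages: List[Stage], initial: StageIO) -> StageIO:
    """Run the stages in series: the output of each stage is the input of the next."""
    current = initial
    for stage in stages:
        current = stage(current)
    return current


def tagging_stage(tag: str) -> Stage:
    """Build a placeholder custom stage that only records where its result would be stored."""
    def stage(inp: StageIO) -> StageIO:
        return StageIO(
            entity_addresses=[f"{addr}.{tag}" for addr in inp.entity_addresses],
            relation_addresses=list(inp.relation_addresses),
            metadata={**inp.metadata, "last_stage": tag},
        )
    return stage


result = run_pipeline(
    [tagging_stage("blocked"), tagging_stage("matched")],
    StageIO(entity_addresses=["/data/entities"]),
)
```

Because each stage depends only on the shape of `StageIO`, a single stage can also be run on its own by constructing its input by hand, which mirrors the decoupling the text describes.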

Abstract

A data merging method and apparatus for a knowledge graph. A system for implementing the method comprises a data platform configured with a unified access interface. The method comprises: processing data from different data sources and then converting same to a subject-property-object format, storing same in the data platform by means of the unified access interface, and receiving graph data index information returned by the data platform; according to the graph data index information, dividing subjects stored in the data platform into one or more sub-blocks according to the attribute; performing similarity calculation on candidate subjects classified into the same sub-block, and screening matching subject pairs that meet a preset similarity condition; and supplementing and/or replacing subject attribute values of the matching subject pairs to generate unified subject representation. By the abovementioned method, the data merging problem that existing data merging techniques cannot flexibly adapt to different knowledge graphs can be effectively solved.

Description

Data fusion method and device for knowledge graph
This application claims priority to Chinese invention patent application No. 201811635696.X, entitled "Data Fusion Method and Device for Knowledge Graph", filed on December 29, 2018, the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of knowledge graphs, and in particular to a data fusion method and device for knowledge graphs.
Background
A knowledge graph is a huge semantic network graph that describes the various entities or concepts in the real world and the relationships between them. Its nodes represent entities or concepts, and its edges are formed by attributes or relationships. The term is now used to refer broadly to various large-scale knowledge bases. An entity is something distinguishable that exists independently, such as a country, a company, or a person. An attribute is an inherent characteristic of an entity. For example, countries have attributes such as "population" and "area" (as shown in Figure 4), and companies have attributes such as "name" and "legal representative". A relationship is an association between one entity and another. For example, a company is registered in a country, and a person is employed by a company.
The nodes and edges of a knowledge graph are generally defined as triples (S-P-O, Subject-Property-Object), in forms such as (entity 1-relation-entity 2) and (entity-attribute-attribute value). A knowledge graph can be represented as a collection of triples, its data model can take the form of a graph (as shown in Figure 4), and a graph database is used to store and manage the data.
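As a rough illustration of the two triple forms, the sketch below keeps attribute triples and relation triples in one list and reads them back per entity. The `Triple` type, the helper functions, the identifiers, and the sample values are all hypothetical and serve only to make the structure concrete.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str               # the entity the statement is about
    prop: str                  # attribute name or relation name
    obj: str                   # attribute value, or the identifier of the related entity
    is_relation: bool = False  # True for (entity 1-relation-entity 2) triples


# Illustrative data only.
graph: List[Triple] = [
    Triple("company:1", "name", "Example Co"),
    Triple("company:1", "legal_representative", "Example Person"),
    Triple("company:1", "registered_in", "country:1", is_relation=True),
    Triple("country:1", "name", "Exampleland"),
]


def attributes_of(entity: str, triples: List[Triple]) -> dict:
    """Collect the (entity-attribute-attribute value) triples of one entity."""
    return {t.prop: t.obj for t in triples if t.subject == entity and not t.is_relation}


def neighbours_of(entity: str, triples: List[Triple]) -> List[Tuple[str, str]]:
    """Collect the (entity 1-relation-entity 2) triples that start at one entity."""
    return [(t.prop, t.obj) for t in triples if t.subject == entity and t.is_relation]
```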
Knowledge in the real world comes from a wide range of sources, which leads to problems such as uneven knowledge quality, duplicated knowledge from different data sources, and missing knowledge base hierarchy. In addition, different data sources may represent the same entity differently. For example, a company entity in Baidu Baike has the name attribute '阿里巴巴' (Alibaba), while a company entity crawled from Google search has the name attribute 'alibaba'. These two entities may refer to the same entity in the real world, so their attributes and extended relationships need to be fused with each other to generate a unique entity node in the knowledge graph, eliminate ambiguity, and produce a high-quality knowledge base.
Existing data fusion solutions generally include the main steps of partition indexing, similarity calculation, and entity fusion. In a specific implementation, however, the partitioning algorithm, similarity matching algorithm, and entity alignment algorithm are selected according to the characteristics of the data source and the knowledge base, and these choices are integrated into one complete system. When the scope of the data source or knowledge base changes, the data fusion system has to be rebuilt to meet the new requirements.
Summary of the invention
The present application provides a data fusion method and device for knowledge graphs, to solve the problem that existing data fusion technology cannot flexibly adapt to the data fusion of different knowledge bases.
The present application discloses a data fusion method for knowledge graphs. The system that performs the method includes a data platform configured with a unified access interface. The method includes: processing data from different data sources and converting it into triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; dividing the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs divided into the same sub-partition, and selecting matching entity pairs that meet a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.
Preferably, before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further includes: aligning, according to the actual meaning of their attributes, the entities stored in the data platform after data from multiple data sources has been converted into triple format.
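One simple way to picture attribute alignment is a lookup table that maps each source's field names onto shared attribute names. The sketch below assumes such a table; `ALIGNMENT_TABLE`, `align_attributes`, the source labels, and the field names are hypothetical and not part of the application.

```python
from typing import Dict, Tuple

# Hypothetical comparison table: (data source, source field) -> shared attribute name.
ALIGNMENT_TABLE: Dict[Tuple[str, str], str] = {
    ("A", "Address"): "address",
    ("B", "Addr"): "address",
    ("A", "Name"): "name",
    ("B", "CompanyName"): "name",
}


def align_attributes(source: str, record: dict, table=ALIGNMENT_TABLE) -> dict:
    """Rename the fields of one entity record to the shared attribute names.

    Fields without an entry in the table keep their original name.
    """
    return {table.get((source, name), name): value for name, value in record.items()}


aligned_a = align_attributes("A", {"Name": "Example Co", "Address": "1 Example Road"})
aligned_b = align_attributes("B", {"CompanyName": "example co", "Addr": "1 Example Rd"})
```

After alignment, both records expose `name` and `address`, so later stages can compare them field by field without knowing which source they came from.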
Preferably, the sub-partitions are divided either by equality on a globally unique partition key generated from entity attributes, or by a preset clustering model.
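The equality-based variant can be sketched as follows: every entity gets a key derived from its attributes, and only entities from the two sources that share a key become candidate pairs. The key used here (entity type plus the first letter of the lower-cased name) is an arbitrary assumption chosen for brevity, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def block_key(entity: dict) -> tuple:
    """Hypothetical partition key: entity type plus first letter of the lower-cased name."""
    return (entity["type"], entity["name"].strip().lower()[:1])


def partition(entities, key_fn=block_key) -> dict:
    """Group entities into sub-partitions by equality on the partition key."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[key_fn(entity)].append(entity)
    return blocks


def candidate_pairs(source_s, source_t, key_fn=block_key):
    """Yield only the cross-source pairs whose entities fall into the same sub-partition."""
    blocks_s = partition(source_s, key_fn)
    blocks_t = partition(source_t, key_fn)
    for key in blocks_s.keys() & blocks_t.keys():
        yield from product(blocks_s[key], blocks_t[key])


source_s = [{"type": "company", "name": "Alibaba"}, {"type": "company", "name": "Baidu"}]
source_t = [{"type": "company", "name": "alibaba group"}, {"type": "company", "name": "Tencent"}]
pairs = list(candidate_pairs(source_s, source_t))
```

In this toy input the unpartitioned comparison would examine 2 × 2 = 4 pairs, while partitioning leaves a single candidate pair. A key this coarse would miss matches whose names start with different letters, so a real key has to be chosen with the data in mind.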
Preferably, performing similarity calculation on the candidate entity pairs divided into the same sub-partition and selecting the matching entity pairs that meet the preset similarity condition specifically comprises: setting different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and computing the overall similarity of a candidate entity pair as the weighted sum; if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, the candidate entity pair is taken as a matching entity pair.
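A minimal sketch of the weighted sum and threshold test follows. It assumes string-valued attributes, uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and represents an attribute of a related entity with a flattened key such as `registered_in.name`. The weights, the threshold, and all names are hypothetical; the application does not prescribe them.

```python
from difflib import SequenceMatcher

# Hypothetical weights: own attributes plus one attribute of a related entity.
WEIGHTS = {"name": 0.3, "stock_code": 0.5, "registered_in.name": 0.2}


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in for any per-attribute measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_similarity(e1: dict, e2: dict, weights=WEIGHTS) -> float:
    """Weighted sum of per-attribute similarities; attributes missing on either side add 0."""
    total = 0.0
    for attr, weight in weights.items():
        v1, v2 = e1.get(attr), e2.get(attr)
        if v1 is not None and v2 is not None:
            total += weight * string_similarity(v1, v2)
    return total


def is_match(e1: dict, e2: dict, threshold: float = 0.8, weights=WEIGHTS) -> bool:
    """A candidate pair matches when its overall similarity exceeds the threshold."""
    return overall_similarity(e1, e2, weights) > threshold


a = {"name": "Alibaba", "stock_code": "BABA", "registered_in.name": "China"}
b = {"name": "alibaba", "stock_code": "BABA", "registered_in.name": "China"}
c = {"name": "Tencent", "stock_code": "0700", "registered_in.name": "China"}
```

Here `is_match(a, b)` is true and `is_match(a, c)` is false. Giving the stock code the largest weight reflects the idea that an identifier is a stronger signal than a name that may be written differently in each source.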
优选地,对缺失的实体属性值进行补充的方法为通过爬虫从网络获取或进行人工填充。Preferably, the method of supplementing the missing entity attribute value is to obtain it from the network through a crawler or perform manual filling.
优选地,所述图数据索引信息为三元组格式的图数据在所述数据平台的存储地址及其元数据。Preferably, the graph data index information is a storage address of the graph data in the triplet format on the data platform and its metadata.
The present application discloses a data fusion device for a knowledge graph, including a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, wherein: the data platform is configured with a unified access interface; the data preprocessing module is configured to process data from different data sources, convert it into a triple format, store it on the data platform through the unified access interface, and receive graph data index information returned by the data platform; the entity partitioning module is configured to divide the entities stored on the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module; the entity matching module is configured to compute the similarity of candidate entity pairs that the entity partitioning module has placed in the same sub-partition, and to select the matching entity pairs that satisfy a preset similarity condition; and the entity fusion module is configured to supplement and/or replace the entity attribute values of the matching entity pairs selected by the entity matching module, generating a unified entity representation.
Preferably, the entity partitioning module includes an equality partitioning submodule and/or a clustering partitioning submodule; the equality partitioning submodule is configured to partition the entities stored on the data platform by equality on a globally unique partition key generated from entity attributes; the clustering partitioning submodule is configured to partition the entities stored on the data platform based on a preset clustering model.
Preferably, the entity matching module specifically includes a similarity calculation submodule and a comparison submodule; the similarity calculation submodule is configured to assign different weights to the attributes of the entity itself and to the attributes of other entities related to that entity, and to compute the overall similarity of a candidate entity pair as a weighted sum; the comparison submodule is configured to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to take that candidate entity pair as a matching entity pair.
Preferably, the device further includes a data processing module and/or an attribute alignment module; the data processing module is configured to process node entity data and edge entity data on the data platform through the unified access interface, and to return the data processing result and pass it to the next module; the attribute alignment module is configured to align the entities that are stored on the data platform after data from multiple data sources has been processed by the data preprocessing module, according to the actual meaning of their attributes.
The present application also discloses a storage medium on which a program configured to execute the above method is recorded.
Compared with the prior art, the present application has the following advantages:
In the preferred embodiments of the present application, the stages have upstream and downstream dependencies along the pipeline, but different stages constrain one another only through data formats and are decoupled through the unified interface provided by the data platform, so each stage can be developed independently. The algorithm of each stage can be replaced flexibly, and by implementing a custom stage, a new processing stage can be inserted between existing stages to meet custom requirements. In addition, the present application places no restriction on the architecture of the data platform; for example, the Hadoop Distributed File System or a cloud computing architecture may be used, which makes it easier to scale computing and storage resources as the data volume grows.
Brief Description of the Drawings
The drawings are provided only to illustrate preferred embodiments and are not to be regarded as limiting the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of a first embodiment of the data fusion method for a knowledge graph of the present application;
FIG. 2 is a schematic flowchart of a second embodiment of the data fusion method for a knowledge graph of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of the data fusion device for a knowledge graph of the present application;
FIG. 4 is a schematic diagram of the graph data model of a knowledge graph.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the drawings and specific embodiments.
In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. "A plurality of" means two or more, unless expressly and specifically limited otherwise. The terms "including", "comprising", and similar terms are to be understood as open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "an embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment". Definitions of other terms are given in the description below.
Referring to FIG. 1, the flow of the first embodiment of the data fusion method for a knowledge graph of the present application is shown. The system that performs this method embodiment is provided with a data platform that supplies the runtime environment and computing resources for each stage, and every stage interacts through the unified access interface of the data platform. In a specific implementation, the data platform may be built on the Hadoop Distributed File System, on a cloud computing architecture (such as Amazon AWS EMR), or on another architecture; the present application places no restriction on this. The method embodiment includes the following stages:
1. Data preprocessing stage (InputStage): data from multiple homogeneous or heterogeneous data sources (for example, structured data A and unstructured data B) is processed into a common format of entities and their attributes (the SPO format), which serves as the input to subsequent stages.
By configuring the data source information and data model for each source, data is extracted from the data source, cleaned, transformed, and then stored on the data platform in a unified data format. For a relational database data source, for example, the required SPO data can be extracted by configuring the connection information, the entity types and entity tables, and the relationship types and relationship tables. For a graph database, nodes (entity-attribute-attribute value) and edges (entity-relationship-entity) already have a natural SPO structure.
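The extraction of SPO data from a relational source can be illustrated with a minimal Python sketch. It is not the interface of the present application: the `Triple` type, the function `rows_to_triples`, and the `entity_type/id` subject naming are assumptions made for illustration, and the rows are assumed to have already been read from an entity table as dictionaries.

```python
from typing import Dict, Iterable, List, NamedTuple


class Triple(NamedTuple):
    """One (subject, predicate, object) statement; the name is illustrative."""
    subject: str
    predicate: str
    obj: str


def rows_to_triples(rows: Iterable[Dict], id_column: str, entity_type: str) -> List[Triple]:
    """Turn every non-empty, non-id column of an entity-table row into one triple."""
    triples = []
    for row in rows:
        subject = f"{entity_type}/{row[id_column]}"
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # the id becomes the subject; empty cells yield no triple
            triples.append(Triple(subject, column, str(value)))
    return triples


company_rows = [{"id": 1, "name": "Acme", "country": "US", "legal_representative": None}]
company_triples = rows_to_triples(company_rows, id_column="id", entity_type="company")
```

Relationship tables could be handled the same way, with the relationship type as the predicate and the two foreign keys as subject and object.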
Some of the configuration parameters of the data preprocessing stage are shown in the following table.
Figure PCTCN2019124552-appb-000001
In a specific implementation, preprocessing for different data sources can be implemented as a custom stage (CustomInputStage), with an interface of the following form:
Figure PCTCN2019124552-appb-000002
Figure PCTCN2019124552-appb-000003
The stage reads the configuration defined in the table above, reads the remote data, and then parses and stores it. For an unstructured data source, for example, it can call machine learning interfaces, network interfaces, and the like to perform knowledge extraction, save the result as triples, and return the address of the saved data together with its metadata.
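The behaviour of such a custom input stage can be sketched as follows. This is a hedged illustration only: `InMemoryPlatform`, its `save` method, the configuration keys `name` and `lines`, and the regular expression standing in for a real knowledge-extraction model are all hypothetical and do not come from the interface shown in the figures.

```python
import re
from dataclasses import dataclass, field


@dataclass
class InMemoryPlatform:
    """Hypothetical stand-in for the data platform's unified access interface."""
    store: dict = field(default_factory=dict)

    def save(self, name, triples):
        address = f"mem://{name}"
        self.store[address] = list(triples)
        # Return the index information: where the data lives plus its metadata.
        return {"address": address, "metadata": {"format": "SPO", "count": len(triples)}}


class CustomInputStage:
    """Illustrative custom input stage for an unstructured text source."""

    # A toy pattern replaces the machine learning extraction step.
    PATTERN = re.compile(r"^(?P<s>.+?) is headquartered in (?P<o>.+?)\.?$")

    def __init__(self, platform, config):
        self.platform = platform
        self.config = config

    def run(self):
        triples = []
        for line in self.config["lines"]:
            found = self.PATTERN.match(line.strip())
            if found:
                triples.append((found["s"], "headquarters", found["o"]))
        return self.platform.save(self.config["name"], triples)


platform = InMemoryPlatform()
stage_config = {"name": "news", "lines": ["Acme Corp is headquartered in Austin.", "No fact here"]}
stage_output = CustomInputStage(platform, stage_config).run()
```

The returned dictionary plays the role of the graph data index information that later stages receive as input.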
2. Entity partitioning stage (BlockingStage): entities from multiple data sources are divided into different sub-partitions (blocks) according to their attributes, to reduce the number of candidate matching pairs.
For data sources S and T that need to be matched, suppose S contains m entities and T contains n entities; the number of pairs that must be checked is then m*n. In a big data scenario a comparison of this size is basically infeasible, so the number of pairs to be matched must be reduced.
In a specific implementation, entity pairs from the two data sources that cannot possibly match can be assigned to different data partitions in advance. This greatly reduces the amount of data in each partition, and the partitions can be processed in parallel.
For example, for the company entities in S and T that need to be matched, entities registered in different countries are generally unlikely to be the same company in the real world, so the data can be divided by the country attribute of the company into more than 220 data partitions (one per country or region). Each partition can be divided further into sub-partitions by identical or similar attributes. For instance, companies in the 'United States' partition can be divided again into new partitions that share the same 'state' attribute. The total number of pairs to be matched then equals the sum over all data partitions, and because all partitions can be processed in parallel in the subsequent computation, the overall matching time is reduced considerably.
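The reduction from m*n comparisons to a sum over partitions can be checked with a short sketch. The helper names `block_by` and `candidate_pair_count` and the sample entities are assumptions for illustration; entities without the blocking attribute fall into a shared `None` block here, which is one possible choice among several.

```python
from collections import defaultdict


def block_by(entities, attribute):
    """Group entities by the value of one attribute (equality blocking)."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity.get(attribute)].append(entity)
    return dict(blocks)


def candidate_pair_count(source, target, attribute):
    """Number of cross-source pairs left after blocking on the attribute."""
    s_blocks = block_by(source, attribute)
    t_blocks = block_by(target, attribute)
    shared_keys = s_blocks.keys() & t_blocks.keys()
    return sum(len(s_blocks[key]) * len(t_blocks[key]) for key in shared_keys)


source_s = [{"name": "Acme", "country": "US"},
            {"name": "Globex", "country": "US"},
            {"name": "Huaxia", "country": "CN"}]
target_t = [{"name": "ACME Inc", "country": "US"},
            {"name": "Nordwind", "country": "DE"}]

full_pairs = len(source_s) * len(target_t)                      # m * n
blocked_pairs = candidate_pair_count(source_s, target_t, "country")
```

In this toy data only the two 'US' entities of S are compared with the one 'US' entity of T, so blocking leaves a third of the original comparisons.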
Some of the configuration parameters of the entity partitioning stage are shown in the following table.
Figure PCTCN2019124552-appb-000004
Figure PCTCN2019124552-appb-000005
In addition, the partitioning method of the entity partitioning stage (BlockingStage) can be extended with a custom partitioning algorithm, for example through an interface of the following form:
Figure PCTCN2019124552-appb-000006
A globally unique partition key (block key) can be generated from the partition that the current entity is in and the attribute used for the next round of partitioning, so that the data is assigned to the next partition. When the number of possible matching entity pairs in a partition falls to the minimum value, or the total number of partitions reaches the maximum value, that partition is not divided further.
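One way to read the block key rule above is sketched below. The key layout (parent key plus `attribute=value`), the `source` field marking which data source an entity came from, and the parameter names `min_pairs` and `max_blocks` are assumptions for illustration. The partition limit is treated as a soft limit: splitting stops once it is reached, but a single split may overshoot it.

```python
def block_key(parent_key, entity, attribute):
    """Globally unique key: the parent partition's key plus the next attribute value."""
    return f"{parent_key}|{attribute}={entity.get(attribute, '')}"


def pair_count(block):
    """Possible cross-source pairs inside one block of entities from S and T."""
    from_s = sum(1 for entity in block if entity["source"] == "S")
    return from_s * (len(block) - from_s)


def partition(entities, attributes, min_pairs, max_blocks):
    """Refine blocks attribute by attribute until a stop condition holds."""
    blocks = {"root": list(entities)}
    for attribute in attributes:
        refined = {}
        pending = len(blocks)
        for key, block in blocks.items():
            pending -= 1
            total_if_kept = len(refined) + pending + 1
            if pair_count(block) <= min_pairs or total_if_kept >= max_blocks:
                refined[key] = block  # small enough, or partition limit reached
                continue
            for entity in block:
                refined.setdefault(block_key(key, entity, attribute), []).append(entity)
        blocks = refined
    return blocks


companies = [
    {"source": "S", "name": "a", "country": "US", "state": "CA"},
    {"source": "S", "name": "b", "country": "US", "state": "NY"},
    {"source": "S", "name": "c", "country": "CN"},
    {"source": "T", "name": "d", "country": "US", "state": "CA"},
    {"source": "T", "name": "e", "country": "US", "state": "NY"},
    {"source": "T", "name": "f", "country": "CN"},
]
final_blocks = partition(companies, ["country", "state"], min_pairs=1, max_blocks=100)
```

Because each key embeds the full path of attribute values, two blocks produced under different parents can never collide.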
A clustering-based partitioning algorithm can be implemented with a clustering model that has already been trained, through an interface of the following form:
Figure PCTCN2019124552-appb-000007
The clustering model can make a prediction for the current entity directly and assign it to one of the clusters; in this case the number of partitions equals the number of clusters in the model. The partitions can, of course, be divided further on top of the clustering result.
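As a rough sketch of clustering-based partitioning, the following code assigns each entity to the nearest centroid of a model that is assumed to have been trained beforehand. The centroids, the `featurize` callback, and the two-dimensional coordinates are hypothetical; the text does not specify which clustering model is used, so nearest-centroid assignment is only one possibility.

```python
import math
from collections import defaultdict


def nearest_cluster(vector, centroids):
    """Index of the closest centroid by Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def cluster_blocks(entities, centroids, featurize):
    """Place each entity in the block of its predicted cluster."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[nearest_cluster(featurize(entity), centroids)].append(entity)
    return dict(blocks)


# Hypothetical pre-trained centroids over a two-dimensional feature space.
trained_centroids = [(0.0, 0.0), (10.0, 10.0)]
offices = [{"name": "a", "xy": (1.0, 0.5)},
           {"name": "b", "xy": (9.0, 11.0)},
           {"name": "c", "xy": (0.2, 0.1)}]
office_blocks = cluster_blocks(offices, trained_centroids, featurize=lambda e: e["xy"])
```

The number of non-empty blocks can never exceed the number of centroids, which matches the statement that the partition count equals the cluster count.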
3. Entity matching stage (MatchStage): for candidate entity pairs within the same partition, different weights can be assigned to the attributes of the entity itself and to the attributes of the entities associated with it, and the overall similarity of the candidate entity pair is computed as a weighted sum; candidate entity pairs whose similarity exceeds a given threshold are selected as matching entity pairs.
It should be noted that this process design allows rules based on strong associations to be inserted so that a match can be decided directly. For example, for company data in two data sources, if both records are listed companies and their stock ticker symbols are identical, they can be matched directly. This skips the similarity calculation and reduces the computational complexity of the matching stage.
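The strong-association shortcut described above can be sketched as a check that runs before any similarity computation. The field names `listed` and `ticker` and the function names are assumptions for illustration; the similarity function is passed in so that the sketch does not commit to a particular measure.

```python
def strong_rule_match(left, right):
    """True when both records are listed companies with the same ticker symbol."""
    return bool(
        left.get("listed") and right.get("listed")
        and left.get("ticker")
        and left.get("ticker") == right.get("ticker")
    )


def match(left, right, similarity, threshold):
    """Apply the strong rule first; fall back to the similarity threshold."""
    if strong_rule_match(left, right):
        return True  # similarity is never computed for this pair
    return similarity(left, right) > threshold


record_a = {"name": "Acme Corp", "listed": True, "ticker": "ACME"}
record_b = {"name": "ACME Corporation", "listed": True, "ticker": "ACME"}
record_c = {"name": "Acme Trading", "listed": False, "ticker": None}
```

Pairs decided by the rule cost one dictionary comparison instead of a full attribute-by-attribute similarity calculation.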
When a validation data set is provided, the results produced by the matching algorithm can be compared against it to verify the accuracy of the algorithm. By adjusting the attributes, the weight parameters, and the similarity threshold, and comparing the results over multiple runs, the accuracy can be improved step by step. For example, suppose two company entities are compared by a weighted sum of the similarities of their names and of their stock ticker symbols. If the names are expressed in different languages in the different data sources, name similarity is a less reliable signal, so its weight should be lowered, while the weight of the ticker symbol similarity should be set relatively higher.
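The weighted comparison and its tuning against a validation set can be sketched as follows. This is an assumption-laden illustration: string similarity is taken from Python's `difflib.SequenceMatcher`, the weighted sum is normalised by the total weight so that the result stays between 0 and 1, and the attribute names, weights, and threshold are invented for the example.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] between two strings; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_similarity(left, right, weights):
    """Weighted sum of per-attribute similarities, normalised by total weight."""
    total = sum(weights.values())
    weighted = sum(
        weight * text_similarity(left.get(attribute), right.get(attribute))
        for attribute, weight in weights.items()
    )
    return weighted / total


def is_match(left, right, weights, threshold):
    return overall_similarity(left, right, weights) > threshold


def accuracy(pairs, labels, weights, threshold):
    """Share of validation pairs on which the matcher agrees with the label."""
    hits = sum(
        is_match(left, right, weights, threshold) == label
        for (left, right), label in zip(pairs, labels)
    )
    return hits / len(pairs)


english = {"name": "Acme Corp", "ticker": "ACME"}
chinese = {"name": "阿克米公司", "ticker": "ACME"}
other = {"name": "Globex", "ticker": "GLBX"}
validation_pairs = [(english, chinese), (english, other)]
validation_labels = [True, False]

name_heavy = {"name": 0.8, "ticker": 0.2}
ticker_heavy = {"name": 0.2, "ticker": 0.8}
```

With names in different languages, the name-heavy weighting misses the true match, while the ticker-heavy weighting classifies both validation pairs correctly, which is the adjustment the paragraph above describes.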
The entity matching algorithm of the present application can be run for multiple iterations with adjusted parameters to improve the accuracy of the matching results.
Some of the configuration parameters of the entity matching stage (MatchStage) are shown in the following table.
Figure PCTCN2019124552-appb-000008
Figure PCTCN2019124552-appb-000009
A custom entity matching algorithm can be used to decide whether two entities refer to the same knowledge representation. The interface takes the following form:
Figure PCTCN2019124552-appb-000010
In the example above, a pre-trained machine learning binary classification model takes the vector of per-attribute similarities of the two entities as input and infers the probability that they can be classified as the same entity (1 if they are the same).
Finally, the matched entity pairs are output to the result set.
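The classifier-based matching described above can be illustrated with a logistic model, which is only one possible binary classifier; the text does not say which model is used. The coefficients and intercept below are hypothetical values standing in for a pre-trained model, and the similarity vector is assumed to hold the name similarity followed by the ticker similarity.

```python
import math


def match_probability(similarities, coefficients, intercept):
    """Logistic model over the per-attribute similarity vector of an entity pair."""
    score = intercept + sum(c * s for c, s in zip(coefficients, similarities))
    return 1.0 / (1.0 + math.exp(-score))


def collect_matches(candidates, coefficients, intercept, cutoff=0.5):
    """Keep the candidate pairs whose predicted probability exceeds the cutoff."""
    return [
        pair_id
        for pair_id, similarities in candidates
        if match_probability(similarities, coefficients, intercept) > cutoff
    ]


# Hypothetical parameters of a model trained elsewhere.
COEFFICIENTS = [4.0, 4.0]
INTERCEPT = -4.0

candidate_vectors = [("s1-t1", [1.0, 1.0]), ("s1-t2", [0.0, 0.0]), ("s2-t3", [0.9, 0.8])]
result_set = collect_matches(candidate_vectors, COEFFICIENTS, INTERCEPT)
```

Unlike a fixed weighted sum, the coefficients here would be learned from labelled pairs, so the relative importance of each attribute does not have to be tuned by hand.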
4. Entity fusion stage (MergeStage): for data from different data sources that actually refers to the same entity, the entity attribute values are supplemented, replaced, and normalized according to a fusion algorithm, ultimately generating a unified entity representation.
A custom fusion algorithm is generally required. The interface takes the following form:
Figure PCTCN2019124552-appb-000011
Data fusion can be implemented in combination with different business rules; for example, a name can be given multiple aliases, and mailboxes, addresses, and the like can be put into a standardized format. Missing attribute data can be filled in by a crawler or manually, producing high-quality data that supports applications such as knowledge graph search and analysis.
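A fusion rule set of the kind described above might look like the following sketch. The specific rules (the first record wins on conflicts, the other record fills gaps, alternative names are kept under an `aliases` key, e-mail addresses are trimmed and lowercased) are assumptions chosen for illustration, not the fusion algorithm of the present application.

```python
def merge_entities(primary, secondary):
    """Merge two records of the same entity into one unified representation."""
    # Start from the primary record, ignoring its empty attributes.
    merged = {key: value for key, value in primary.items() if value not in (None, "")}
    for attribute, value in secondary.items():
        if value in (None, ""):
            continue
        merged.setdefault(attribute, value)  # supplement only what is missing

    # Keep every differing name as an alias of the unified entity.
    names = {primary.get("name"), secondary.get("name")} - {None, "", merged.get("name")}
    if names:
        merged["aliases"] = sorted(names)

    # Normalise the e-mail address to a standard form.
    if "email" in merged:
        merged["email"] = merged["email"].strip().lower()
    return merged


from_source_a = {"name": "Acme Corp", "email": " Info@Acme.com ", "address": None}
from_source_b = {"name": "ACME Corporation", "address": "1 Main St"}
unified = merge_entities(from_source_a, from_source_b)
```

Attributes that remain missing after the merge would be the candidates for crawler-based or manual filling.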
In a further embodiment, in addition to the stages defined above, stages with other functions (such as a data processing stage) can be arranged into the flow. An interface of the following form may be used:
Figure PCTCN2019124552-appb-000012
The data to be processed is passed in through the input configuration parameter; when processing is complete, the result is written to output and passed on to the next stage, which extends the functionality of the system.
By the above means, the present application implements a general pipeline for entity fusion in big data scenarios. The pipeline consists of multiple stages; each stage can be extended flexibly through configuration, and custom stages (CustomStage) can be arranged into the pipeline to suit different application scenarios. Except for the data preprocessing stage (InputStage), which has only an output, every stage has an input configuration. The input configuration can specify the entity lists, relationship lists, data addresses, and related metadata (the schema, including table names, column names, and so on) from different data sources that the stage needs in order to run. Once a stage has read its input data, it runs its algorithm, writes the result to the data platform, and emits all data addresses and metadata through output. The stages can therefore be run in series, chained through input and output, or run individually by specifying the input parameters explicitly.
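The input/output chaining of stages can be sketched with plain functions. This is a simplified illustration: each stage is modelled as a function from an input dictionary to an output dictionary, and the stage names, dictionary keys, and `mem://` addresses are assumptions rather than the actual interface.

```python
from typing import Callable, Dict, List, Optional

StageFn = Callable[[Dict], Dict]


def run_pipeline(stages: List[StageFn], initial_input: Optional[Dict] = None) -> Dict:
    """Run stages in series: the output of one stage is the input of the next."""
    payload = dict(initial_input or {})
    for stage in stages:
        payload = stage(payload)
    return payload


def input_stage(_: Dict) -> Dict:
    """Has no input of its own; emits data addresses and schema metadata."""
    return {"addresses": ["mem://source_a", "mem://source_b"],
            "schema": {"columns": ["name", "country"]}}


def custom_stage(stage_input: Dict) -> Dict:
    """A custom stage inserted between two standard stages."""
    return {**stage_input, "note": "custom step ran"}


def blocking_stage(stage_input: Dict) -> Dict:
    """Reads the addresses it was given and emits the addresses of its results."""
    blocked = [address + "/blocks" for address in stage_input["addresses"]]
    return {**stage_input, "addresses": blocked}


chained = run_pipeline([input_stage, custom_stage, blocking_stage])
standalone = blocking_stage({"addresses": ["mem://manual"]})
```

The last line shows the second mode mentioned above: a stage run on its own with an explicitly specified input instead of the output of a previous stage.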
Referring to FIG. 2, the flow of the second embodiment of the data fusion method for a knowledge graph of the present application is shown. It differs from the first method embodiment above in that an attribute alignment stage (Attribute Matching) is added between the data preprocessing stage and the entity partitioning stage. This stage aligns the entities that come from multiple data sources and are stored on the data platform after preprocessing, according to the actual meaning of their attributes. For example, the "地址" (Chinese for "address") field of data source A is aligned with the "Address" field of data source B, and in the subsequent partitioning and matching stages the aligned fields are treated as fields with the same meaning.
In a specific implementation, the actual meaning of an entity attribute can be set manually, or it can be defined by configuring an attribute meaning lookup table in the system; the present application places no restriction on this.
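An attribute meaning lookup table can be as simple as a per-source mapping from local field names to shared names. The table contents, the canonical name `address`, and the function `align_attributes` below are illustrative assumptions.

```python
# Hypothetical lookup table: per data source, local field name -> shared meaning.
ALIGNMENT_TABLE = {
    "A": {"地址": "address", "名称": "name"},
    "B": {"Address": "address", "Name": "name"},
}


def align_attributes(entity, source, table):
    """Rename the fields of one entity to the shared attribute names."""
    mapping = table.get(source, {})
    # Fields without an entry in the table keep their original name.
    return {mapping.get(field, field): value for field, value in entity.items()}


entity_a = align_attributes({"名称": "阿克米公司", "地址": "上海"}, "A", ALIGNMENT_TABLE)
entity_b = align_attributes({"Name": "Acme", "Address": "Shanghai", "ticker": "ACME"}, "B", ALIGNMENT_TABLE)
```

After alignment, both records expose an `address` field, so the partitioning and matching stages can compare them without knowing the original field names.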
The present application also discloses a storage medium on which a program for executing the above method is recorded. The storage medium includes any mechanism configured to store or transmit information in a computer-readable form. For example, the storage medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, and digital signals).
Referring to FIG. 3, a structural block diagram of an embodiment of the data fusion device for a knowledge graph of the present application is shown. The device includes a data platform 10, a data preprocessing module 11, an entity partitioning module 12, an entity matching module 13, and an entity fusion module 14, wherein:
The data platform 10 is configured with a unified access interface and provides computing and storage services to the other modules. The present application places no restriction on the architecture of the data platform; to make it easier to scale computing and storage resources as the data volume grows, the Hadoop Distributed File System or a cloud computing architecture may be used.
The data preprocessing module 11 is used to process data from different data sources, convert it into a triple (S-P-O) format, store it on the data platform 10 through the unified access interface, and receive the graph data index information returned by the data platform 10. The graph data index information may be the storage address, on the data platform 10, of the graph data in triple format, together with its metadata.
The entity partitioning module 12 is used to divide the entities stored on the data platform 10 into one or more sub-partitions by attribute, through the unified access interface and according to the graph data index information. In a specific implementation, the entity partitioning module 12 may include an equality partitioning submodule that partitions the entities stored on the data platform by equality on a globally unique partition key generated from entity attributes, a clustering partitioning submodule that partitions the entities stored on the data platform based on a preset clustering model, and/or submodules for other partitioning methods.
The entity matching module 13 is used to compute the similarity of candidate entity pairs that fall into the same sub-partition and to select the matching entity pairs that satisfy a preset similarity condition.
The entity fusion module 14 is used to supplement and/or replace the entity attribute values of the matching entity pairs, generating a unified entity representation.
The functional modules of the device embodiment of the present application have upstream and downstream dependencies along the pipeline, but different modules constrain one another only through data formats and are decoupled through the unified interface provided by the data platform, so each module can be developed independently. The algorithm of each module can be replaced flexibly, and by implementing a custom stage, a new module can be inserted between existing modules to meet custom requirements. For example, to improve adaptability to a variety of data sources and the accuracy of the subsequent entity partitioning, matching, and fusion, an attribute alignment module 15 may be inserted between the data preprocessing module 11 and the entity partitioning module 12. It aligns the entities that come from different data sources and are stored on the data platform 10 after processing by the data preprocessing module 11, according to the actual meaning of their attributes. For example, the "地址" (Chinese for "address") field of data source A is aligned with the "Address" field of data source B, and in the subsequent partitioning and matching stages the aligned fields are treated as fields with the same meaning.
In a further preferred device embodiment, the entity matching module 13 may specifically include a similarity calculation submodule and a comparison submodule. The similarity calculation submodule is used to assign different weights to the attributes of the entity itself and to the attributes of other entities related to that entity, and to compute the overall similarity of a candidate entity pair as a weighted sum. The comparison submodule is used to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to take that candidate entity pair as a matching entity pair.
In another preferred device embodiment, the device may further include a data processing module, which processes node entity data and edge entity data on the data platform through the unified access interface, returns the data processing result, and passes it to the next module.
The above data processing module may be implemented in the following form:
Figure PCTCN2019124552-appb-000013
The data to be processed is passed in through the input configuration parameter; when data processing is complete, the result is written to output and passed on to the functional module of the next stage, which extends the functionality of the device.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another. Because the device embodiments of the present application are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.
Specific examples have been used herein to explain the principles and implementations of the present application. The description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. A person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. 一种知识图谱的数据融合方法,执行所述方法的知识图谱系统包括配置有统一访问接口的数据平台,所述方法包括:A data fusion method for a knowledge graph. The knowledge graph system that executes the method includes a data platform configured with a unified access interface. The method includes:
    将来自不同数据源的数据进行处理后转换为三元组格式,通过所述统一访问接口存储到数据平台,并接收所述数据平台返回的图数据索引信息;Processing data from different data sources into a triplet format, storing to a data platform through the unified access interface, and receiving graph data index information returned by the data platform;
    根据所述图数据索引信息,将所述数据平台中存储的实体按属性划分为一个或多个子分区;According to the graph data index information, the entities stored in the data platform are divided into one or more sub-partitions according to attributes;
    对划分到相同子分区中的候选实体对进行相似度计算,筛选出满足预设相似度条件的匹配实体对;Perform similarity calculation on candidate entity pairs divided into the same sub-division, and select matching entity pairs that meet preset similarity conditions;
    对所述匹配实体对的实体属性值进行补充和/或替换,生成统一的实体表示。Supplement and/or replace the entity attribute values of the matching entity pairs to generate a unified entity representation.
  2. 根据权利要求1所述的方法,其中,在步骤根据所述图数据索引信息,将所述数据平台中存储的实体按属性划分为一个或多个子分区之前,还包括:将来自多个数据源的数据转换为三元组格式之后存储在数据平台中的实体根据其属性的实际含义进行对齐。The method according to claim 1, wherein before the step of dividing the entities stored in the data platform into one or more sub-partitions according to the attributes according to the graph data index information, the method further includes: After converting the data into a triple format, the entities stored in the data platform are aligned according to the actual meaning of their attributes.
  3. 根据权利要求1所述的方法,其中,所述子分区划分方式为根据实体属性产生的全局唯一分区键进行等值划分,或基于预设聚类模型进行划分。The method according to claim 1, wherein the division method of the sub-partitions is to perform an equivalent division based on a globally unique partition key generated by an entity attribute, or to divide based on a preset clustering model.
  4. The method according to claim 1, wherein performing similarity calculation on the candidate entity pairs assigned to the same sub-partition and selecting the matching entity pairs that satisfy the preset similarity condition specifically comprises:
    setting different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and calculating the overall similarity of each candidate entity pair as a weighted sum;
    if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, taking that candidate entity pair as a matching entity pair.
  5. The method according to claim 1, wherein missing entity attribute values are supplemented by obtaining them from the network with a crawler or by manual entry.
  6. The method according to claim 1, wherein the graph data index information is the storage address, on the data platform, of the graph data in triple format, together with its metadata.
  7. A data fusion device for a knowledge graph, comprising a data platform, a data preprocessing module, an entity partitioning module, an entity matching module and an entity fusion module, wherein:
    the data platform includes a unified access interface;
    the data preprocessing module is configured to process data from different data sources and convert it into a triple format, store it in the data platform through the unified access interface, and receive graph data index information returned by the data platform;
    the entity partitioning module is configured to divide the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module;
    the entity matching module is configured to perform similarity calculation on the candidate entity pairs that the entity partitioning module has assigned to the same sub-partition, and to select matching entity pairs that satisfy a preset similarity condition;
    the entity fusion module is configured to supplement and/or replace the entity attribute values of the matching entity pairs selected by the entity matching module, to generate a unified entity representation.
  8. The device according to claim 7, wherein:
    the entity partitioning module includes an equal-value partitioning submodule and/or a clustering partitioning submodule;
    the equal-value partitioning submodule is configured to perform equal-value partitioning of the entities stored in the data platform according to a globally unique partition key generated from entity attributes;
    the clustering partitioning submodule is configured to partition the entities stored in the data platform based on a preset clustering model;
    the entity matching module specifically includes a similarity calculation submodule and a comparison submodule;
    the similarity calculation submodule is configured to set different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and to calculate the overall similarity of each candidate entity pair as a weighted sum;
    the comparison submodule is configured to determine whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to take that candidate entity pair as a matching entity pair.
  9. The device according to claim 7, wherein the device further comprises a data processing module and/or an attribute alignment module;
    the data processing module is configured to process node entity data and edge entity data in the data platform through the unified access interface, and to return the data processing result and pass it to the next module;
    the attribute alignment module is configured to align, according to the actual meaning of their attributes, the entities stored in the data platform after the data from multiple data sources has been processed by the data preprocessing module.
  10. A storage medium storing a program configured to execute the method according to any one of claims 1 to 6.
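
The steps recited in claims 1, 3 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the entity records, the attribute names (`name`, `city`, `legal_rep`, `phone`), the weights, the threshold, and the use of `difflib.SequenceMatcher` as the per-attribute similarity measure are all assumptions chosen for illustration. Here `legal_rep` stands in for an attribute of a related entity, and fusion is reduced to supplementing missing values.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical entities as they might be read back from the data platform.
ENTITIES = {
    "a1": {"name": "Acme Ltd", "city": "Shanghai", "legal_rep": "Li Wei", "phone": None},
    "b1": {"name": "Acme Ltd", "city": "Shanghai", "legal_rep": "Li Wei", "phone": "021-555"},
    "b2": {"name": "Zeta Corp", "city": "Shanghai", "legal_rep": "Wang Fang", "phone": None},
    "c1": {"name": "Acme Ltd", "city": "Beijing", "legal_rep": "Li Wei", "phone": None},
}


def to_triples(entity_id, record):
    """Convert a flat record into (subject, predicate, object) triples."""
    return [(entity_id, attr, value) for attr, value in record.items() if value is not None]


def partition_by_key(entities, key_attr):
    """Equal-value partitioning: entities with the same key value share a sub-partition."""
    partitions = defaultdict(list)
    for entity_id, attrs in entities.items():
        partitions[attrs.get(key_attr)].append(entity_id)
    return partitions


def attr_similarity(a, b):
    """String similarity in [0, 1]; a missing value contributes 0 (an assumption)."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a), str(b)).ratio()


def overall_similarity(e1, e2, weights):
    """Weighted sum of per-attribute similarities."""
    return sum(w * attr_similarity(e1.get(attr), e2.get(attr)) for attr, w in weights.items())


def match_pairs(entities, partitions, weights, threshold):
    """Compare only pairs within the same sub-partition; keep those above the threshold."""
    matches = []
    for members in partitions.values():
        for a, b in combinations(members, 2):
            if overall_similarity(entities[a], entities[b], weights) > threshold:
                matches.append((a, b))
    return matches


def fuse(e1, e2):
    """Supplement attribute values missing in e1 with the values from e2."""
    merged = dict(e1)
    for attr, value in e2.items():
        if merged.get(attr) is None:
            merged[attr] = value
    return merged


partitions = partition_by_key(ENTITIES, "city")
matches = match_pairs(ENTITIES, partitions, {"name": 0.7, "legal_rep": 0.3}, threshold=0.8)
unified = [fuse(ENTITIES[a], ENTITIES[b]) for a, b in matches]
print(matches)
print(unified)
```

In this toy data, `c1` shares a name with `a1` but falls into a different sub-partition, so the two are never compared. That is the trade-off of partitioning before matching: fewer pairwise comparisons, at the cost of missing matches whose partition key differs.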
PCT/CN2019/124552 2018-12-29 2019-12-11 Data merging method and apparatus for knowledge graph WO2020135048A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811635696.XA CN109739939A (en) 2018-12-29 2018-12-29 The data fusion method and device of knowledge mapping
CN201811635696.X 2018-12-29

Publications (1)

Publication Number Publication Date
WO2020135048A1 (en) 2020-07-02

Family

ID=66362378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124552 WO2020135048A1 (en) 2018-12-29 2019-12-11 Data merging method and apparatus for knowledge graph

Country Status (2)

Country Link
CN (1) CN109739939A (en)
WO (1) WO2020135048A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699252A (en) * 2021-03-25 2021-04-23 成都数联铭品科技有限公司 Processing method of attribute data applied to knowledge graph and electronic equipment

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783582B (en) * 2018-12-04 2023-08-15 平安科技(深圳)有限公司 Knowledge base alignment method, device, computer equipment and storage medium
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110427415A (en) * 2019-08-02 2019-11-08 泰康保险集团股份有限公司 Knowledge share method, device, system media and electronic equipment
CN110532304B (en) * 2019-09-06 2020-11-24 京东城市(北京)数字科技有限公司 Data processing method and device, computer readable storage medium and electronic device
CN110580294B (en) * 2019-09-11 2022-11-29 腾讯科技(深圳)有限公司 Entity fusion method, device, equipment and storage medium
CN110704635B (en) * 2019-09-16 2023-12-12 金色熊猫有限公司 Method and device for converting triplet data in knowledge graph
CN110598072B (en) * 2019-09-24 2022-03-01 恩亿科(北京)数据科技有限公司 Feature data aggregation method and device
CN111046186A (en) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 Entity alignment method, device and equipment of knowledge graph and storage medium
CN110826316B (en) * 2019-11-06 2021-08-10 北京交通大学 Method for identifying sensitive information applied to referee document
CN111026874A (en) * 2019-11-22 2020-04-17 海信集团有限公司 Data processing method and server of knowledge graph
CN110929105B (en) * 2019-11-28 2022-11-29 广东云徙智能科技有限公司 User ID (identity) association method based on big data technology
CN111125376B (en) * 2019-12-23 2023-08-29 秒针信息技术有限公司 Knowledge graph generation method and device, data processing equipment and storage medium
CN111475653B (en) * 2019-12-30 2021-03-02 北京国双科技有限公司 Method and device for constructing knowledge graph in oil and gas exploration and development field
CN111291196B (en) * 2020-01-22 2024-03-22 腾讯科技(深圳)有限公司 Knowledge graph perfecting method and device, and data processing method and device
CN111444351B (en) * 2020-03-24 2023-09-12 清华苏州环境创新研究院 Knowledge graph construction method and device in industrial process field
CN111597239B (en) * 2020-04-10 2021-08-31 中科驭数(北京)科技有限公司 Data alignment method and device
CN111522803B (en) * 2020-04-14 2023-05-19 北京仁科互动网络技术有限公司 Tenant interaction method and device of software service platform and electronic equipment
CN111563133A (en) * 2020-05-06 2020-08-21 支付宝(杭州)信息技术有限公司 Method and system for data fusion based on entity relationship
CN112182330A (en) * 2020-09-23 2021-01-05 创新奇智(成都)科技有限公司 Knowledge graph construction method and device, electronic equipment and computer storage medium
CN112906826A (en) * 2021-03-30 2021-06-04 平安科技(深圳)有限公司 Multi-dimension-based knowledge graph fusion method and device and computer equipment
CN113297213B (en) * 2021-04-29 2023-09-12 军事科学院系统工程研究院网络信息研究所 Dynamic multi-attribute matching method for entity object
CN113392227B (en) * 2021-05-31 2024-04-19 交控科技股份有限公司 Metadata knowledge graph engine system oriented to rail transit field
CN113760995A (en) * 2021-09-09 2021-12-07 上海明略人工智能(集团)有限公司 Entity linking method, system, equipment and storage medium
CN113901264A (en) * 2021-11-12 2022-01-07 央视频融媒体发展有限公司 Method and system for matching periodic entities among movie and television attribute data sources
CN113934866B (en) * 2021-12-17 2022-03-08 鲁班(北京)电子商务科技有限公司 Commodity entity matching method and device based on set similarity
CN114282073B (en) * 2022-03-02 2022-07-15 支付宝(杭州)信息技术有限公司 Data storage method and device and data reading method and device
CN114896363B (en) * 2022-04-19 2023-03-28 北京月新时代科技股份有限公司 Data management method, device, equipment and medium
CN115577318B (en) * 2022-09-30 2023-07-21 北京大数据先进技术研究院 Semi-physical-based data fusion evaluation method, system, equipment and storage medium
CN117556058A (en) * 2024-01-11 2024-02-13 安徽大学 Knowledge graph enhanced network embedded author name disambiguation method and device
CN117725555A (en) * 2024-02-08 2024-03-19 暗物智能科技(广州)有限公司 Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150142829A1 (en) * 2013-11-18 2015-05-21 Fujitsu Limited System, apparatus, program and method for data aggregatione
CN105956015A (en) * 2016-04-22 2016-09-21 四川中软科技有限公司 Service platform integration method based on big data
CN107545046A (en) * 2017-08-17 2018-01-05 北京奇安信科技有限公司 A kind of fusion method and device of multi-source heterogeneous data
CN107958086A (en) * 2017-12-18 2018-04-24 北京睿力科技有限公司 The multi-source heterogeneous database data for solving data semantic Heterogeneity integrates method
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10303999B2 (en) * 2011-02-22 2019-05-28 Refinitiv Us Organization Llc Machine learning-based relationship association and related discovery and search engines
CN107145523B (en) * 2017-04-12 2019-10-18 浙江大学 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching
CN108647318A (en) * 2018-05-10 2018-10-12 北京航空航天大学 A kind of knowledge fusion method based on multi-source data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699252A (en) * 2021-03-25 2021-04-23 成都数联铭品科技有限公司 Processing method of attribute data applied to knowledge graph and electronic equipment
CN112699252B (en) * 2021-03-25 2021-07-23 成都数联铭品科技有限公司 Processing method of attribute data applied to knowledge graph and electronic equipment

Also Published As

Publication number Publication date
CN109739939A (en) 2019-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 19904173; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 19904173; Country of ref document: EP; Kind code of ref document: A1
Kind code of ref document: A1