WO2020143184A1 - Knowledge fusion method and apparatus, computer device, and storage medium - Google Patents

Knowledge fusion method and apparatus, computer device, and storage medium

Publication number
WO2020143184A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
knowledge
entity
attribute
word vector
Prior art date
Application number
PCT/CN2019/092597
Other languages
French (fr)
Chinese (zh)
Inventor
孙佳兴
胡逸凡
陈泽晖
黄鸿顺
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020143184A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of knowledge graph technology, and in particular, to a knowledge fusion method, device, computer equipment, and storage medium.
  • knowledge data is composed of three parts, namely: entity information, relationship information and attribute information.
  • Knowledge fusion refers to the discovery of different expressions of the same concept in heterogeneous databases. It organizes and manages distributed data sources and knowledge sources, and transforms, integrates, and fuses knowledge elements according to application requirements, so as to obtain valuable or usable new knowledge while optimizing the structure and connotation of knowledge objects and providing knowledge-based services.
  • research on knowledge fusion has value for knowledge sharing, the interaction, integration, and collaborative work of knowledge systems, and the optimization of knowledge service quality in a distributed knowledge base environment. It is also of considerable significance for research on knowledge discovery based on knowledge connotation and on the creation, organization, evaluation, and optimization of new knowledge.
  • a method of knowledge fusion including:
  • a knowledge fusion device includes the following modules:
  • the data acquisition module is set to acquire several knowledge data from the source of knowledge data
  • the vector generation module is configured to extract entity data from any of the knowledge data, convert the entity data into vectors, and generate multidimensional word vectors;
  • a data vectorization module configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;
  • the attribute value obtaining module is set to extract the original attribute data in any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain the attribute value of the real attribute data;
  • the fusion determination module is configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into the credibility recognition model, obtain the credibility of the knowledge data as the output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.
  • a computer device includes a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor is caused to perform the steps of the knowledge fusion method.
  • a storage medium storing computer-readable instructions, when the computer-readable instructions are executed by one or more processors, causes the one or more processors to perform the steps of the above knowledge fusion method.
  • the above knowledge fusion method, device, computer equipment, and storage medium include: acquiring several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multidimensional word vector; reducing the dimension of the multidimensional word vector to obtain a two-dimensional word vector, and transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into the credibility recognition model, obtaining the credibility of the knowledge data as the output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.
  • FIG. 1 is an overall flowchart of a knowledge fusion method in an embodiment of this application
  • FIG. 2 is a schematic diagram of a data acquisition process of a knowledge fusion method in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a vector generation process of a knowledge fusion method in an embodiment of the present application
  • FIG. 4 is a structural diagram of a knowledge fusion device in an embodiment of the present application.
  • FIG. 1 is an overall flowchart of a knowledge fusion method in an embodiment of the present application. As shown in FIG. 1, a knowledge fusion method includes:
  • the knowledge data in this step may come from the same knowledge data source or from different data sources, and may come from local data or from network data. If it comes from local data, the storage path of the knowledge data must be obtained when acquiring the knowledge data; if it comes from a network data source, the network address of the knowledge data source must be obtained when acquiring the knowledge data.
  • the entity name list stored in the database is obtained, at least one entity name in the entity name list is randomly extracted, and entity data is extracted from the knowledge data according to the entity name.
  • the entity names of the entity data are ball-sport nouns such as "soccer" and "volleyball".
  • PCA can be used to reduce the dimension of multidimensional word vectors. For example, if there are m pieces of n-dimensional data, the following steps can be used to reduce the dimension:
  • the matrix Q is the data after dimensionality reduction to k-dimension.
  • the original attribute data can be divided into several sub-data segments, and then the attribute word query can be performed on the data in each sub-data segment. If there is no attribute word, the sub-data segment is cleared.
  • the elements in the entity data matrix and the attribute values of the real attribute data are input as parameters into a credibility recognition model, the credibility of the knowledge data is obtained as the output, and the credibility is compared with the preset credibility threshold; if it is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.
  • credibility is reliability, which refers to the degree of consistency of the results obtained when the same method is repeatedly measured on the same object.
  • credibility refers to the reliability of measured data.
  • the preset credibility threshold is obtained from historical data statistics, and is generally set to 95%.
  • FIG. 2 is a schematic diagram of a data acquisition process of a knowledge fusion method in an embodiment of the present application. As shown in the figure, step S1, acquiring several pieces of knowledge data from a knowledge data source, includes:
  • the network address of the knowledge data source from which knowledge data is to be extracted is obtained, and the type of the network address, that is, whether it is a static IP address or a dynamic IP address, is determined according to the format of the network address. If it is a static IP address, the IP address table is retrieved from the database for comparison to determine whether the static IP address is on the IP address table; a knowledge data acquisition instruction is sent if it is, and is not sent if it is not. If it is a dynamic IP address, DNS resolution is performed on the dynamic IP address to obtain the corresponding DNS resolution code, and the DNS resolution code table in the database is then called to determine whether the DNS resolution code is on the DNS resolution code table; the knowledge data acquisition instruction is sent if it is, and is not sent if it is not.
  • S102 Receive feedback information from the knowledge data source, extract keywords of the data source type from the feedback information, and determine the type of the knowledge data source according to the keywords;
  • the form keyword indicates whether the knowledge data is structured data, semi-structured data, or unstructured data.
  • the form keyword "table" corresponds to structured data
  • the form keyword "webpage" corresponds to semi-structured data
  • the form keyword "text" corresponds to unstructured data.
  • S103 Acquire an extraction method corresponding to the type of the knowledge data source, and extract several knowledge data of the knowledge data source according to the extraction method.
  • different forms of data sources correspond to different data extraction methods.
  • semi-structured web page information is usually crawled by web crawlers.
  • for unstructured text, a text language is usually used for extraction.
  • the data form of the source of knowledge data is determined, so that the knowledge data of the source of knowledge data can be extracted by using the correct extraction method.
  • FIG. 3 is a schematic diagram of a vector generation process of a knowledge fusion method in an embodiment of the present application.
  • step S2, extracting entity data from any of the knowledge data and vectorizing the entity data to generate multidimensional word vectors, includes:
  • the length of the initial segment of the entity data is set according to the historical data of the length values of the entity words in the entity data. For example, if the length of the entity words in the historical data stored in the database ranges from 1 to 10, the length of the initial segment is set to the maximum value, 10.
  • the length of each sub-data block may be inconsistent, that is, the length of each sub-data block is determined according to the length of actual entity words.
  • Extract the entity data in the final sub-data block, extract the semantic features of the entity data in the final sub-data block, apply the word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and use the segment length of the final sub-data block as a coefficient to multiply the initial multi-dimensional word vector to obtain the final multi-dimensional word vector.
  • semantic features include elements such as semantics, grammar, and structure.
  • the word vector conversion method usually uses the Word2Vec algorithm. This algorithm can link each semantic feature with its context, thereby transforming the related semantic features into an initial multi-dimensional word vector.
  • the entity data is numerically represented, which is convenient for using a machine learning method to perform similarity calculation.
  • the multi-dimensional word vector is reduced in dimension to obtain a two-dimensional word vector, and the two-dimensional word vector is transposed and multiplied by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data, including:
  • sample points refer to each point in the multi-dimensional vector; each sample point in the multi-dimensional space N has directly connected points on the same plane, and these points are called nearest neighbors; the value of K ranges from 1 to n, where n is a positive integer.
  • the local weight matrix $W_i = \{w_{i1}, w_{i2}, \ldots, w_{iK}\}$ of each sample point is established;
  • each sample point is mapped to a low-dimensional space, and the mapping conditions are:
  • ⁇ (Y) is the value of the loss function
  • $y_{ij}$ is the value of the neighbor
  • $y_n$ is the output vector of the neighbor
  • $W_{ij}$ is the element in the local weight matrix
  • K is the number of neighbors
  • N is the output vector of the neighbor
  • discretization refers to the mapping of finite individuals in infinite space into a limited space to improve the space-time efficiency of the algorithm.
  • the unique() function is a deduplication function supported by development languages and scientific computing environments such as C++, PHP, and Matlab; it is used to remove duplicate values in a set, or to take the single values from a set.
  • the vector dimension of the original attribute data is equal to the quantity of the original attribute data.
  • the original attribute data is real attribute data; if the difference is not within the error threshold, redundant attribute data in the original attribute data is removed based on the difference to obtain the real attribute data;
  • the characteristic value of the real attribute data matrix is obtained, and the characteristic value is the attribute value.
  • a two-dimensional attribute vector can be obtained, the two-dimensional attribute vector can be transposed to obtain a transposed two-dimensional attribute vector, and the two-dimensional attribute vector can be multiplied by the transposed two-dimensional attribute vector to obtain the real attribute vector.
  • the real attribute value is better obtained.
  • the elements in the entity data matrix and the attribute values of the real attribute data are input as parameters into a credibility recognition model, the credibility of the knowledge data is obtained as the output, and the credibility is compared with a preset credibility threshold; if it is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused, including:
  • $L(m_1, m_2)$ is the similarity distance function, $m_1$ is the element, and $m_2$ is the attribute value;
  • $\mathrm{Crd}(m)$ is a credibility function
  • the cosine algorithm or the Euclidean distance algorithm can also be used in the similarity calculation.
  • the credibility threshold is obtained based on historical data statistics.
  • an extraction method corresponding to the type of the knowledge data source is obtained, and extracting several pieces of knowledge data of the knowledge data source according to the extraction method includes:
  • extracting using a web crawler tool includes:
  • obtain the keyword group in the task queue for the knowledge data to be extracted, the keyword group containing multiple keywords; the keyword group in the task queue may be a category phrase such as "ball", in which case the keywords included under the keyword group may include "basketball", "football", "table tennis", and so on.
  • the entity information refers to the information related to the "entity" such as the name of the entity.
  • when entity information is imported into the preset knowledge data table, the entity names in the preset knowledge data table are retrieved first; if the entity name in a piece of entity information is not in the preset knowledge data table, that entity information cannot be imported.
  • the preset knowledge data table is stored in the database and is compiled from all previously collected knowledge data.
  • the required knowledge data can be effectively extracted from the web page information, and the efficiency of knowledge data extraction can be improved.
  • a knowledge fusion device, as shown in FIG. 4, includes the following modules:
  • the data acquisition module 41 is configured to acquire several knowledge data from the source of knowledge data;
  • the vector generation module 42 is configured to extract entity data from any of the knowledge data, convert the entity data into vectors, and generate multidimensional word vectors;
  • the data vectorization module 43 is configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;
  • the attribute value obtaining module 44 is set to extract original attribute data in any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain attribute values of the real attribute data;
  • the fusion determination module 45 is configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into the credibility recognition model, obtain the credibility of the knowledge data as the output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.
  • the vector generation module is further set to:
  • the data acquisition module is further set to:
  • the initial segment containing at least one of the entity data; according to the segment length of the initial segment, the knowledge data is divided into several initial sub-data blocks; if any one of the initial sub-data blocks contains two or more entity data, the initial sub-data block is divided again to obtain a final sub-data block containing only one of the entity data; extract the entity data in the final sub-data block, extract the semantic features of the entity data in the final sub-data block, apply the word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and use the segment length of the final sub-data block as a coefficient to multiply the initial multi-dimensional word vector to obtain the final multi-dimensional word vector.
  • the data vectorization module is further set to:
  • the elements in the entity data matrix are vectorized entity data.
  • the attribute value acquisition module is also set to:
  • the characteristic value of the real attribute data matrix is obtained, and the characteristic value is the attribute value.
  • the fusion determination module is further configured to:
  • the vector generation module is further set to:
  • the keyword group contains multiple keywords; traverse the keyword group and crawl, through a web crawler, the webpage information corresponding to each keyword in the keyword group; obtain all the entity information in the webpage information and import the entity information into the preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, crawl the webpage through the web crawler again; otherwise, use the webpage information as the knowledge data.
  • in one embodiment, a computer device includes a memory and a processor.
  • the memory stores computer-readable instructions.
  • the computer device executes the steps of the knowledge fusion method described in the above embodiments.
  • a storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the knowledge fusion method described in the above embodiments.
  • the storage medium may be a non-volatile storage medium.
  • the program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, etc.
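
The definitions above mention that PCA can reduce m pieces of n-dimensional data to k dimensions, with the matrix Q as the result, but the individual steps are not reproduced in this text. The following is a minimal sketch of the textbook PCA procedure, which may differ from the applicant's exact steps; the function name `pca_reduce`, the NumPy dependency, and the random sample data are assumptions made for illustration.

```python
import numpy as np

def pca_reduce(data: np.ndarray, k: int) -> np.ndarray:
    """Reduce m samples of n-dimensional data (shape m x n) to k dimensions."""
    # Center each feature so the covariance is computed around the mean.
    centered = data - data.mean(axis=0)
    # n x n covariance matrix of the features.
    cov = np.cov(centered, rowvar=False)
    # eigh is used because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    projection = eigvecs[:, order]      # n x k projection matrix
    return centered @ projection        # Q: m x k reduced data

# Hypothetical input: 6 word vectors with 5 dimensions each.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 5))
Q = pca_reduce(vectors, k=2)
print(Q.shape)
```

With k = 2, each row of Q is a two-dimensional word vector of the kind the later steps operate on.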

Abstract

The present application relates to the technical field of knowledge graphs, in particular, to a knowledge fusion method and apparatus, a computer device, and a storage medium. The method comprises: obtaining multiple pieces of knowledge data in a knowledge data source; extracting entity data in any knowledge data, and performing vectorization conversion on the entity data to generate a multi-dimensional word vector; performing dimensionality reduction on the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and then multiplying by the original two-dimensional word vector to obtain an entity data matrix, elements in the entity data matrix being vectorized entity data; obtaining an attribute value of real attribute data; and inputting the elements in the entity data matrix and the attribute value of the real attribute data as parameters into a credibility identification model, obtaining the credibility of the knowledge data after parameter output, comparing the credibility with a preset credibility threshold value, and performing fusion. The present application achieves effective fusion of multiple attributes in the same entity.
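
The step of transposing the two-dimensional word vector and multiplying it by the original can be illustrated with a small sketch. The abstract does not state the shape of the operands, so this example assumes the two-dimensional word vectors are stacked row-wise into a matrix `V` (one row per entity); the variable names and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical two-dimensional word vectors, one row per entity.
V = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Transpose, then multiply by the original: (2 x 3) @ (3 x 2) -> 2 x 2.
entity_matrix = V.T @ V
print(entity_matrix)
```

Under this assumption the entity data matrix is a symmetric 2 x 2 matrix; if the product were taken in the other order (`V @ V.T`), the result would instead be a 3 x 3 matrix of pairwise inner products between entities.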

Description

Knowledge fusion method, device, computer equipment, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on January 11, 2019, with application number 201910025114.4 and the invention title "Knowledge fusion method, device, computer equipment, and storage medium", the entire contents of which are incorporated into this application by reference.
Technical field
The present application relates to the field of knowledge graph technology, and in particular, to a knowledge fusion method, device, computer equipment, and storage medium.
Background
A large amount of knowledge exists on the Internet today, and the data contained in web pages holds knowledge data in many forms. Knowledge data consists of three parts: entity information, relationship information, and attribute information. When organizing knowledge data, the knowledge data needs to be fused; this process is called knowledge fusion.
Knowledge fusion refers to the discovery of different expressions of the same concept in heterogeneous databases. It organizes and manages distributed data sources and knowledge sources, and transforms, integrates, and fuses knowledge elements according to application requirements, so as to obtain valuable or usable new knowledge while optimizing the structure and connotation of knowledge objects and providing knowledge-based services. Research on knowledge fusion has value for knowledge sharing, the interaction, integration, and collaborative work of knowledge systems, and the optimization of knowledge service quality in a distributed knowledge base environment. It is also of considerable significance for research on knowledge discovery based on knowledge connotation and on the creation, organization, evaluation, and optimization of new knowledge.
At present, attributes cannot be judged accurately during knowledge fusion, so multiple attributes belonging to the same entity cannot be effectively merged during fusion.
Summary of the invention
Based on this, to address the problem that multiple attributes belonging to the same entity cannot be effectively merged, it is necessary to provide a knowledge fusion method, device, computer equipment, and storage medium.
A knowledge fusion method, including:
acquiring several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multidimensional word vector; reducing the dimension of the multidimensional word vector to obtain a two-dimensional word vector, and transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data as the output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fusing the extracted original attribute data, and otherwise not fusing it.
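
The final comparison step can be illustrated with a minimal sketch. The credibility recognition model is not specified in this passage, so cosine similarity stands in for it here purely as an assumption (the definitions note that the cosine or Euclidean distance algorithm can be used in the similarity calculation). The names `cosine_similarity` and `should_fuse` and the sample vectors are hypothetical; the 0.95 default follows the 95% figure cited in the definitions.

```python
import math

# Assumed default, following the 95% threshold cited in the definitions.
CREDIBILITY_THRESHOLD = 0.95

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as carrying no evidence
    return dot / (norm_a * norm_b)

def should_fuse(entity_elements, attribute_values,
                threshold=CREDIBILITY_THRESHOLD):
    """Fuse only when the (stand-in) credibility exceeds the threshold."""
    credibility = cosine_similarity(entity_elements, attribute_values)
    return credibility > threshold

# Hypothetical inputs: nearly parallel vectors versus orthogonal ones.
print(should_fuse([1.0, 2.0, 3.0], [1.1, 2.0, 2.9]))
print(should_fuse([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

The comparison is strict, matching the wording "greater than the credibility threshold": a credibility exactly equal to the threshold does not trigger fusion.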
A knowledge fusion device, including the following modules:
a data acquisition module, configured to acquire several pieces of knowledge data from a knowledge data source;
a vector generation module, configured to extract entity data from any of the knowledge data and vectorize the entity data to generate a multidimensional word vector;
a data vectorization module, configured to reduce the dimension of the multidimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;
an attribute value acquisition module, configured to extract original attribute data from any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain the attribute values of the real attribute data;
a fusion determination module, configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtain the credibility of the knowledge data as the output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.
A computer device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the knowledge fusion method.
A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above knowledge fusion method.
The above knowledge fusion method, device, computer equipment, and storage medium include: acquiring several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multidimensional word vector; reducing the dimension of the multidimensional word vector to obtain a two-dimensional word vector, and transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data as the output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused. Through accurate matching of entities and attributes, this technical solution achieves effective fusion of multiple attributes of the same entity.
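
One of the filtering approaches described in the definitions divides the original attribute data into sub-data segments, queries each segment for attribute words, and clears segments that contain none. The following is a minimal sketch of that idea only. The function name `filter_attribute_segments`, the use of a semicolon as the segment delimiter, the substring match, and the sample text are all assumptions made for illustration; the text does not say how segments are delimited or how attribute words are matched.

```python
def filter_attribute_segments(raw_attribute_data, attribute_words, delimiter=";"):
    """Keep only the sub-data segments that contain at least one attribute word."""
    # Split the original attribute data into sub-data segments.
    segments = [s.strip() for s in raw_attribute_data.split(delimiter) if s.strip()]
    # Clear (drop) every segment in which no attribute word is found.
    return [s for s in segments if any(word in s for word in attribute_words)]

# Hypothetical scraped attribute text for a "football" entity.
raw = "weight: 450 g; see also the gallery; diameter: 22 cm; click to enlarge"
kept = filter_attribute_segments(raw, attribute_words=["weight", "diameter"])
print(kept)
```

The segments that survive would then be the candidates for the real attribute data from which attribute values are taken.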
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the detailed description of the preferred embodiments below. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present application.
FIG. 1 is an overall flowchart of a knowledge fusion method in one embodiment of the present application;
FIG. 2 is a schematic diagram of the data acquisition process of a knowledge fusion method in one embodiment of the present application;
FIG. 3 is a schematic diagram of the vector generation process of a knowledge fusion method in one embodiment of the present application;
FIG. 4 is a structural diagram of a knowledge fusion apparatus in one embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present application means that the stated features, integers, steps, operations, elements, and/or components are present, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
FIG. 1 is an overall flowchart of a knowledge fusion method in one embodiment of the present application. As shown in FIG. 1, a knowledge fusion method includes:
S1, acquiring several pieces of knowledge data from a knowledge data source;
Specifically, the knowledge data in this step may come from the same knowledge data source or from different data sources, and may be local data or network data. If it is local data, the storage path of the knowledge data must be obtained when acquiring it; if it comes from a network data source, the network address of the knowledge data source must be obtained when acquiring it.
S2, extracting entity data from any piece of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector;
Specifically, a list of entity names stored in a database is obtained, at least one entity name is randomly selected from the list, and entity data is extracted from the knowledge data according to that entity name. Near-synonym extraction may also be used when extracting entity data: for example, if the entity name selected from the list is "basketball", then entity data whose entity names are other ball-sport nouns such as "football" and "volleyball" may also be extracted from the knowledge data.
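The selection and near-synonym matching described above can be sketched as follows. This is a minimal illustration only: `ENTITY_NAMES` and `RELATED_TERMS` are hypothetical stand-ins for the entity name list and the related-term data that the text says are stored in a database, and the tokenized input is an assumption.

```python
import random

# Hypothetical stand-ins for the stored entity name list and a related-term table.
ENTITY_NAMES = ["basketball", "violin"]
RELATED_TERMS = {"basketball": {"football", "volleyball"}}

def pick_entity_name(names, rng=random):
    """Randomly select one entity name from the stored list."""
    return rng.choice(names)

def extract_entities(tokens, entity_name, related=RELATED_TERMS):
    """Return the tokens that match the entity name or one of its related terms."""
    wanted = {entity_name} | related.get(entity_name, set())
    return [t for t in tokens if t in wanted]

tokens = ["the", "school", "offers", "basketball", "football", "and", "chess"]
print(extract_entities(tokens, "basketball"))  # ['basketball', 'football']
```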
S3, reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and multiplying the transposed two-dimensional word vector by the original two-dimensional word vector to obtain an entity data matrix, the elements of which are vectorized entity data;
Specifically, PCA may be used to reduce the dimension of the multi-dimensional word vector. For example, given m pieces of n-dimensional data, the dimension reduction may proceed as follows:
1) arrange the original data by column into a matrix X with n rows and m columns;
2) zero-center each row of X (each row represents one attribute field), that is, subtract the mean of that row;
3) compute the covariance matrix Y of X;
4) compute the eigenvalues of the covariance matrix Y and the corresponding eigenvectors r;
5) arrange the eigenvectors as rows, from top to bottom in descending order of their eigenvalues, to form a matrix Z, and take the first k rows to form a matrix Q;
6) project X with Q; the product QX is the data reduced to k dimensions.
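The PCA steps above can be sketched with NumPy as follows. The function name `pca_reduce` and the sample data are illustrative assumptions, and NumPy itself is not mentioned in the text. The sketch returns the projection QX, as in standard PCA.

```python
import numpy as np

def pca_reduce(samples, k):
    """Reduce m samples of dimension n to dimension k, following steps 1) to 6)."""
    X = np.asarray(samples, dtype=float).T   # 1) n rows, m columns
    X = X - X.mean(axis=1, keepdims=True)    # 2) zero-center each row
    Y = X @ X.T / X.shape[1]                 # 3) covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(Y)     # 4) eigen-decomposition of the symmetric matrix Y
    order = np.argsort(eigvals)[::-1]        # 5) order by descending eigenvalue
    Q = eigvecs[:, order[:k]].T              #    first k eigenvectors as rows
    return Q @ X                             # 6) k x m reduced data

# Four 3-dimensional samples that all lie on one line, reduced to 2 dimensions.
data = [[1, 1, 0], [2, 2, 0], [3, 3, 0], [4, 4, 0]]
reduced = pca_reduce(data, 2)
print(reduced.shape)  # (2, 4)
```

Because the sample points lie on a single line, all of the variance ends up in the first row of the result and the second row is numerically zero.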
S4, extracting original attribute data from any piece of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining attribute values of the real attribute data;
Specifically, the filtering mainly removes words that are irrelevant to the semantics. The original attribute data may be split into several sub-data segments, and each sub-data segment is then searched for attribute words; if a sub-data segment contains no attribute word, that segment is discarded.
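A minimal sketch of this filtering step is shown below. The attribute vocabulary `ATTRIBUTE_WORDS` and the choice of punctuation as the segment delimiter are assumptions made for illustration; the text does not specify either.

```python
import re

# Hypothetical attribute vocabulary; the text does not list the attribute words.
ATTRIBUTE_WORDS = {"height", "weight", "color"}

def filter_attribute_data(raw, attribute_words=ATTRIBUTE_WORDS):
    """Split raw attribute text into segments and keep those containing an attribute word."""
    segments = [s.strip() for s in re.split(r"[;,.]", raw) if s.strip()]
    kept = []
    for segment in segments:
        words = set(re.findall(r"\w+", segment.lower()))
        if words & attribute_words:   # segment mentions at least one attribute word
            kept.append(segment)
    return kept

raw = "height 198 cm; as everyone knows; weight 95 kg"
print(filter_attribute_data(raw))  # ['height 198 cm', 'weight 95 kg']
```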
S5, inputting the elements of the entity data matrix and the attribute values of the real attribute data into a credibility recognition model, whose output is the credibility of the knowledge data, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the threshold, the extracted original attribute data is fused, otherwise it is not fused.
Specifically, credibility means reliability: the degree to which the results agree when the same object is measured repeatedly by the same method. In other words, credibility refers to how reliable the measured data is. The preset credibility threshold is obtained from statistics on historical data, and is generally set to 95%.
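The threshold comparison in S5 can be sketched as below. The credibility values passed in are made-up inputs, since the credibility recognition model itself is not reproduced here; the record layout and the name `fuse_attributes` are likewise illustrative assumptions.

```python
CREDIBILITY_THRESHOLD = 0.95  # the 95% value mentioned in the text

def fuse_attributes(fused, candidate_attrs, credibility, threshold=CREDIBILITY_THRESHOLD):
    """Merge candidate attributes into the entity record only if credibility exceeds the threshold."""
    merged = dict(fused)
    if credibility > threshold:
        merged.update(candidate_attrs)
    return merged

record = {"name": "basketball"}
record = fuse_attributes(record, {"players_per_team": 5}, credibility=0.97)
record = fuse_attributes(record, {"ball_shape": "cube"}, credibility=0.40)
print(record)  # {'name': 'basketball', 'players_per_team': 5}
```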
In this embodiment, effective processing of the entity data and the attribute data achieves effective fusion of multiple attributes of the same entity.
FIG. 2 is a schematic diagram of the data acquisition process of a knowledge fusion method in one embodiment of the present application. As shown in FIG. 2, S1, acquiring several pieces of knowledge data from a knowledge data source, includes:
S101, sending a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted;
Specifically, the network address of the knowledge data source is obtained, and the type of the network address, that is, whether it is a static IP address or a dynamic IP address, is determined from the format of the address. If it is a static IP address, an IP address table is retrieved from the database and the static IP address is checked against it: if the address is in the table, the knowledge data acquisition instruction is sent; if not, it is not sent. If it is a dynamic IP address, DNS resolution is performed on the dynamic IP address to obtain the corresponding DNS resolution code, and a DNS resolution code table in the database is then used to check the code: if the DNS resolution code is in the table, the knowledge data acquisition instruction is sent; if not, it is not sent.
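One possible reading of this check is sketched below, under loudly stated assumptions: an address that parses as an IP literal is treated as the "static IP address" case, anything else is treated as a name that needs DNS resolution, and the "DNS resolution code" is taken to be the resolver's result. The two tables and the `resolve` callable are hypothetical stand-ins for the database tables and the DNS lookup.

```python
import ipaddress

# Hypothetical allow-lists standing in for the tables stored in the database.
IP_TABLE = {"203.0.113.10"}
DNS_CODE_TABLE = {"198.51.100.7"}

def should_send_instruction(address, resolve, ip_table=IP_TABLE, dns_table=DNS_CODE_TABLE):
    """Decide whether the acquisition instruction should be sent to `address`."""
    try:
        ipaddress.ip_address(address)   # succeeds only for literal IPv4/IPv6 addresses
    except ValueError:
        # Not an IP literal: resolve it and check the result (assumed interpretation).
        return resolve(address) in dns_table
    return address in ip_table

fake_dns = {"kb.example.org": "198.51.100.7"}.get   # offline stand-in for a DNS lookup
print(should_send_instruction("203.0.113.10", fake_dns))    # True
print(should_send_instruction("kb.example.org", fake_dns))  # True
print(should_send_instruction("192.0.2.1", fake_dns))       # False
```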
S102, receiving feedback information from the knowledge data source, extracting a keyword indicating the data source type from the feedback information, and determining the type of the knowledge data source according to the keyword;
Specifically, the form keyword indicates whether the knowledge data is structured, semi-structured, or unstructured. For example, if the form keyword "table" appears in the feedback information, the data is structured; if the form keyword "web page" appears, the data is semi-structured; if the form keyword "text" appears, the data is unstructured.
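The keyword lookup can be sketched as a simple mapping. The English keywords below are translations of the form keywords given in the text, and plain substring matching is an assumption made to keep the example short.

```python
# Form keywords and the data forms they indicate, as listed in the text.
FORM_KEYWORDS = {
    "table": "structured",
    "web page": "semi-structured",
    "text": "unstructured",
}

def detect_source_type(feedback):
    """Return the data form for the first form keyword found in the feedback, else None."""
    lowered = feedback.lower()
    for keyword, form in FORM_KEYWORDS.items():
        if keyword in lowered:
            return form
    return None

print(detect_source_type("The source serves a web page"))  # semi-structured
```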
S103, obtaining the extraction method corresponding to the type of the knowledge data source, and extracting several pieces of knowledge data from the knowledge data source according to the extraction method.
Specifically, different forms of data source call for different extraction methods. For example, semi-structured web page information is usually crawled with a web crawler, while unstructured text is usually extracted using a text language.
In this embodiment, the feedback information from the knowledge data source is analyzed to determine the data form of the source, so that the correct extraction method can be used to extract its knowledge data.
FIG. 3 is a schematic diagram of the vector generation process of a knowledge fusion method in one embodiment of the present application. As shown in FIG. 3, S2, extracting entity data from any piece of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector, includes:
S201, setting an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one piece of entity data;
Specifically, the length of the initial segment is set according to historical data on the lengths of entity words in entity data. For example, if the lengths of the entity words in the historical data stored in the database range from 1 to 10, the length of the initial segment is set to the maximum value, 10.
S202, dividing the knowledge data into several initial sub-data blocks according to the segment length of the initial segment; if any initial sub-data block contains two or more pieces of entity data, dividing that initial sub-data block again to obtain final sub-data blocks each containing only one piece of entity data;
Specifically, when the initial segment is divided, the sub-data blocks may differ in length; that is, the length of each sub-data block is determined by the length of the actual entity word.
S203, extracting the entity data in the final sub-data block, extracting semantic features of that entity data, converting the semantic features into an initial multi-dimensional word vector by a word vector conversion method, and multiplying the initial multi-dimensional word vector by the segment length of the final sub-data block, used as a coefficient, to obtain the final multi-dimensional word vector.
Specifically, semantic features include semantic, grammatical, structural, and other elements. The word vector conversion method usually uses the Word2Vec (Word2Vector) algorithm, which relates each semantic feature to its context so that interrelated semantic features are converted together into the initial multi-dimensional word vector.
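Steps S201 to S203 can be sketched as follows, assuming tokenized input and a known set of entity words (both hypothetical). The word vector itself is supplied as a plain list, because training or loading a Word2Vec model is outside the scope of this illustration.

```python
def split_into_blocks(tokens, entity_words, initial_length):
    """Cut tokens into blocks of initial_length, then re-split blocks holding several entities."""
    final_blocks = []
    for start in range(0, len(tokens), initial_length):
        block = tokens[start:start + initial_length]
        current, seen_entity = [], False
        for token in block:
            if token in entity_words and seen_entity:
                final_blocks.append(current)   # close the block before a second entity
                current, seen_entity = [], False
            current.append(token)
            seen_entity = seen_entity or token in entity_words
        if current:
            final_blocks.append(current)
    return final_blocks

def scale_by_block_length(vector, block):
    """Multiply the initial word vector by the block length, used as a coefficient."""
    return [len(block) * x for x in vector]

historical_lengths = [1, 3, 4]               # hypothetical entity word lengths
initial_length = max(historical_lengths)     # S201: use the maximum historical length
tokens = ["a", "basketball", "b", "football", "c", "d"]
blocks = split_into_blocks(tokens, {"basketball", "football"}, initial_length)
print(blocks)  # [['a', 'basketball', 'b'], ['football'], ['c', 'd']]
print(scale_by_block_length([0.5, -1.0], blocks[0]))  # [1.5, -3.0]
```

In this sketch a final block holds at most one entity word; a block that happens to hold none, such as the last one above, would simply be skipped by the later extraction step.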
In this embodiment, vectorizing the entity data gives it a numerical representation, which makes it convenient to compute similarity with machine learning methods.
In one embodiment, S3, reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and multiplying the transposed two-dimensional word vector by the original two-dimensional word vector to obtain an entity data matrix, the elements of which are vectorized entity data, includes:
obtaining K nearest neighbors of each sample point in the multi-dimensional word vector;
Specifically, a sample point is any point in the multi-dimensional vector. In the multi-dimensional space N, each sample point has directly connected points lying in the same plane; these points are called nearest neighbors. K ranges from 1 to n, where n is a positive integer.
establishing, from the K nearest neighbors of each sample point, a local weight matrix W_i = {w_i1, w_i2, …, w_iK} for each sample point;
mapping each sample point to a low-dimensional space according to its local weight matrix W_i = {w_i1, w_i2, …, w_iK}, the mapping condition being:
Figure PCTCN2019092597-appb-000001
where ε(Y) is the value of the loss function, y_ij is the value of a nearest neighbor, y_n is the output vector of a nearest neighbor, W_ij is an element of the local weight matrix, K is the number of nearest neighbors, and N is the number of elements in the nearest-neighbor output vector; the mapping yields the two-dimensional word vector Y = {y_1, y_2, …, y_N}; transposing the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiplying the two-dimensional word vector by the transposed two-dimensional word vector to obtain the entity data matrix, the elements of which are vectorized entity data.
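The formula image referenced above is not reproduced in this text. For orientation only, the symbol definitions match the embedding cost of standard locally linear embedding (LLE), which has the following form. This is an assumption based on the standard technique, not a transcription of the figure; here y_i denotes the low-dimensional image of the i-th sample point and y_ij the image of its j-th nearest neighbor.

```latex
\varepsilon(Y) = \sum_{i=1}^{N} \Bigl\| \, y_i - \sum_{j=1}^{K} W_{ij} \, y_{ij} \Bigr\|^{2}
```

Minimizing this cost keeps each mapped point close to the weighted combination of its mapped neighbors, using the same weights that were computed in the original space.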
In this embodiment, reducing the multi-dimensional word vector to a two-dimensional word vector makes it easier to match entity information with attribute information.
In one embodiment, S4, extracting original attribute data from any piece of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining attribute values of the real attribute data, includes:
extracting the original attribute data from any piece of the knowledge data, and discretizing the original attribute data to obtain discrete values of the original attribute data;
Specifically, discretization means mapping a finite number of individuals in an infinite space into a finite space, which improves the time and space efficiency of the algorithm. Before discretization, a deduplication function such as unique() may be used to remove duplicate data from the knowledge data, after which the knowledge data is discretized. The unique() function is a deduplication function supported by development and scientific computing environments such as C++, PHP, and Matlab; it removes duplicate values from a collection, or returns the distinct values of a collection.
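Deduplication followed by discretization can be sketched as below. The text does not say which discretization scheme is used, so equal-width binning is an assumption chosen for illustration, and the local `unique` helper is a stand-in for the unique() function of the environments named above.

```python
def unique(values):
    """Remove duplicates while keeping first-seen order (a stand-in for unique())."""
    return list(dict.fromkeys(values))

def discretize(values, bins):
    """Map each numeric value to an equal-width bin index in the range 0..bins-1."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0   # fall back to 1.0 when all values are equal
    return [min(int((v - low) / width), bins - 1) for v in values]

heights = [198, 175, 198, 160, 183, 175]
distinct = unique(heights)
print(distinct)                 # [198, 175, 160, 183]
print(discretize(distinct, 2))  # [1, 0, 0, 1]
```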
obtaining the vector dimension corresponding to the original attribute data according to the number of pieces of original attribute data in the knowledge data;
Here, the vector dimension of the original attribute data equals the number of pieces of original attribute data.
taking the difference between the discrete value and the vector dimension; if the difference is within a preset error threshold, the original attribute data is the real attribute data; if the difference is not within the error threshold, removing redundant attribute data from the original attribute data according to the difference to obtain the real attribute data;
obtaining the vector dimension corresponding to the real attribute data according to the number of pieces of real attribute data, and establishing a real attribute data vector;
reducing the dimension of the real attribute data vector to form a real attribute data matrix, and obtaining the eigenvalues of the real attribute data matrix, the eigenvalues being the attribute values.
Specifically, reducing the dimension of the real attribute data yields a two-dimensional attribute vector; transposing it yields a transposed two-dimensional attribute vector; and multiplying the two-dimensional attribute vector by the transposed two-dimensional attribute vector yields the real attribute data matrix.
In this embodiment, reducing the dimension of the original attribute data and converting it to matrix form yields more accurate real attribute values.
In one embodiment, S5, inputting the elements of the entity data matrix and the attribute values of the real attribute data into a credibility recognition model, whose output is the credibility of the knowledge data, and comparing the credibility with a preset credibility threshold, the extracted original attribute data being fused if the credibility is greater than the threshold and not fused otherwise, includes:
obtaining any element of the entity data matrix and any attribute value of the real attribute data, and inputting the element and the attribute value into a similarity distance function to calculate the similarity distance, using the formula:
Figure PCTCN2019092597-appb-000002
where L(m_1, m_2) is the similarity distance function, m_1 is the element, and m_2 is the attribute value;
calculating the credibility of the element and the attribute value from the similarity distance, using the formula:
Figure PCTCN2019092597-appb-000003
where Crd(m) is the credibility function and L(m_1, m_2) is the similarity distance function; comparing the credibility with the preset credibility threshold; if the credibility is greater than the threshold, the original attribute data corresponding to the same extracted entity data is fused, otherwise it is not fused.
Specifically, a cosine algorithm or a Euclidean distance algorithm, among others, may also be used for the similarity calculation. The credibility threshold is obtained from statistics on historical data.
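The two alternative measures named here can be sketched in plain Python as follows. The sketch assumes the element and the attribute value are represented as numeric vectors of equal length; it does not reproduce the similarity distance function or the credibility function from the formula images.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Cosine similarity compares direction only, so it ignores differences in magnitude, whereas Euclidean distance grows with them; which one suits the data better depends on whether vector length carries meaning.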
In this embodiment, calculating the credibility of the entity data and the attribute data improves the accuracy of attribute data fusion.
In one embodiment, S103, obtaining the extraction method corresponding to the type of the knowledge data source and extracting several pieces of knowledge data from the knowledge data source according to the extraction method, includes:
if the knowledge data source is in the form of a web page, extracting with a web crawler tool, which includes:
obtaining a keyword group from the task queue of knowledge data to be extracted, the keyword group containing multiple keywords; the keyword group in the task queue may be a descriptive phrase, for example "ball games", and the keywords under this keyword group may include "basketball", "football", "table tennis", and so on.
traversing the keyword group, and crawling, with a web crawler, the information on the web page corresponding to each keyword in the keyword group; obtaining all entity information in the information on the web page, and importing the entity information into a preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, crawling the web page again with the web crawler; otherwise, using the web page information as the knowledge data.
Specifically, entity information is information related to an "entity", such as the entity name. When importing into the preset knowledge data table, the entity names in the table are searched first; if the entity name in a piece of entity information is not in the preset knowledge data table, that entity information cannot be imported. The preset knowledge data table is stored in the database and is compiled from previous rounds of knowledge data collection.
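The crawl, import, and re-crawl loop can be sketched as below. The `crawl` callable, the page layout it returns, and the `max_attempts` limit are all hypothetical: the text does not describe the crawler's interface and does not bound the number of retries, so the limit is added here only to keep the sketch from looping forever.

```python
def collect_knowledge(keywords, crawl, known_entity_names, max_attempts=3):
    """Crawl one page per keyword; accept a page only if all of its entities can be imported."""
    accepted = {}
    for keyword in keywords:
        for _ in range(max_attempts):
            page = crawl(keyword)   # assumed to return {"text": ..., "entities": [...]}
            if all(name in known_entity_names for name in page["entities"]):
                accepted[keyword] = page["text"]   # every entity is in the preset table
                break
    return accepted

# Offline stand-in for a crawler: the first crawl returns an unknown entity, later ones succeed.
calls = {"n": 0}
def fake_crawl(keyword):
    calls["n"] += 1
    entities = ["unknown"] if calls["n"] == 1 else [keyword]
    return {"text": keyword + " page", "entities": entities}

result = collect_knowledge(["basketball"], fake_crawl, {"basketball"})
print(result)  # {'basketball': 'basketball page'}
```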
In this embodiment, the required knowledge data can be effectively extracted from web page information, which improves the efficiency of knowledge data extraction.
In one embodiment, a knowledge fusion apparatus is provided. As shown in FIG. 4, it includes the following modules:
a data acquisition module 41, configured to acquire several pieces of knowledge data from a knowledge data source;
a vector generation module 42, configured to extract entity data from any piece of the knowledge data and vectorize the entity data to generate a multi-dimensional word vector;
a data vectorization module 43, configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to multiply the transposed two-dimensional word vector by the original two-dimensional word vector to obtain an entity data matrix, the elements of which are vectorized entity data;
an attribute value obtaining module 44, configured to extract original attribute data from any piece of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain attribute values of the real attribute data;
a fusion determination module 45, configured to input the elements of the entity data matrix and the attribute values of the real attribute data into a credibility recognition model, whose output is the credibility of the knowledge data, and to compare the credibility with a preset credibility threshold; if the credibility is greater than the threshold, the extracted original attribute data is fused, otherwise it is not fused.
In one embodiment, the vector generation module is further configured to:
send a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted; receive feedback information from the knowledge data source, extract a keyword indicating the data source type from the feedback information, and determine the type of the knowledge data source according to the keyword; obtain the extraction method corresponding to the type of the knowledge data source, and extract several pieces of knowledge data from the knowledge data source according to the extraction method.
In one embodiment, the data acquisition module is further configured to:
set an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one piece of entity data; divide the knowledge data into several initial sub-data blocks according to the segment length of the initial segment, and, if any initial sub-data block contains two or more pieces of entity data, divide that initial sub-data block again to obtain final sub-data blocks each containing only one piece of entity data; extract the entity data in the final sub-data block, extract semantic features of that entity data, convert the semantic features into an initial multi-dimensional word vector by a word vector conversion method, and multiply the initial multi-dimensional word vector by the segment length of the final sub-data block, used as a coefficient, to obtain the final multi-dimensional word vector.
In one embodiment, the data vectorization module is further configured to:
obtain K nearest neighbors of each sample point in the multi-dimensional word vector; establish, from the K nearest neighbors of each sample point, a local weight matrix W_i = {w_i1, w_i2, …, w_iK} for each sample point;
map each sample point to a low-dimensional space according to its local weight matrix W_i = {w_i1, w_i2, …, w_iK}, the mapping yielding the two-dimensional word vector Y = {y_1, y_2, …, y_N}; transpose the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiply the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements of which are vectorized entity data.
In one embodiment, the attribute value obtaining module is further configured to:
extract the original attribute data from any piece of the knowledge data, and discretize the original attribute data to obtain discrete values of the original attribute data; obtain the vector dimension corresponding to the original attribute data according to the number of pieces of original attribute data in the knowledge data; take the difference between the discrete value and the vector dimension, the original attribute data being the real attribute data if the difference is within a preset error threshold, and, if the difference is not within the error threshold, remove redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; obtain the vector dimension corresponding to the real attribute data according to the number of pieces of real attribute data, and establish a real attribute data vector;
reduce the dimension of the real attribute data vector to form a real attribute data matrix, and obtain the eigenvalues of the real attribute data matrix, the eigenvalues being the attribute values.
In one embodiment, the fusion determination module is further configured to:
obtain any element of the entity data matrix and any attribute value of the real attribute data, and input the element and the attribute value into a similarity distance function to calculate the similarity distance; calculate the credibility of the element and the attribute value from the similarity distance; compare the credibility with the preset credibility threshold, and, if the credibility is greater than the threshold, fuse the original attribute data corresponding to the same extracted entity data, otherwise do not fuse it.
In one embodiment, the vector generation module is further configured to:
obtain a keyword group from the task queue of knowledge data to be extracted, the keyword group containing multiple keywords; traverse the keyword group, and crawl, with a web crawler, the information on the web page corresponding to each keyword in the keyword group; obtain all entity information in the information on the web page, and import the entity information into a preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, crawl the web page again with the web crawler; otherwise, use the web page information as the knowledge data.
In one embodiment, a computer device is provided. The computer device includes a memory and a processor, the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, cause the processor to perform the steps of the knowledge fusion method described in the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided. The computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the knowledge fusion method described in the above embodiments. The storage medium may be a non-volatile storage medium.
A person of ordinary skill in the art will understand that all or part of the steps in the methods of the foregoing embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only some exemplary embodiments of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the present application. It should be noted that a person of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A knowledge fusion method, comprising:
    obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining an attribute value of the real attribute data; inputting the elements of the entity data matrix and the attribute value of the real attribute data into a credibility recognition model, whose output gives the credibility of the knowledge data, comparing the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
  2. The knowledge fusion method according to claim 1, wherein obtaining several pieces of knowledge data from the knowledge data source comprises:
    sending a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted; receiving feedback information from the knowledge data source, extracting keywords indicating the data source type from the feedback information, and determining the type of the knowledge data source according to the keywords; obtaining the extraction method corresponding to the type of the knowledge data source, and extracting several pieces of knowledge data from the knowledge data source according to the extraction method.
  3. The knowledge fusion method according to claim 1, wherein extracting entity data from any of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector comprises:
    setting an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one piece of the entity data; dividing the knowledge data into several initial sub-data blocks according to the segment length of the initial segment, and if any of the initial sub-data blocks contains two or more pieces of entity data, dividing that initial sub-data block again to obtain final sub-data blocks each containing only one piece of the entity data; extracting the entity data from the final sub-data block, extracting semantic features of the entity data in the final sub-data block, converting the semantic features into an initial multi-dimensional word vector by a word vector conversion method, and multiplying the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
  4. The knowledge fusion method according to claim 1, wherein reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data, comprises:
    obtaining the K nearest neighbors of each sample point in the multi-dimensional word vector; establishing a local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$ for each sample point according to its K nearest neighbors; mapping each sample point to a low-dimensional space according to its local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$, the mapping condition being:
    Figure PCTCN2019092597-appb-100001
    where $\varepsilon(Y)$ is the loss function value, $y_{ij}$ is the value of a nearest neighbor, $y_n$ is the output vector of a nearest neighbor, $w_{ij}$ is an element of the local weight matrix, K is the number of nearest neighbors, and N is the number of elements in the nearest-neighbor output vector, the mapping yielding the two-dimensional word vector $Y = \{y_1, y_2, \dots, y_N\}$; transposing the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiplying the two-dimensional word vector by the transposed two-dimensional word vector to obtain the entity data matrix, the elements of the entity data matrix being vectorized entity data.
  5. The knowledge fusion method according to claim 1, wherein extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining an attribute value of the real attribute data comprises:
    extracting original attribute data from any of the knowledge data, and discretizing the original attribute data to obtain a discrete value of the original attribute data; obtaining the vector dimension corresponding to the original attribute data according to the quantity of the original attribute data in the knowledge data; taking the difference between the discrete value and the vector dimension, and if the difference is within a preset error threshold, taking the original attribute data as the real attribute data; if the difference is not within the error threshold, removing redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; obtaining the vector dimension corresponding to the real attribute data according to the quantity of the real attribute data, and establishing a real attribute data vector; reducing the dimension of the real attribute data vector to form a real attribute data matrix, and obtaining an eigenvalue of the real attribute data matrix, the eigenvalue being the attribute value.
  6. The knowledge fusion method according to claim 1, wherein inputting the elements of the entity data matrix and the attribute value of the real attribute data into a credibility recognition model, whose output gives the credibility of the knowledge data, comparing the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fusing the extracted original attribute data, otherwise not fusing, comprises:
    obtaining an element of any entity data matrix and an attribute value of any real attribute data, and inputting the element and the attribute value into a similarity distance function to calculate a similarity distance, the calculation formula being:
    Figure PCTCN2019092597-appb-100002
    where $L(m_1, m_2)$ is the similarity distance function, $m_1$ is the element, and $m_2$ is the attribute value; calculating the credibility of the element and the attribute value from the similarity distance, the calculation formula being:
    Figure PCTCN2019092597-appb-100003
    where $Crd(m)$ is the credibility function and $L(m_1, m_2)$ is the similarity distance function; comparing the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fusing the original attribute data corresponding to the same extracted entity data; otherwise, not fusing.
  7. The knowledge fusion method according to claim 2, wherein obtaining the extraction method corresponding to the type of the knowledge data source and extracting several pieces of knowledge data from the knowledge data source according to the extraction method comprises: if the knowledge data source is in the form of web pages, extracting with a web crawler tool, including: obtaining a keyword group from the task queue of knowledge data to be extracted, the keyword group containing multiple keywords; traversing the keyword group, and using a web crawler to crawl the information on the web page corresponding to each keyword in the keyword group; obtaining all entity information in the information on the web page, and importing the entity information into a preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, crawling the web page again with the web crawler; otherwise, using the web page information as the knowledge data.
  8. A knowledge fusion apparatus, comprising:
    a data acquisition module, configured to obtain several pieces of knowledge data from a knowledge data source;
    a vector generation module, configured to extract entity data from any of the knowledge data and vectorize the entity data to generate a multi-dimensional word vector;
    a data vectorization module, configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data;
    an attribute value acquisition module, configured to extract original attribute data from any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain an attribute value of the real attribute data;
    a fusion determination module, configured to input the elements of the entity data matrix and the attribute value of the real attribute data into a credibility recognition model, whose output gives the credibility of the knowledge data, to compare the credibility with a preset credibility threshold, and, if the credibility is greater than the credibility threshold, to fuse the extracted original attribute data, and otherwise not to fuse.
  9. The knowledge fusion apparatus according to claim 8, wherein the vector generation module is further configured to:
    send a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted; receive feedback information from the knowledge data source, extract keywords indicating the data source type from the feedback information, and determine the type of the knowledge data source according to the keywords; obtain the extraction method corresponding to the type of the knowledge data source, and extract several pieces of knowledge data from the knowledge data source according to the extraction method.
  10. The knowledge fusion apparatus according to claim 8, wherein the data acquisition module is further configured to:
    set an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one piece of the entity data; divide the knowledge data into several initial sub-data blocks according to the segment length of the initial segment, and if any of the initial sub-data blocks contains two or more pieces of entity data, divide that initial sub-data block again to obtain final sub-data blocks each containing only one piece of the entity data; extract the entity data from the final sub-data block, extract semantic features of the entity data in the final sub-data block, convert the semantic features into an initial multi-dimensional word vector by a word vector conversion method, and multiply the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
  11. The knowledge fusion apparatus according to claim 8, wherein the data vectorization module is further configured to:
    obtain the K nearest neighbors of each sample point in the multi-dimensional word vector; establish a local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$ for each sample point according to its K nearest neighbors; map each sample point to a low-dimensional space according to its local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$, the mapping yielding the two-dimensional word vector $Y = \{y_1, y_2, \dots, y_N\}$; transpose the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiply the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data.
  12. The knowledge fusion apparatus according to claim 8, wherein the attribute value acquisition module is further configured to:
    extract original attribute data from any of the knowledge data, and discretize the original attribute data to obtain a discrete value of the original attribute data; obtain the vector dimension corresponding to the original attribute data according to the quantity of the original attribute data in the knowledge data; take the difference between the discrete value and the vector dimension, and if the difference is within a preset error threshold, take the original attribute data as the real attribute data; if the difference is not within the error threshold, remove redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; obtain the vector dimension corresponding to the real attribute data according to the quantity of the real attribute data, and establish a real attribute data vector; reduce the dimension of the real attribute data vector to form a real attribute data matrix, and obtain an eigenvalue of the real attribute data matrix, the eigenvalue being the attribute value.
  13. The knowledge fusion apparatus according to claim 8, wherein the fusion determination module is further configured to:
    obtain an element of any entity data matrix and an attribute value of any real attribute data, and input the element and the attribute value into a similarity distance function to calculate a similarity distance; calculate the credibility of the element and the attribute value from the similarity distance; compare the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fuse the original attribute data corresponding to the same extracted entity data; otherwise, do not fuse.
  14. The knowledge fusion apparatus according to claim 9, wherein the vector generation module is further configured to:
    obtain a keyword group from the task queue of knowledge data to be extracted, the keyword group containing multiple keywords; traverse the keyword group, and use a web crawler to crawl the information on the web page corresponding to each keyword in the keyword group; obtain all entity information in the information on the web page, and import the entity information into a preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, crawl the web page again with the web crawler; otherwise, use the web page information as the knowledge data.
  15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
    obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining an attribute value of the real attribute data; inputting the elements of the entity data matrix and the attribute value of the real attribute data into a credibility recognition model, whose output gives the credibility of the knowledge data, comparing the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
  16. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining an attribute value of the real attribute data; inputting the elements of the entity data matrix and the attribute value of the real attribute data into a credibility recognition model, whose output gives the credibility of the knowledge data, comparing the credibility with a preset credibility threshold, and if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
  17. The storage medium storing computer-readable instructions according to claim 16, wherein, in obtaining several pieces of knowledge data from the knowledge data source, the processor is caused to perform the following steps:
    sending a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted; receiving feedback information from the knowledge data source, extracting keywords indicating the data source type from the feedback information, and determining the type of the knowledge data source according to the keywords; obtaining the extraction method corresponding to the type of the knowledge data source, and extracting several pieces of knowledge data from the knowledge data source according to the extraction method.
  18. The storage medium storing computer-readable instructions according to claim 16, wherein, in extracting entity data from any of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector, the processor is caused to perform the following steps:
    setting an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one piece of the entity data; dividing the knowledge data into several initial sub-data blocks according to the segment length of the initial segment, and if any of the initial sub-data blocks contains two or more pieces of entity data, dividing that initial sub-data block again to obtain final sub-data blocks each containing only one piece of the entity data; extracting the entity data from the final sub-data block, extracting semantic features of the entity data in the final sub-data block, converting the semantic features into an initial multi-dimensional word vector by a word vector conversion method, and multiplying the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
  19. The storage medium storing computer-readable instructions according to claim 16, wherein, in reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix whose elements are vectorized entity data, the processor is caused to perform the following steps:
    obtaining the K nearest neighbors of each sample point in the multi-dimensional word vector; establishing a local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$ for each sample point according to its K nearest neighbors; mapping each sample point to a low-dimensional space according to its local weight matrix $W_i = \{w_{i1}, w_{i2}, \dots, w_{iK}\}$, the mapping yielding the two-dimensional word vector $Y = \{y_1, y_2, \dots, y_N\}$; transposing the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiplying the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements of the entity data matrix being vectorized entity data.
  20. The storage medium storing computer-readable instructions according to claim 16, wherein, in extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining an attribute value of the real attribute data, the processor is caused to perform the following steps:
    extracting original attribute data from any of the knowledge data, and discretizing the original attribute data to obtain a discrete value of the original attribute data; obtaining the vector dimension corresponding to the original attribute data according to the quantity of the original attribute data in the knowledge data; taking the difference between the discrete value and the vector dimension, and if the difference is within a preset error threshold, taking the original attribute data as the real attribute data; if the difference is not within the error threshold, removing redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; obtaining the vector dimension corresponding to the real attribute data according to the quantity of the real attribute data, and establishing a real attribute data vector; reducing the dimension of the real attribute data vector to form a real attribute data matrix, and obtaining an eigenvalue of the real attribute data matrix, the eigenvalue being the attribute value.
PCT/CN2019/092597 2019-01-11 2019-06-24 Knowledge fusion method and apparatus, computer device, and storage medium WO2020143184A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910025114.4A CN109886294B (en) 2019-01-11 2019-01-11 Knowledge fusion method, apparatus, computer device and storage medium
CN201910025114.4 2019-01-11

Publications (1)

Publication Number Publication Date
WO2020143184A1 true WO2020143184A1 (en) 2020-07-16

Family

ID=66925944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/092597 WO2020143184A1 (en) 2019-01-11 2019-06-24 Knowledge fusion method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109886294B (en)
WO (1) WO2020143184A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886294B (en) * 2019-01-11 2024-01-23 平安科技(深圳)有限公司 Knowledge fusion method, apparatus, computer device and storage medium
CN110807102B (en) * 2019-09-19 2023-09-29 平安科技(深圳)有限公司 Knowledge fusion method, apparatus, computer device and storage medium
CN111159328A (en) * 2019-11-20 2020-05-15 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Information knowledge fusion system and method
CN111782818A (en) * 2020-06-05 2020-10-16 牛张明 Device, method and system for constructing biomedical knowledge graph and memory
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment
CN112988964B (en) * 2021-02-20 2024-03-08 平安科技(深圳)有限公司 Text prosody boundary prediction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536664A (en) * 2017-03-01 2018-09-14 华东师范大学 The knowledge fusion method in commodity field
US20180268024A1 (en) * 2017-03-20 2018-09-20 International Business Machines Corporation Image support for cognitive intelligence queries
CN108647318A (en) * 2018-05-10 2018-10-12 北京航空航天大学 A kind of knowledge fusion method based on multi-source data
CN109886294A (en) * 2019-01-11 2019-06-14 平安科技(深圳)有限公司 Knowledge fusion method, apparatus, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810526B (en) * 2014-01-28 2016-09-21 北京仿真中心 A kind of knowledge fusion method based on D-S evidence theory
CN108804521B (en) * 2018-04-27 2021-05-14 南京柯基数据科技有限公司 Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036481B (en) * 2020-08-31 2024-04-05 国家电网有限公司 Reverse verification method for improving fusion effect
CN112036481A (en) * 2020-08-31 2020-12-04 国家电网有限公司 Reverse verification method for improving fusion effect
CN112182320B (en) * 2020-09-25 2023-12-26 中国建设银行股份有限公司 Cluster data processing method, device, computer equipment and storage medium
CN112182320A (en) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Clustering data processing method and device, computer equipment and storage medium
CN112784065A (en) * 2021-02-01 2021-05-11 东北大学 Unsupervised knowledge graph fusion method and unsupervised knowledge graph fusion device based on multi-order neighborhood attention network
CN112784065B (en) * 2021-02-01 2023-07-14 东北大学 Unsupervised knowledge graph fusion method and device based on multi-order neighborhood attention network
CN113111657A (en) * 2021-03-04 2021-07-13 浙江工业大学 Cross-language knowledge graph alignment and fusion method, device and storage medium
CN112949745A (en) * 2021-03-23 2021-06-11 中国检验检疫科学研究院 Fusion processing method and device for multi-source data, electronic equipment and storage medium
CN112949745B (en) * 2021-03-23 2024-04-19 中国检验检疫科学研究院 Fusion processing method and device for multi-source data, electronic equipment and storage medium
CN113468255B (en) * 2021-06-25 2023-04-07 西安电子科技大学 Knowledge graph-based data fusion method in social security comprehensive treatment field
CN113468255A (en) * 2021-06-25 2021-10-01 西安电子科技大学 Knowledge graph-based data fusion method in social security comprehensive treatment field
CN113723047A (en) * 2021-07-27 2021-11-30 山东旗帜信息有限公司 Map construction method, device and medium based on legal document
CN114139547B (en) * 2021-11-25 2023-07-04 北京中科闻歌科技股份有限公司 Knowledge fusion method, device, equipment, system and medium
CN114139547A (en) * 2021-11-25 2022-03-04 北京中科闻歌科技股份有限公司 Knowledge fusion method, device, equipment, system and medium
CN114625875A (en) * 2022-03-09 2022-06-14 平安科技(深圳)有限公司 Pattern matching method, device, storage medium and equipment for multi-data source information
CN114625875B (en) * 2022-03-09 2024-03-29 平安科技(深圳)有限公司 Pattern matching method, device, storage medium and equipment for multiple data source information
CN117033541B (en) * 2023-10-09 2023-12-19 中南大学 Space-time knowledge graph indexing method and related equipment
CN117033541A (en) * 2023-10-09 2023-11-10 中南大学 Space-time knowledge graph indexing method and related equipment

Also Published As

Publication number Publication date
CN109886294A (en) 2019-06-14
CN109886294B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
WO2020143184A1 (en) Knowledge fusion method and apparatus, computer device, and storage medium
WO2020143326A1 (en) Knowledge data storage method, device, computer apparatus, and storage medium
US9773053B2 (en) Method and apparatus for processing electronic data
TWI496015B (en) Text matching method and device
US8719267B2 (en) Spectral neighborhood blocking for entity resolution
CN112037920A (en) Medical knowledge map construction method, device, equipment and storage medium
US20190347281A1 (en) Apparatus and method for semantic search
CN102123172B (en) Implementation method of Web service discovery based on neural network clustering optimization
JP2003030222A (en) Method and system for retrieving, detecting and identifying main cluster and outlier cluster in large scale database, recording medium and server
CN105404674B (en) Knowledge-dependent webpage information extraction method
JP2003288362A (en) Specified element vector generating device, character string vector generating device, similarity calculation device, specified element vector generating program, character string vector generating program, similarity calculation program, specified element vector generating method, character string vector generating method, and similarity calculation method
TWI656450B (en) Method and system for extracting knowledge from Chinese corpus
KR102059743B1 (en) Method and system for providing biomedical passage retrieval using deep-learning based knowledge structure construction
Korobkin et al. A multi-stage algorithm for text documents filtering based on physical knowledge
CN110019474B (en) Automatic synonymy data association method and device in heterogeneous database and electronic equipment
CN113486187A (en) Buddhism knowledge graph construction method, device, equipment and storage medium
Benny et al. Hadoop framework for entity resolution within high velocity streams
Xu et al. Redundant features removal for unsupervised spectral feature selection algorithms: An empirical study based on nonparametric sparse feature graph
JP5533272B2 (en) Data output device, data output method, and data output program
Wongchaisuwat Automatic keyword extraction using textrank
US20230097665A1 (en) Tag domain presentation device, tag domain presentation method, and information processing system using the same
CN114691845A (en) Semantic search method and device, electronic equipment, storage medium and product
Zhang et al. How to find valuable references? Application of text mining in abstract clustering
JP2005025465A (en) Document search method and device
Jahanbakhsh Gudakahriz et al. Opinion texts clustering using manifold learning based on sentiment and semantics analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19909177

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 01.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19909177

Country of ref document: EP

Kind code of ref document: A1