WO2020143184A1

WO2020143184A1 - Knowledge fusion method and apparatus, computer device, and storage medium

Info

Publication number: WO2020143184A1
Application number: PCT/CN2019/092597
Authority: WO
Inventors: 孙佳兴; 胡逸凡; 陈泽晖; 黄鸿顺
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-01-11
Filing date: 2019-06-24
Publication date: 2020-07-16
Also published as: CN109886294B; CN109886294A

Abstract

The present application relates to the technical field of knowledge graphs, in particular, to a knowledge fusion method and apparatus, a computer device, and a storage medium. The method comprises: obtaining multiple pieces of knowledge data in a knowledge data source; extracting entity data in any knowledge data, and performing vectorization conversion on the entity data to generate a multi-dimensional word vector; performing dimensionality reduction on the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and then multiplying by the original two-dimensional word vector to obtain an entity data matrix, elements in the entity data matrix being vectorized entity data; obtaining an attribute value of real attribute data; and inputting the elements in the entity data matrix and the attribute value of the real attribute data as parameters into a credibility identification model, obtaining the credibility of the knowledge data after parameter output, comparing the credibility with a preset credibility threshold value, and performing fusion. The present application achieves effective fusion of multiple attributes in the same entity.

Description

Knowledge fusion method, device, computer equipment and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on January 11, 2019, with application number 201910025114.4 and the invention title "Knowledge Fusion Methods, Devices, Computer Equipment, and Storage Media", the entire contents of which are incorporated into this application by reference.

Technical field

The present application relates to the field of knowledge graph technology, and in particular, to a knowledge fusion method, device, computer equipment, and storage medium.

Background

The Internet today holds a large amount of knowledge, and the data contained in each web page includes knowledge data in various forms. Knowledge data is composed of three parts: entity information, relationship information, and attribute information. When organizing knowledge data, the knowledge data must be fused; this process is called knowledge fusion.

Knowledge fusion refers to discovering different expressions of the same concept in heterogeneous databases. It organizes and manages distributed data sources and knowledge sources, and transforms, integrates, and fuses knowledge elements according to application requirements to obtain valuable or usable new knowledge, while optimizing the structure and connotation of knowledge objects and providing knowledge-based services. Research on knowledge fusion is valuable for knowledge sharing, for the interaction, integration, and collaborative work of knowledge systems, and for optimizing the quality of knowledge services in a distributed knowledge base environment. It is also of great significance for research on the discovery, creation, organization, evaluation, and optimization of new knowledge based on knowledge connotation.

At present, attributes cannot be accurately judged during knowledge fusion, so multiple attributes belonging to the same entity cannot be effectively merged.

Summary of the invention

Based on this, it is necessary to provide a knowledge fusion method, device, computer equipment, and storage medium that address the problem that multiple attributes belonging to the same entity cannot be effectively merged.

A method of knowledge fusion, including:

Obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after the parameters are output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.

A knowledge fusion device includes the following modules:

The data acquisition module is configured to acquire several pieces of knowledge data from a knowledge data source;

The vector generation module is configured to extract entity data from any of the knowledge data and vectorize the entity data to generate a multi-dimensional word vector;

The data vectorization module is configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;

The attribute value obtaining module is configured to extract original attribute data from any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain the attribute values of the real attribute data;

The fusion determination module is configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtain the credibility of the knowledge data after the parameters are output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.

A computer device includes a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above knowledge fusion method.

A storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above knowledge fusion method.

The above knowledge fusion method, device, computer equipment, and storage medium comprise: acquiring several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after the parameters are output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused. Through accurate matching of entities and attributes, this technical solution achieves effective fusion of multiple attributes of the same entity.

Brief description of the drawings

By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of showing the preferred embodiments, and are not considered to limit the present application.

FIG. 1 is an overall flowchart of a knowledge fusion method in an embodiment of this application;

FIG. 2 is a schematic diagram of a data acquisition process of a knowledge fusion method in an embodiment of the present application;

FIG. 3 is a schematic diagram of a vector generation process of a knowledge fusion method in an embodiment of the present application;

FIG. 4 is a structural diagram of a knowledge fusion device in an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the following describes the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

Those skilled in the art can understand that, unless specifically stated, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of this application refers to the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

FIG. 1 is an overall flowchart of a knowledge fusion method in an embodiment of the present application. As shown in FIG. 1, a knowledge fusion method includes:

S1, obtaining several knowledge data from the source of knowledge data;

Specifically, the knowledge data in this step may come from the same knowledge data source or from different data sources, and may come from local data or from network data. If it comes from local data, the storage path of the knowledge data must be obtained when acquiring the knowledge data; if it comes from a network data source, the network address of the knowledge data source must be obtained when acquiring the knowledge data.

S2, extracting entity data in any of the knowledge data, vectorizing the entity data, and generating a multi-dimensional word vector;

Specifically, the entity name list stored in the database is obtained, at least one entity name is randomly extracted from the list, and entity data is extracted from the knowledge data according to that entity name. Synonym extraction may also be used when extracting entity data. For example, if the entity name extracted from the entity name list is "basketball", then entity data whose entity names are other ball-sport nouns, such as "soccer" and "volleyball", may also be extracted from the knowledge data.

S3, performing dimension reduction on the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;

Specifically, PCA can be used to reduce the dimension of multidimensional word vectors. For example, if there are m pieces of n-dimensional data, the following steps can be used to reduce the dimension:

1) Arrange the original data into a matrix X with n rows and m columns;

2) Zero-mean each row of X (each row represents an attribute field), that is, subtract the mean of the row from each of its entries;

3) Compute the covariance matrix Y of X;

4) Compute the eigenvalues of the covariance matrix Y and the corresponding eigenvectors r;

5) Arrange the eigenvectors as the rows of a matrix Z, ordered from top to bottom by decreasing eigenvalue, and take the first k rows;

6) The matrix Q, the product of those k rows and X, is the data after dimensionality reduction to k dimensions.
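The six PCA steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of the application: the function name `pca_reduce` is hypothetical, and the covariance is normalized by m (dividing by m - 1 would work equally well for this purpose).

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce data to k dimensions; X has n rows (fields) and m columns (samples)."""
    X = np.asarray(X, dtype=float)
    # Step 2: zero-mean each row.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 3: covariance matrix Y (n x n).
    m = Xc.shape[1]
    Y = Xc @ Xc.T / m
    # Step 4: eigenvalues and eigenvectors (eigh, because Y is symmetric).
    eigvals, eigvecs = np.linalg.eigh(Y)
    # Step 5: eigenvectors as rows of Z, sorted by decreasing eigenvalue.
    order = np.argsort(eigvals)[::-1]
    Z = eigvecs[:, order].T
    # Step 6: Q is the product of the first k rows of Z and the centered X.
    return Z[:k] @ Xc

rng = np.random.default_rng(0)
Q = pca_reduce(rng.normal(size=(5, 20)), k=2)
print(Q.shape)
```

With k = 2, the result has two rows, matching the two-dimensional word vector that S3 requires.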

S4. Extract the original attribute data in any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain attribute values of the real attribute data;

Specifically, when filtering, words that have nothing to do with semantics are filtered out. The original attribute data can be divided into several sub-data segments, and then the attribute word query can be performed on the data in each sub-data segment. If there is no attribute word, the sub-data segment is cleared.
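The segment-based filtering described above might look like the following sketch. It assumes the original attribute data has already been tokenized into a list of words and that a set of known attribute words is available; the names `filter_attribute_data`, `attribute_words`, and `segment_len` are hypothetical and do not come from the application.

```python
def filter_attribute_data(tokens, attribute_words, segment_len):
    """Split tokens into sub-data segments and keep only those with an attribute word."""
    segments = [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]
    # A segment without any attribute word is cleared (dropped).
    return [seg for seg in segments if any(tok in attribute_words for tok in seg)]

raw = ["height", "2.26m", "um", "well", "weight", "141kg"]
kept = filter_attribute_data(raw, {"height", "weight"}, segment_len=2)
print(kept)
```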

S5, inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after the parameters are output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.

Specifically, credibility is reliability, which refers to the degree of consistency of the results obtained when the same method is used to measure the same object repeatedly; credibility also refers to the reliability of the measured data. The preset credibility threshold is obtained from historical data statistics, and is typically set to 95%.

In this embodiment, by effectively processing the entity data and the attribute data, an effective fusion of multiple attributes of the same entity is achieved.

FIG. 2 is a schematic diagram of a data acquisition process of a knowledge fusion method in an embodiment of the present application. As shown in the figure, S1, acquiring several pieces of knowledge data from a knowledge data source, includes:

S101. Send a knowledge data extraction instruction to the source of the knowledge data to be extracted;

Specifically, the network address of the knowledge data source from which knowledge data is to be extracted is obtained, and the type of the network address, that is, whether it is a static IP address or a dynamic IP address, is determined according to the format of the network address. If it is a static IP address, the IP address table is retrieved from the database for comparison to determine whether the static IP address is in the IP address table; the knowledge data acquisition instruction is sent if it is, and not sent if it is not. If it is a dynamic IP address, DNS resolution is performed on the dynamic IP address to obtain the corresponding DNS resolution code, and the DNS resolution code table in the database is then consulted to determine whether the DNS resolution code is in the table; the knowledge data acquisition instruction is sent if it is, and not sent if it is not.

S102. Receive feedback information from the knowledge data source, extract keywords of the data source type from the feedback information, and determine the type of the knowledge data source according to the keywords;

Specifically, the keyword of the data source type (the form keyword) indicates whether the knowledge data is structured, semi-structured, or unstructured. For example, the form keyword "table" in the feedback information corresponds to structured data, the form keyword "webpage" corresponds to semi-structured data, and the form keyword "text" corresponds to unstructured data.
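The keyword-to-type mapping in this example can be expressed as a simple lookup. The sketch below assumes the feedback information is a plain string and uses naive substring matching; the names `FORM_KEYWORDS` and `classify_source` are hypothetical, and a real system would likely need stricter matching.

```python
# Form keywords from the example above, mapped to the data source type.
FORM_KEYWORDS = {
    "table": "structured",
    "webpage": "semi-structured",
    "text": "unstructured",
}

def classify_source(feedback):
    """Return the data source type named by the first form keyword found, else None."""
    lowered = feedback.lower()
    for keyword, source_type in FORM_KEYWORDS.items():
        if keyword in lowered:
            return source_type
    return None

print(classify_source("format: table, rows: 120"))
```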

S103: Acquire an extraction method corresponding to the type of the knowledge data source, and extract several knowledge data of the knowledge data source according to the extraction method.

Specifically, different forms of data sources correspond to different data extraction methods. For example, semi-structured web page information is usually crawled by web crawlers. For unstructured text, text language is usually used for extraction.

In this embodiment, by analyzing the feedback information of the source of knowledge data, the data form of the source of knowledge data is determined, so that the knowledge data of the source of knowledge data can be extracted by using the correct extraction method.

FIG. 3 is a schematic diagram of a vector generation process of a knowledge fusion method in an embodiment of the present application. As shown in the figure, S2, extracting entity data from any of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector, includes:

S201. Set an initial segment for extracting entity data in the knowledge data, where the initial segment contains at least one entity data;

Specifically, the length of the initial segment is set according to historical data on the length of the entity words in the entity data. For example, if the length of entity words in the historical data stored in the database ranges from 1 to 10, the length of the initial segment is set to the maximum value, 10.

S202. Divide the knowledge data into several initial sub-data blocks according to the segment length of the initial segment. If any one of the initial sub-data blocks contains two or more pieces of entity data, divide that initial sub-data block again to obtain final sub-data blocks each containing only one piece of entity data;

Specifically, when the initial segment is segmented, the length of each sub-data block may be inconsistent, that is, the length of each sub-data block is determined according to the length of actual entity words.
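One possible reading of S201 and S202 is sketched below. It assumes the knowledge data is a string and the entity names are known in advance, and it cuts a block immediately before each entity after the first; these choices, and the name `split_into_blocks`, are assumptions made for illustration. Entities that straddle a block boundary or overlap one another are not handled.

```python
def split_into_blocks(text, entities, segment_len):
    """Split text into fixed-length blocks, then re-split blocks holding 2+ entities."""
    initial = [text[i:i + segment_len] for i in range(0, len(text), segment_len)]
    final = []
    for block in initial:
        # Start positions of the entities found in this block, left to right.
        positions = sorted(block.find(e) for e in entities if e in block)
        if len(positions) < 2:
            final.append(block)
            continue
        # Cut just before every entity after the first one.
        cuts = [0] + positions[1:] + [len(block)]
        final.extend(block[a:b] for a, b in zip(cuts, cuts[1:]))
    return final

blocks = split_into_blocks("basketball and football are sports",
                           ["basketball", "football"], segment_len=30)
print(blocks)
```

As the paragraph above notes, the resulting sub-data blocks have different lengths, since each cut follows the position of an actual entity word.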

S203. Extract the entity data in the final sub-data block, extract the semantic features of that entity data, apply the word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and multiply the initial multi-dimensional word vector by the segment length of the final sub-data block, used as a coefficient, to obtain the final multi-dimensional word vector.

Specifically, semantic features include elements such as semantics, grammar, and structure. The word vector conversion method usually uses the Word2Vec (Word2Vector) algorithm, which links each semantic feature to its context and thereby transforms the related semantic features into an initial multi-dimensional word vector.
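The scaling in S203 can be illustrated as follows. Training a Word2Vec model is outside the scope of a short example, so the sketch substitutes a small table of made-up 4-dimensional vectors; `TOY_EMBEDDINGS` and `final_word_vector` are hypothetical names and the numbers carry no meaning.

```python
import numpy as np

# Made-up embeddings standing in for the output of a trained word vector model.
TOY_EMBEDDINGS = {
    "basketball": np.array([0.2, -0.1, 0.4, 0.3]),
    "football": np.array([0.1, 0.0, 0.5, 0.2]),
}

def final_word_vector(block, entity):
    """Scale the initial word vector by the segment length of its final sub-data block."""
    initial = TOY_EMBEDDINGS[entity]
    return len(block) * initial

vec = final_word_vector("basketball and ", "basketball")
print(vec)
```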

In this embodiment, after vectorization conversion is performed on the entity data, the entity data is numerically represented, which is convenient for using a machine learning method to perform similarity calculation.

In one embodiment, S3, reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data, includes:

Acquiring K nearest neighbors of each sample point in the multi-dimensional word vector;

Specifically, the sample points are the individual points in the multi-dimensional vector. Each sample point in the multi-dimensional space N has directly connected points on the same plane; these points are called its nearest neighbors. The value of K ranges from 1 to n, where n is a non-zero positive integer.

According to the K nearest neighbors of each sample point, the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point is established;

According to the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point, each sample point is mapped to a low-dimensional space, the mapping condition being expressed by a loss function ε(Y), in which: ε(Y) is the value of the loss function, y_ij is the value of the neighbor, y_n is the output vector of the neighbor, W_ij is the element in the local weight matrix, K is the number of neighbors, and N is the number of elements in the output vector. After mapping, a two-dimensional word vector Y = {y_1, y_2, ..., y_N} is obtained. The two-dimensional word vector is transposed to obtain a transposed two-dimensional word vector, and the product of the two-dimensional word vector and the transposed two-dimensional word vector gives the entity data matrix, the elements of which are vectorized entity data.
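The steps of this embodiment (K nearest neighbors, local weights, mapping to a low-dimensional space under a loss function) resemble Locally Linear Embedding, although the text does not name that algorithm. The sketch below is written under that assumption and follows the textbook LLE procedure, which minimizes the sum over all points of the squared distance between each output point and the weighted combination of its neighbors. The function name `lle_embed`, the regularization term `reg`, and the reading of the entity data matrix as `Y @ Y.T` are all assumptions.

```python
import numpy as np

def lle_embed(X, k, d=2, reg=1e-3):
    """Textbook LLE: X is (N samples x D dims); returns the embedding Y (N x d) and weights W."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # K nearest neighbors of sample i (index 0 of the sort is the point itself).
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]
        # Local weights that best reconstruct X[i] from its neighbors, summing to 1.
        G = X[nbrs] - X[i]
        C = G @ G.T
        C += reg * np.trace(C) * np.eye(k)  # regularize in case C is singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()
    # Minimizing the embedding cost amounts to taking the bottom eigenvectors of M.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, vecs = np.linalg.eigh(M)
    Y = vecs[:, 1:d + 1]  # skip the constant eigenvector
    return Y, W

rng = np.random.default_rng(1)
Y, W = lle_embed(rng.normal(size=(30, 5)), k=6)
entity_matrix = Y @ Y.T  # one reading of "vector times its transpose"
print(Y.shape, entity_matrix.shape)
```

Whether the intended product is `Y @ Y.T` (N x N) or `Y.T @ Y` (2 x 2) is not clear from the text; the sketch takes the first reading because it follows the stated order of the factors.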

In this embodiment, it is convenient to match the entity information and the attribute information by reducing the dimension of the multi-dimensional word vector into a two-dimensional word vector.

In one embodiment, in S4, extracting original attribute data in any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining attribute values of the real attribute data, including:

Extracting original attribute data in any of the knowledge data, and discretizing the original attribute data to obtain discrete values of the original attribute data;

Specifically, discretization refers to mapping a finite number of individuals in an infinite space into a finite space, in order to improve the time and space efficiency of the algorithm. Before discretization, the unique() deduplication function can be used to remove duplicate data from the knowledge data, after which the knowledge data is discretized. unique() is a deduplication function supported by development languages such as C++, PHP, and Matlab, or by scientific computing environments; it removes duplicate values from a set, or takes the distinct values from a set.
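In Python, NumPy's `np.unique` plays the role of the unique() function mentioned above, and its `return_inverse` option yields the discretized values directly. The sketch below reads "discretization" as mapping each value to its rank among the distinct values, which is one common interpretation and an assumption here; the name `discretize` is hypothetical.

```python
import numpy as np

def discretize(values):
    """Deduplicate values, then map each original value to its rank among the distinct ones."""
    distinct, ranks = np.unique(values, return_inverse=True)
    return distinct, ranks

distinct, ranks = discretize([1000000, 5, 5, 300, 1000000])
print(distinct.tolist(), ranks.tolist())
```

Large, sparse values are thereby mapped into the compact range 0 to len(distinct) - 1.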

Acquiring the vector dimension corresponding to the original attribute data according to the amount of the original attribute data in the knowledge data;

Among them, the vector dimension of the original attribute data is equal to the quantity of the original attribute data.

Compute the difference between the discrete value and the vector dimension. If the difference is within a preset error threshold, the original attribute data is real attribute data; if the difference is not within the error threshold, redundant attribute data is removed from the original attribute data according to the difference to obtain the real attribute data;

Obtain the vector dimension corresponding to the real attribute data according to the quantity of the real attribute data, and establish a real attribute data vector;

After reducing the dimension of the real attribute data vector to form a real attribute data matrix, the eigenvalue of the real attribute data matrix is obtained, and the eigenvalue is the attribute value.

Specifically, after reducing the dimension of the real attribute data, a two-dimensional attribute vector is obtained; the two-dimensional attribute vector is transposed to obtain a transposed two-dimensional attribute vector, and the two-dimensional attribute vector is multiplied by the transposed two-dimensional attribute vector to obtain the real attribute data matrix.
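As a small worked example of this step, the sketch below forms the product of a two-dimensional attribute vector with its transpose and computes the eigenvalues of the result. It assumes the vector is treated as a column so that the product is a 2 x 2 matrix; the text does not say which eigenvalue serves as the attribute value, so both are returned. The name `attribute_eigenvalues` is hypothetical.

```python
import numpy as np

def attribute_eigenvalues(v):
    """Build the matrix v v^T from a column vector v and return it with its eigenvalues."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)  # column vector
    M = v @ v.T
    # eigvalsh suits symmetric matrices and returns eigenvalues in ascending order.
    return M, np.linalg.eigvalsh(M)

M, eigvals = attribute_eigenvalues([3.0, 4.0])
print(M)
print(eigvals)
```

For a matrix of this form, the only non-zero eigenvalue equals the squared length of the vector (here 3^2 + 4^2 = 25).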

In this embodiment, by reducing the dimension of the original attribute data and performing matrix processing, the real attribute value is better obtained.

In one embodiment, S5, inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after the parameters are output, comparing the credibility with a preset credibility threshold, and fusing the extracted original attribute data if the credibility is greater than the credibility threshold and otherwise not fusing it, includes:

Obtain any element in the entity data matrix and the attribute value of any real attribute data, and input the element and the attribute value as parameters into a similarity distance function L(m_1, m_2) to calculate the similarity distance, where L(m_1, m_2) is the similarity distance function, m_1 is the element, and m_2 is the attribute value;

Based on the similarity distance, calculate the credibility of the element and the attribute value with a credibility function Crd(m), where Crd(m) is the credibility function and L(m_1, m_2) is the similarity distance function; compare the credibility with the preset credibility threshold; if the credibility is greater than the credibility threshold, the original attribute data corresponding to the same extracted entity data is fused, otherwise it is not fused.

Specifically, the cosine algorithm or the Euclidean distance algorithm can also be used in the similarity calculation. The credibility threshold is obtained based on historical data statistics.
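The two alternative similarity measures mentioned here, together with the threshold comparison, can be sketched as follows. The sketch assumes the entity and the attribute are each represented by a short numeric vector, and it does not attempt to reproduce the application's own similarity distance or credibility functions. The function names are hypothetical, and the default threshold of 0.95 is taken from the 95% figure given earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def should_fuse(credibility, threshold=0.95):
    """Fuse only when the credibility is strictly greater than the threshold."""
    return credibility > threshold

# Assumption for illustration: the cosine similarity is used directly as the credibility.
score = cosine_similarity([1.0, 2.0], [1.0, 2.1])
print(should_fuse(score))
```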

In this embodiment, by calculating the credibility of the entity data and attribute data, the accuracy of attribute data fusion is improved.

In one embodiment, in S103, an extraction method corresponding to the type of the knowledge data source is obtained, and extracting several pieces of knowledge data of the knowledge data source according to the extraction method includes:

If the form of the knowledge data source is a web page, extraction is performed with a web crawler tool, including:

Obtaining the keyword group in the task queue of knowledge data to be extracted, the keyword group containing multiple keywords. The keyword group in the task queue may be a category term such as "ball", and the keywords included under this keyword group may include "basketball", "football", "table tennis", and so on.

Traversing the keyword group, and crawling, through a web crawler, the web page information corresponding to each keyword in the keyword group; obtaining all entity information in the web page information and importing the entity information into a preset knowledge data table; if one or more pieces of entity information cannot be imported into the preset knowledge data table, the web page is crawled again through the web crawler, otherwise the web page information is used as the knowledge data.

Specifically, the entity information refers to information related to the "entity", such as the name of the entity. When importing into the preset knowledge data table, the entity names in the preset knowledge data table are retrieved first; if the entity name in a piece of entity information is not in the preset knowledge data table, that entity information cannot be imported. The preset knowledge data table is stored in the database and is compiled from all previously collected knowledge data.
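The traversal and import check can be sketched without any real network access by passing the crawler and the entity extractor in as functions. Everything named here (`collect_knowledge`, `fetch_page`, `extract_entities`, `known_names`, `max_retries`) is hypothetical, and the bounded retry count is an added assumption, since the text only says the page is crawled again.

```python
def collect_knowledge(keyword_group, fetch_page, extract_entities, known_names, max_retries=2):
    """Crawl one page per keyword; keep it only if every entity name is already known."""
    results = {}
    for keyword in keyword_group:
        for _ in range(max_retries + 1):
            page = fetch_page(keyword)
            entities = extract_entities(page)
            # "Importable" means the entity name exists in the preset knowledge data table.
            if all(name in known_names for name in entities):
                results[keyword] = page
                break
    return results

# Stub crawler and extractor, for illustration only.
pages = {"basketball": "NBA basketball", "football": "FIFA football"}
found = collect_knowledge(
    ["basketball", "football"],
    fetch_page=lambda kw: pages[kw],
    extract_entities=lambda page: [page.split()[0]],
    known_names={"NBA", "FIFA"},
)
print(found)
```

Injecting `fetch_page` keeps the control flow testable; in practice it would wrap whatever web crawler tool is in use.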

In this embodiment, the required knowledge data can be effectively extracted from the web page information, and the efficiency of knowledge data extraction can be improved.

In one embodiment, a knowledge fusion device is proposed. As shown in FIG. 4, it includes the following modules:

The data acquisition module 41 is configured to acquire several knowledge data from the source of knowledge data;

The vector generation module 42 is configured to extract entity data from any of the knowledge data, convert the entity data into vectors, and generate multidimensional word vectors;

The data vectorization module 43 is configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, and to transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;

The attribute value obtaining module 44 is set to extract original attribute data in any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain attribute values of the real attribute data;

The fusion determination module 45 is configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into the credibility recognition model, obtain the credibility of the knowledge data after the parameters are output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused, otherwise it is not fused.

In one embodiment, the data acquisition module is further configured to:

Send a knowledge data extraction instruction to the knowledge data source from which knowledge data is to be extracted; receive feedback information from the knowledge data source, extract keywords of the data source type from the feedback information, and determine the type of the knowledge data source according to the keywords; obtain the extraction method corresponding to the type of the knowledge data source, and extract several pieces of knowledge data from the knowledge data source according to the extraction method.

In one embodiment, the vector generation module is further configured to:

Set an initial segment for extracting entity data in the knowledge data, the initial segment containing at least one piece of entity data; divide the knowledge data into several initial sub-data blocks according to the segment length of the initial segment, and if any one of the initial sub-data blocks contains two or more pieces of entity data, divide that initial sub-data block again to obtain final sub-data blocks each containing only one piece of entity data; extract the entity data in the final sub-data block, extract the semantic features of that entity data, apply the word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and multiply the initial multi-dimensional word vector by the segment length of the final sub-data block, used as a coefficient, to obtain the final multi-dimensional word vector.

In one embodiment, the data vectorization module is further configured to:

Obtain the K nearest neighbors of each sample point in the multi-dimensional word vector; according to the K nearest neighbors of each sample point, establish the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point;

According to the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point, map each sample point to a low-dimensional space, obtaining the two-dimensional word vector Y = {y_1, y_2, ..., y_N} after the mapping; transpose the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiply the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data.

In one embodiment, the attribute value obtaining module is further configured to:

Extract the original attribute data in any one of the knowledge data, and discretize the original attribute data to obtain the discrete value of the original attribute data; obtain the vector dimension corresponding to the original attribute data according to the quantity of the original attribute data in the knowledge data; compute the difference between the discrete value and the vector dimension, where, if the difference is within a preset error threshold, the original attribute data is real attribute data, and if the difference is not within the error threshold, redundant attribute data is removed from the original attribute data according to the difference to obtain the real attribute data; obtain the vector dimension corresponding to the real attribute data according to the quantity of the real attribute data, and establish a real attribute data vector.

In one embodiment, the fusion determination module is further configured to:

Obtain any element in the entity data matrix and the attribute value of any real attribute data, and input the element and the attribute value as parameters into a similarity distance function to calculate a similarity distance; calculate the credibility of the element and the attribute value according to the similarity distance; compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the original attribute data corresponding to the same extracted entity data is fused, otherwise it is not fused.

In one embodiment, the vector generation module is further configured to:

Obtain the keyword group in the task queue of pre-extracted knowledge data, the keyword group containing multiple keywords; traverse the keyword group, and crawl, through a web crawler, the information on the web page corresponding to each keyword in the keyword group; obtain all the entity information in the information on the web page, and import the entity information into a preset knowledge data table; if one or more items of entity information cannot be imported into the preset knowledge data table, crawl the web page again through the web crawler; otherwise, use the web page information as the knowledge data.
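A minimal sketch of this crawl-and-import loop is given below. No real crawler library is used: `fetch_page`, `extract_entities`, and `import_entity` are hypothetical callables supplied by the caller, and the `max_retries` bound is an added assumption, since the text does not say how many times a page may be crawled again.

```python
def collect_knowledge_data(keywords, fetch_page, extract_entities, import_entity,
                           max_retries=3):
    """Crawl one page per keyword; crawl again when any entity fails to import."""
    knowledge_data = {}
    for keyword in keywords:                       # traverse the keyword group
        for _ in range(max_retries):
            page = fetch_page(keyword)             # hypothetical crawler call
            entities = extract_entities(page)
            if all(import_entity(e) for e in entities):
                knowledge_data[keyword] = page     # page info becomes knowledge data
                break
    return knowledge_data

# Stub collaborators for illustration only.
attempts = {"count": 0}
table = []

def fake_fetch(keyword):
    attempts["count"] += 1
    # The first crawl for "apple" returns a malformed record; later crawls succeed.
    if keyword == "apple" and attempts["count"] == 1:
        return "apple|"
    return f"{keyword}|fruit"

def fake_extract(page):
    return [tuple(page.split("|"))]

def fake_import(entity):
    name, kind = entity
    if not kind:               # an incomplete record cannot be imported
        return False
    table.append(entity)
    return True

result = collect_knowledge_data(["apple", "pear"], fake_fetch, fake_extract, fake_import)
```

In the stub run, the first page for "apple" fails to import, so it is crawled a second time before being accepted as knowledge data.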

In one embodiment, a computer device is proposed. The computer device includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the knowledge fusion method described in the above embodiments.

In one embodiment, a storage medium storing computer-readable instructions is proposed. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the knowledge fusion method described in the above embodiments. The storage medium may be a non-volatile storage medium.

A person of ordinary skill in the art may understand that all or part of the steps in the various methods of the foregoing embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

The technical features of the above-mentioned embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above-mentioned embodiments are described; such combinations should nevertheless be considered to fall within the scope described in this specification.

The above-mentioned embodiments express only some exemplary embodiments of the present application. Although their description is relatively specific and detailed, it should not be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art can make a number of modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims

A knowledge fusion method, including:

Obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after parameter output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
The knowledge fusion method according to claim 1, wherein the obtaining several pieces of knowledge data from a knowledge data source includes:

Sending a knowledge data extraction instruction to the knowledge data source of the knowledge data to be extracted; receiving feedback information from the knowledge data source, extracting keywords indicating the data source type from the feedback information, and determining the type of the knowledge data source according to the keywords; obtaining the extraction method corresponding to the type of the knowledge data source, and extracting several pieces of knowledge data from the knowledge data source according to the extraction method.
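For illustration only, the type detection and method selection recited above might be organized as a keyword lookup followed by a dispatch table. The keyword table, the source type names, and the stub extraction functions below are all hypothetical; the application does not specify the keywords or the extraction methods.

```python
# Hypothetical keyword table: substrings of the feedback that indicate a source type.
TYPE_KEYWORDS = {
    "text/html": "web_page",
    "application/json": "json_api",
    "jdbc": "database",
}

def detect_source_type(feedback):
    """Return the source type named by the first matching keyword, or None."""
    lowered = feedback.lower()
    for keyword, source_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return source_type
    return None

# Stub extraction methods, one per source type.
def extract_from_web_page(source):
    return [f"crawled:{source}"]

def extract_from_json_api(source):
    return [f"fetched:{source}"]

def extract_from_database(source):
    return [f"queried:{source}"]

EXTRACTION_METHODS = {
    "web_page": extract_from_web_page,
    "json_api": extract_from_json_api,
    "database": extract_from_database,
}

def extract_knowledge(source, feedback):
    """Pick the extraction method that matches the detected source type."""
    source_type = detect_source_type(feedback)
    if source_type is None:
        raise ValueError("unrecognized knowledge data source type")
    return EXTRACTION_METHODS[source_type](source)

items = extract_knowledge("https://example.org/wiki",
                          "Content-Type: text/html; charset=utf-8")
```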
The knowledge fusion method according to claim 1, wherein the extracting entity data from any of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector includes:

Setting an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one item of entity data; dividing the knowledge data into several initial sub-data blocks according to the segment length of the initial segment; if any one of the initial sub-data blocks contains two or more items of entity data, dividing that initial sub-data block again to obtain final sub-data blocks each containing only one item of entity data; extracting the entity data in the final sub-data block, extracting the semantic features of the entity data in the final sub-data block, applying a word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and multiplying the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
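The segmentation and scaling recited above can be sketched roughly as follows. Several details are assumptions made for the example: the knowledge data is treated as a token list, entities are recognized by membership in a known set, blocks with more than one entity are halved repeatedly, and blocks with no entity are discarded. The names `split_into_final_blocks` and `final_word_vector` are hypothetical, and the initial word vector is simply passed in rather than computed.

```python
def split_into_final_blocks(tokens, entities, segment_length):
    """Split tokens into blocks that each contain exactly one known entity."""
    def count(block):
        return sum(1 for t in block if t in entities)

    # Initial sub-data blocks of the initial segment length.
    pending = [tokens[i:i + segment_length]
               for i in range(0, len(tokens), segment_length)]
    final_blocks = []
    while pending:
        block = pending.pop(0)
        if count(block) >= 2:
            mid = len(block) // 2
            pending[:0] = [block[:mid], block[mid:]]   # divide again, keep order
        elif count(block) == 1:
            final_blocks.append(block)
    return final_blocks

def final_word_vector(initial_vector, block):
    """Scale the initial word vector by the final block's segment length."""
    return [len(block) * x for x in initial_vector]

tokens = "Alice works at Acme and Bob works at Initech".split()
entities = {"Alice", "Acme", "Bob", "Initech"}
blocks = split_into_final_blocks(tokens, entities, segment_length=4)
scaled = final_word_vector([0.5, -1.0], blocks[1])
```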
The knowledge fusion method according to claim 1, wherein the reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data, includes:

Obtaining the K nearest neighbors of each sample point in the multi-dimensional word vector; establishing, according to the K nearest neighbors of each sample point, a local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point; mapping each sample point to a low-dimensional space according to the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point, the mapping condition being:

Where: ε(Y) is the value of the loss function, y_ij is the value of the nearest neighbor, y_n is the output vector of the nearest neighbor, w_ij is an element in the local weight matrix, K is the number of neighbors, and N is the number of elements in the output vector of the neighbors; after the mapping, a two-dimensional word vector Y = {y_1, y_2, ..., y_N} is obtained; the two-dimensional word vector is transposed to obtain a transposed two-dimensional word vector, and the two-dimensional word vector is multiplied by the transposed two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data.
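The formula for the mapping condition is not present in this text. For orientation only, the standard locally linear embedding objective, written with the symbols defined above, would take the form below. This is an assumption based on the symbol definitions, not a restoration of the application's own formula; the sample index is written as i, where the text writes the output vector as y_n.

```latex
\varepsilon(Y) \;=\; \sum_{i=1}^{N} \Bigl\| \, y_i - \sum_{j=1}^{K} w_{ij}\, y_{ij} \Bigr\|^{2}
```

In the standard formulation, Y is chosen to minimize ε(Y) with the weights w_ij held fixed, where y_ij denotes the low-dimensional image of the j-th nearest neighbor of sample i.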
The knowledge fusion method according to claim 1, wherein the extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data includes:

Extracting the original attribute data from any one piece of the knowledge data, and discretizing the original attribute data to obtain a discrete value of the original attribute data; obtaining the vector dimension corresponding to the original attribute data according to the number of items of original attribute data in the knowledge data; computing the difference between the discrete value and the vector dimension; if the difference is within a preset error threshold, the original attribute data is the real attribute data; if the difference is not within the error threshold, removing the redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; establishing a real attribute data vector according to the vector dimension corresponding to the number of items of real attribute data; reducing the dimension of the real attribute data vector to form a real attribute data matrix, and obtaining the eigenvalues of the real attribute data matrix, the eigenvalues being the attribute values.
The knowledge fusion method according to claim 1, wherein the inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after parameter output, comparing the credibility with a preset credibility threshold, and fusing the extracted original attribute data if the credibility is greater than the credibility threshold, otherwise not fusing, includes:

Obtaining any element in the entity data matrix and the attribute value of any real attribute data, and inputting the element and the attribute value into a similarity distance function to calculate a similarity distance, the calculation formula being:

Where: L(m_1, m_2) is the similarity distance function, m_1 is the element, and m_2 is the attribute value; calculating the credibility of the element and the attribute value according to the similarity distance, the calculation formula being:

Where: Crd(m) is the credibility function, and L(m_1, m_2) is the similarity distance function; comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fusing the original attribute data corresponding to the same extracted entity data; otherwise, not fusing.
The knowledge fusion method according to claim 2, wherein the obtaining the extraction method corresponding to the type of the knowledge data source and extracting several pieces of knowledge data from the knowledge data source according to the extraction method includes: if the knowledge data source is in the form of a web page, performing the extraction with a web crawler tool, including: obtaining a keyword group in a task queue of pre-extracted knowledge data, the keyword group containing multiple keywords; traversing the keyword group, and crawling, through a web crawler, the information on the web page corresponding to each keyword in the keyword group; obtaining all the entity information in the information on the web page, and importing the entity information into a preset knowledge data table; if one or more items of entity information cannot be imported into the preset knowledge data table, crawling the web page again through the web crawler; otherwise, using the web page information as the knowledge data.
A knowledge fusion device, including:

A data acquisition module, configured to obtain several pieces of knowledge data from a knowledge data source;

A vector generation module, configured to extract entity data from any of the knowledge data and vectorize the entity data to generate a multi-dimensional word vector;

A data vectorization module, configured to reduce the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transpose the two-dimensional word vector and multiply it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data;

An attribute value acquisition module, configured to extract original attribute data from any of the knowledge data, filter the original attribute data to obtain real attribute data, and obtain the attribute values of the real attribute data;

A fusion determination module, configured to input the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtain the credibility of the knowledge data after parameter output, and compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, the extracted original attribute data is fused; otherwise, it is not fused.
The knowledge fusion device according to claim 8, wherein the vector generation module is further configured to:

Send a knowledge data extraction instruction to the knowledge data source of the knowledge data to be extracted; receive feedback information from the knowledge data source, extract keywords indicating the data source type from the feedback information, and determine the type of the knowledge data source according to the keywords; obtain the extraction method corresponding to the type of the knowledge data source, and extract several pieces of knowledge data from the knowledge data source according to the extraction method.
The knowledge fusion device according to claim 8, wherein the data acquisition module is further configured to:

Set an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one item of entity data; divide the knowledge data into several initial sub-data blocks according to the segment length of the initial segment; if any one of the initial sub-data blocks contains two or more items of entity data, divide that initial sub-data block again to obtain final sub-data blocks each containing only one item of entity data; extract the entity data in the final sub-data block, extract the semantic features of the entity data in the final sub-data block, apply a word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and multiply the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
The knowledge fusion device according to claim 8, wherein the data vectorization module is further configured to:

Obtain the K nearest neighbors of each sample point in the multi-dimensional word vector; establish, according to the K nearest neighbors of each sample point, a local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point;

Map each sample point to a low-dimensional space according to the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point, and obtain the two-dimensional word vector Y = {y_1, y_2, ..., y_N} after the mapping; transpose the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiply the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data.
The knowledge fusion device according to claim 8, wherein the attribute value acquisition module is further configured to:

Extract the original attribute data from any one piece of the knowledge data, and discretize the original attribute data to obtain a discrete value of the original attribute data; obtain the vector dimension corresponding to the original attribute data according to the number of items of original attribute data in the knowledge data; compute the difference between the discrete value and the vector dimension; if the difference is within a preset error threshold, the original attribute data is the real attribute data; if the difference is not within the error threshold, remove the redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; establish a real attribute data vector according to the vector dimension corresponding to the number of items of real attribute data;

Reduce the dimension of the real attribute data vector to form a real attribute data matrix, and obtain the eigenvalues of the real attribute data matrix, the eigenvalues being the attribute values.
The knowledge fusion device according to claim 8, wherein the fusion determination module is further configured to:

Obtain any element in the entity data matrix and the attribute value of any real attribute data, and input the element and the attribute value into a similarity distance function to calculate a similarity distance; calculate the credibility of the element and the attribute value according to the similarity distance; compare the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fuse the original attribute data corresponding to the same extracted entity data; otherwise, do not fuse.
The knowledge fusion device according to claim 9, wherein the vector generation module is further configured to:

Obtain the keyword group in the task queue of pre-extracted knowledge data, the keyword group containing multiple keywords; traverse the keyword group, and crawl, through a web crawler, the information on the web page corresponding to each keyword in the keyword group; obtain all the entity information in the information on the web page, and import the entity information into a preset knowledge data table; if one or more items of entity information cannot be imported into the preset knowledge data table, crawl the web page again through the web crawler; otherwise, use the web page information as the knowledge data.
A computer device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:

Obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after parameter output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

Obtaining several pieces of knowledge data from a knowledge data source; extracting entity data from any of the knowledge data, and vectorizing the entity data to generate a multi-dimensional word vector; reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data; extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data; inputting the elements in the entity data matrix and the attribute values of the real attribute data as parameters into a credibility recognition model, obtaining the credibility of the knowledge data after parameter output, and comparing the credibility with a preset credibility threshold; if the credibility is greater than the credibility threshold, fusing the extracted original attribute data; otherwise, not fusing.
The storage medium storing computer-readable instructions according to claim 16, wherein, when obtaining several pieces of knowledge data from a knowledge data source, the one or more processors are caused to perform the following steps:

Sending a knowledge data extraction instruction to the knowledge data source of the knowledge data to be extracted; receiving feedback information from the knowledge data source, extracting keywords indicating the data source type from the feedback information, and determining the type of the knowledge data source according to the keywords; obtaining the extraction method corresponding to the type of the knowledge data source, and extracting several pieces of knowledge data from the knowledge data source according to the extraction method.
The storage medium storing computer-readable instructions according to claim 16, wherein, when extracting entity data from any of the knowledge data and vectorizing the entity data to generate a multi-dimensional word vector, the one or more processors are caused to perform the following steps:

Setting an initial segment for extracting entity data from the knowledge data, the initial segment containing at least one item of entity data; dividing the knowledge data into several initial sub-data blocks according to the segment length of the initial segment; if any one of the initial sub-data blocks contains two or more items of entity data, dividing that initial sub-data block again to obtain final sub-data blocks each containing only one item of entity data; extracting the entity data in the final sub-data block, extracting the semantic features of the entity data in the final sub-data block, applying a word vector conversion method to convert the semantic features into an initial multi-dimensional word vector, and multiplying the initial multi-dimensional word vector by the segment length of the final sub-data block as a coefficient to obtain the final multi-dimensional word vector.
The storage medium storing computer-readable instructions according to claim 16, wherein, when reducing the dimension of the multi-dimensional word vector to obtain a two-dimensional word vector, transposing the two-dimensional word vector and multiplying it by the original two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data, the one or more processors are caused to perform the following steps:

Obtaining the K nearest neighbors of each sample point in the multi-dimensional word vector; establishing, according to the K nearest neighbors of each sample point, a local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point; mapping each sample point to a low-dimensional space according to the local weight matrix W_i = {w_i1, w_i2, ..., w_iK} of each sample point, and obtaining the two-dimensional word vector Y = {y_1, y_2, ..., y_N} after the mapping; transposing the two-dimensional word vector to obtain a transposed two-dimensional word vector, and multiplying the two-dimensional word vector by the transposed two-dimensional word vector to obtain an entity data matrix, the elements in the entity data matrix being vectorized entity data.
The storage medium storing computer-readable instructions according to claim 16, wherein, when extracting original attribute data from any of the knowledge data, filtering the original attribute data to obtain real attribute data, and obtaining the attribute values of the real attribute data, the one or more processors are caused to perform the following steps:

Extracting the original attribute data from any one piece of the knowledge data, and discretizing the original attribute data to obtain a discrete value of the original attribute data; obtaining the vector dimension corresponding to the original attribute data according to the number of items of original attribute data in the knowledge data; computing the difference between the discrete value and the vector dimension; if the difference is within a preset error threshold, the original attribute data is the real attribute data; if the difference is not within the error threshold, removing the redundant attribute data from the original attribute data according to the difference to obtain the real attribute data; establishing a real attribute data vector according to the vector dimension corresponding to the number of items of real attribute data; reducing the dimension of the real attribute data vector to form a real attribute data matrix, and obtaining the eigenvalues of the real attribute data matrix, the eigenvalues being the attribute values.