WO2020114022A1

WO2020114022A1 - Knowledge base alignment method and apparatus, computer device and storage medium

Info

Publication number: WO2020114022A1
Application number: PCT/CN2019/103487
Authority: WO
Inventors: 吴壮伟
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-12-04
Filing date: 2019-08-30
Publication date: 2020-06-11
Also published as: CN109783582A; CN109783582B

Abstract

Disclosed in the embodiments of the present application are a knowledge base alignment method and apparatus, a computer device and a storage medium. The method comprises the following steps: obtaining a knowledge entity vector set, wherein the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned; inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in said knowledge base; selecting any two knowledge entities of the same type according to the clustering result, and calculating the similarity between the two knowledge entities; and if the similarity is greater than a set first threshold, combining the two knowledge entities. The comparison for the similarity between two knowledge entities is limited to the entities of the same type, so the calculation amount is greatly reduced; during clustering, the clustering is achieved by means of an artificial intelligence technology, so that the clustering result is more in line with the expectation; the similarity calculation integrates the attribute similarity and vector similarity of the entities, thus the similarity calculation is more reasonable, and redundant information can be more effectively found and removed.

Description

Knowledge base alignment method, device, computer equipment and storage medium

This application is based on the Chinese invention patent application with the application number 201811474699X filed on December 4, 2018, titled "A Knowledge Base Alignment Method, Device, Computer Equipment, and Storage Media", and claims its priority.

【Technical Field】

This application relates to the technical field of knowledge base processing, and in particular, to a knowledge base alignment method, device, computer equipment, and storage medium.

【Background technique】

With the development of the Internet, more and more knowledge bases have been constructed in various fields, and these knowledge bases are also widely used in Internet applications such as search services and automatic question answering. The knowledge base has a positive meaning for the sharing and dissemination of information. However, the information of a single knowledge base is limited, and in some cases it cannot meet the needs of users; and usually the knowledge base is continuously expanded, and the scale of occupied storage resources continues to expand, but the data continuously expanded into the knowledge base may be redundant In addition, this redundancy causes a waste of storage resources. At the same time, it also increases the amount of search calculations and duplicates the search result information, causing inconvenience to users.

Knowledge Base (Alignment) refers to finding entities belonging to the same thing in reality for each entity from different sources. The entity here refers to things that exist objectively and can be distinguished from each other, including concrete people, things, things, abstract concepts, relationships. Therefore, knowledge base alignment, that is, extracting entity information and removing redundancy, is a key issue in building a high-quality knowledge base.

The common method of knowledge base alignment is to use the attribute information of the entity to determine whether different source entities can be aligned. Since the data of different entities belongs to the type of user-generated content (User Generated Content, UGC), the quality of the data edited by different users is uneven, only through The entity attribute information edited by the user is difficult to accurately determine whether it is the same entity.

[Invention content]

Embodiments of the present application provide a knowledge base alignment method, apparatus, computer equipment, and storage medium.

A knowledge base alignment method, the knowledge base alignment method includes:

Acquiring a set of knowledge entity vectors, where the set of knowledge entity vectors is a vectorized representation of knowledge entities in the knowledge base to be aligned;

Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

According to the clustering result, select any two knowledge entities belonging to the same class, and calculate the similarity between the two knowledge entities;

When the similarity is greater than the set first threshold, the two knowledge entities are merged.

A knowledge base alignment device. The knowledge base alignment device includes:

An acquisition module, for acquiring a vector set of knowledge entities, wherein the vector set of knowledge entities is a vectorized representation of the knowledge entities in the knowledge base to be aligned;

A processing module, configured to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

A calculation module, used to select any two knowledge entities belonging to the same class according to the clustering result, and calculate the similarity between the two knowledge entities;

The execution module is configured to merge the two knowledge entities when the similarity is greater than the set first threshold.

A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing any of the above knowledge when executing the computer-readable instructions The steps of the library alignment method.

A readable storage medium storing computer-readable instructions that when executed by a processor implements the steps of any of the above knowledge base alignment methods.

The details of one or more embodiments of the present application are set forth in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings, and claims.

【Explanation】

In order to more clearly explain the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings required in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, without paying any creative work, other drawings can be obtained based on these drawings.

FIG. 1 is a schematic flowchart of a knowledge base alignment method according to an embodiment of this application;

2 is a schematic diagram of vectorization of knowledge entities based on an IF-IDF algorithm according to an embodiment of the present application;

3 is a schematic diagram of a training process of a clustering model based on a convolutional neural network according to an embodiment of the present application;

4 is a schematic diagram of a calculation process of similarity of knowledge entities according to an embodiment of the present application;

5 is a schematic diagram of a process of merging knowledge entities according to an embodiment of the present application;

6 is a block diagram of a basic structure of a knowledge base alignment device according to an embodiment of the present application;

7 is a block diagram of the basic structure of a computer device for implementing this application.

【detailed description】

In order to enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the drawings in the embodiments of the present application.

Some processes described in the specification and claims of this application and the above drawings include multiple operations in a specific order, but it should be clearly understood that these operations may not be in the order in which they appear in this document Execution or parallel execution. The sequence numbers of operations such as 101 and 102 are only used to distinguish different operations. The sequence number itself does not represent any execution sequence. In addition, these processes may include more or fewer operations, and these operations may be performed sequentially or in parallel. It should be noted that the descriptions of "first", "second", etc. in this article are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, nor limit "first" and "second". Are different types.

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without making creative work fall within the protection scope of the present application.

Examples

Those skilled in the art can understand that the "terminal" and "terminal device" used herein include both wireless signal receiver devices, which only have wireless signal receiver devices without transmitting capabilities, and also include hardware for receiving and transmitting hardware. A device having a device capable of performing receiving and transmitting hardware for bidirectional communication on a bidirectional communication link. Such devices may include: cellular or other communication devices with single-line displays or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Services), which can combine voice and data Processing, fax and/or data communication capabilities; PDA (Personal Digital Assistant), which can include radio frequency receivers, pagers, Internet/Intranet access, web browsers, notepads, calendars and/or GPS (Global Positioning System (Global Positioning System) receiver; conventional laptop and/or palmtop computer or other device that has and/or includes a conventional radio frequency receiver and/or palmtop computer or other device. As used herein, "terminal" and "terminal equipment" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or terrestrial), or adapted and/or configured to operate locally, and/or In a distributed form, it operates at any other location on the earth and/or space. The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, a music/video playback terminal, for example, may be a PDA, MID (Mobile Internet Device), and/or have music/video playback Functional mobile phones can also be smart TVs, set-top boxes and other devices.

The terminal in this embodiment is the above-mentioned terminal.

Specifically, please refer to FIG. 1, which is a schematic diagram of a basic process of a knowledge base alignment method according to this embodiment.

As shown in Figure 1, a knowledge base alignment method includes the following steps:

S101. Acquire a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned;

The knowledge entities stored in the knowledge base are usually text or pictures. When aligning the knowledge entities, it is usually necessary to calculate the similarity between the knowledge entities. In order to facilitate computer processing and understanding, the knowledge entities need to be converted into vectors. For example, the vectorized representation of text is realized by a vector space model, also known as a bag of words model. The simplest mode is one-hot encoding based on words, using each word as the dimension key. Some words correspond to position 1, others are 0, and vector length is the same as dictionary size.

S102: Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

The vector set representing the knowledge entity is input into a preset knowledge entity clustering model. Among them, the clustering model of knowledge entities adopts a density-based clustering algorithm. The density-based clustering algorithm does not need to determine the data of clusters in advance. It can find clusters of any shape, can identify noise points, and compares with outliers. Good robustness can detect outliers. DBSCAN is one of the most typical representative algorithms in this type of method. Its core idea is to first find the points with higher density, and then gradually connect the similar high-density points into one piece, and then generate various clusters. The specific algorithm implementation: for each data point is the center of the circle, draw a circle with eps as the radius (called the neighborhood eps-neigbourhood), and then count how many points are in this circle, and this number is the density value of the point. Then select a density threshold MinPts. For example, the center point of the circle less than MinPts is a low-density point, and the center point of MinPts is greater than or equal to the high-density point (called Core point). If there is a high-density point in the circle of another high-density point, we will connect these two points, so that we can continuously connect many points in series. After that, if there is a low-density point in the circle of high-density points, connect it to the nearest high-density point, which is called the boundary point. In this way, all the points that can be connected together become a cluster, and the low-density points that are not in the circle of any high-density points are abnormal points.

In some embodiments, the trained convolutional neural network model is used to implement clustering, and the convolutional neural network model can be used to train the convolutional neural network model to manually learn the characteristics of the training sample, so that the convolutional neural network model can predict the knowledge entity Perform clustering.

S103. According to the clustering result, select any two knowledge entities belonging to the same class, and calculate the similarity between the two knowledge entities;

Through step S102, the knowledge entities in the knowledge base are clustered, and then in the same category, the similarity of any two knowledge entities is calculated to determine whether there are redundant entities, which narrows the scope of the comparison of knowledge entities , Which reduces the amount of calculation and improves the efficiency of determining whether redundant entities exist.

The similarity of two knowledge entities is obtained by calculating the similarity between vectors representing two knowledge entities. The similarity between two vectors may be a cosine similarity. Cosine similarity measures the similarity between two vectors by measuring the cosine of the angle between the two vectors. The cosine value of an angle of 0 degrees is 1, and the cosine value of any other angle is not greater than 1; and its minimum value is -1. Thus the cosine of the angle between the two vectors determines whether the two vectors are pointing in roughly the same direction. When the two vectors have the same direction, the cosine similarity value is 1; when the angle between the two vectors is 90°, the cosine similarity value is 0; when the two vectors point in completely opposite directions, the cosine similarity value Is -1. This result has nothing to do with the length of the vector, only the direction of the vector. Cosine similarity is applicable to any dimension vector space, and is often used in high-dimensional positive space, so it is suitable for comparison of text files.

You can also measure the similarity between two vectors by calculating the Euclidean distance between the vectors. In order to avoid the influence of scale, first normalize the vector, and then find the distance between two points X ₁ and X ₂ in the vector space according to the following formula:

Where x _1i and x _2i are X ₁ , and X _{2 is} the value of each dimension after normalization.

S104. When the similarity is greater than the set first threshold, merge the two knowledge entities.

A threshold is set in advance, which is called the first threshold here. When the similarity of the two knowledge entities is greater than the set first threshold, the content of the two knowledge entities is considered to be partially repeated, and the two knowledge entities are merged into one entity.

As shown in Figure 2, before S101, it also includes steps:

S111. Acquire knowledge entities in the knowledge base to be aligned;

The knowledge entity can be obtained by accessing the server where the knowledge base is located. The knowledge entity can belong to the same knowledge repository, or can come from multiple knowledge bases.

S112. Vectorize the knowledge entity based on the IF-IDF algorithm to obtain the knowledge entity vector set.

To vectorize knowledge entities, in addition to the aforementioned vectorization based on the bag-of-words model, it is also possible to vectorize knowledge entities based on the IF-IDF algorithm. TF-IDF is a statistical method used to evaluate the importance of a word to a document in a document set or a corpus. The importance of a word increases proportionally with the number of times it appears in the document, but at the same time it decreases inversely with the frequency of its appearance in the corpus. TF-IDF is actually: TF*IDF, TF (Term Frequency), IDF (Inverse Document Frequency), inverse document frequency. TF indicates the frequency of entries in document d. Use TF-IDF to vectorize the text, and also construct a dictionary, using the TF-IDF value of each word as the weight of the word.

As shown in FIG. 3, the training of the clustering model based on the convolutional neural network includes the following steps:

S121. Obtain a training sample marked with cluster judgment information, and the cluster judgment information of the training sample is the category of the sample knowledge entity;

In the embodiment of the present application, the training goal of the convolutional neural network is to identify the category to which the knowledge entity belongs, and the convolutional neural network model implements the function of clustering the knowledge entity by manually labeling the characteristics of the category in the training learning sample.

S122. Input the training sample into a convolutional neural network model to obtain model clustering reference information of the training sample;

The convolutional neural network model is composed of: convolutional layer, pooling layer, fully connected and classification layer. Among them, the convolutional layer is used to locally sense the knowledge entity vector, and the convolutional layer is usually connected in a cascade manner, and the later the convolutional layer in the cascade can sense the more global information.

The fully connected layer acts as a "classifier" in the entire convolutional neural network. If the operations of convolutional layer, pooling layer and activation function layer are to map the original data to the hidden layer feature space, the fully connected layer plays the role of mapping the learned "distributed feature representation" to the sample label space . The fully connected layer is connected to the output position of the convolutional layer and can perceive the global characteristics of the knowledge entity vector.

The training samples are input into the convolutional neural network model, and the input clustering reference information of the convolutional neural network model is obtained.

S123. Compare the model clustering reference information of different samples in the training sample with the clustering judgment information by using a loss function;

By judging whether the clustering reference information and the sample-labeled clustering information are consistent through the loss function comparison, the softmax cross-entropy loss function used in the embodiment of the present application is specifically as follows:

Suppose there are N training samples, the input feature for the i-th sample in the last layer of the network is X _i , and its corresponding label is Y _i is the final classification result, h=(h1,h2,...,hc) is The final output of the network is the prediction result of sample i. Where C is the number of all last classifications.

S124. When the model clustering reference information is inconsistent with the clustering judgment information, repeatedly and iteratively update the weights in the convolutional neural network model to the model clustering reference information and the clustering judgment End when the information is consistent.

During the training process, adjust the weight of each node in the convolutional neural network model to make the Softmax cross-entropy loss function converge as much as possible, that is to continue to adjust the weight, the value of the loss function obtained no longer decreases, but instead increases, think Convolutional neural network training can end. The weight of each node is adjusted by gradient descent method, which is an optimization algorithm used in machine learning and artificial intelligence to recursively approximate the minimum deviation model.

Clustering knowledge entities through the trained convolutional neural network model can make the clustering results closer to the user's expectations.

As shown in FIG. 4, step S103 further includes the following steps:

S131. Acquire attributes of the two knowledge entities, where the attributes of the knowledge entity are data describing the corresponding knowledge entity;

In some cases, although the two knowledge entities are not similar in terms of content, the two knowledge entities correspond to an entity in reality, that is, the two knowledge entities describe the two entities of an entity in reality. Part of the information, for the convenience of use, it is also necessary to combine the two parts of information. Therefore, attribute similarity is introduced here. Get the attributes of the knowledge entity first. Attributes are the data used to describe the knowledge entity. They can also be called tags.

S132: Calculate the attribute similarity and vector similarity of the two knowledge entities;

Attribute similarity. In this embodiment of the present application, the editing distance is used to measure the similarity between two knowledge entities. Editing distance refers to the minimum number of operands required to convert character string A to character string B using character manipulation. Character operations include: deleting a character, modifying a character, and inserting a character. Here, the cost of each operation is set to 1, and the attribute similarity can be calculated by the following formula:

Attribute similarity = (1-edit distance)/maximum length of two attribute strings

Vector similarity, that is, the aforementioned cosine similarity or Euclidean distance, which measures the similarity between two knowledge entities.

S133. Calculate the weighted sum of the attribute similarity and vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:

S=aX+bY

Where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weight of the attribute similarity and the vector similarity, respectively .

Comprehensive attribute similarity and vector similarity can find two knowledge entities describing the same real entity when the content similarity is not high, and merge the knowledge entities describing the same real entity, which is convenient for users to use and knowledge base Maintenance.

Step S104 also includes the following steps:

S141. When the similarity is greater than the set second threshold, wherein the second threshold is greater than the first threshold, delete any one of the two knowledge entities from the knowledge base to be aligned.

When the similarity between two knowledge entities is very high, here we set a second threshold, the second threshold is greater than the aforementioned first threshold, for example, the set second threshold is 0.95, that is, the two knowledge entities are basically the same, then , Deleting any knowledge entity from the knowledge base is an effective way to remove redundancy.

As shown in FIG. 5, step S104 further includes the following steps:

S151. Split the two knowledge entities into several sub-entities;

When the similarity of the two knowledge entities is greater than the preset first threshold, it is considered that part of the content of the two knowledge entities is duplicated. In order to eliminate the duplicate content, the two knowledge entities can be divided into several sub-entities according to certain rules. , For example, according to the content segment.

S152. Select any two of the several sub-entities and calculate the similarity between the two sub-entities;

Select any two sub-entities after segmentation, calculate the similarity between the two sub-entities, that is, as described above, first vectorize the sub-entities, and then calculate the similarity between the vectors representing the sub-entities, which can be cosine similarity, It can also be Euclidean distance.

S153. When the similarity between the two sub-entities is greater than a preset third threshold, delete any one of the two sub-entities, where the third threshold is greater than the first threshold;

When the similarity between two sub-entities is greater than a preset threshold, this is referred to as a third threshold. It is considered that the content of the two sub-entities is basically duplicated, and any one of them is deleted. To avoid deleting too much content, the third threshold is required to be greater than the aforementioned first threshold.

S154. Repeat steps S152 and S153 until the similarity between any two sub-entities in the retained sub-entities is less than or equal to the preset third threshold;

Repeat the comparison of the similarity between the sub-entities, delete the sub-entities with high coincidence, so that the similarity of any two sub-entities in the retained sub-entities is less than or equal to the preset third threshold.

S155. Combine the reserved sub-entities as alignment entities of the two knowledge entities.

The retained sub-entities are merged as the alignment result of the two knowledge entities to be aligned before.

To solve the above technical problems, the embodiments of the present application also provide a knowledge base alignment device. For details, please refer to FIG. 6, which is a block diagram of the basic structure of the knowledge base alignment device of this embodiment.

As shown in FIG. 6, a knowledge base alignment device includes: an acquisition module 210, a processing module 220, a calculation module 230, and an execution module 240. Wherein, the obtaining module 210 is used to obtain a knowledge entity vector set, wherein the knowledge entity vector set is a vectorized representation of the knowledge entity in the knowledge base to be aligned; the processing module 220 is used to input the knowledge entity vector set Go to the preset knowledge entity clustering model to obtain the clustering result of the knowledge entities in the knowledge base to be aligned; the calculation module 230 is used to select any two knowledge entities belonging to the same class according to the clustering result, Calculate the similarity between the two knowledge entities; the execution module 240 is configured to merge the two knowledge entities when the similarity is greater than the set first threshold.

In this embodiment of the present application, by acquiring a knowledge entity vector set, and inputting the knowledge entity vector set into a preset knowledge entity clustering model, a clustering result of knowledge entities in the knowledge base to be aligned is obtained, according to the clustering As a result, any two knowledge entities belonging to the same category are selected, the similarity between the two knowledge entities is calculated, and when the similarity is greater than the set first threshold, the two knowledge entities are merged. The comparison of the similarity of two knowledge entities is limited to the same type of entity, which greatly reduces the amount of calculation. Among them, the calculation of similarity combines the attribute similarity of the entity and the vector similarity, making the calculation of similarity more reasonable and more effective. Discover and remove redundant information.

In some embodiments, the knowledge base alignment device further includes: a first acquisition submodule and a first processing submodule. The first acquisition submodule is used to acquire knowledge entities in the knowledge base to be aligned; the first processing submodule is used to vectorize the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.

In some embodiments, the predetermined knowledge entity clustering model in the knowledge base alignment device uses a DBSCAN density clustering algorithm.

In some embodiments, the knowledge entity clustering model preset in the knowledge base alignment device uses a convolution neural network-based clustering model.

In some embodiments, the calculation module 230 includes: a second acquisition submodule, a first calculation submodule, and a second calculation submodule. Wherein, a second acquisition submodule is used to acquire the attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing the corresponding knowledge entities; a first calculation submodule is used to calculate the two knowledges The attribute similarity and vector similarity of the entity; the second calculation submodule is used to calculate the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula, to obtain the relationship between the two knowledge entities Similarity, ie:

S=aX+bY

In some embodiments, the execution module 240 includes a first execution sub-module for when the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, from Any one of the two knowledge entities is deleted from the knowledge base to be aligned.

In some embodiments, the execution module 240 includes: a first division submodule, a third calculation submodule, a second execution submodule, a first loop submodule, and a third execution submodule. Wherein, the first division sub-module is used to divide the two knowledge entities into several sub-entities; the third calculation sub-module is used to select any two sub-entities of the plurality of sub-entities and calculate the two sub-entities The similarity between the two; the second execution sub-module is used to delete any one of the two sub-entities when the similarity between the two sub-entities is greater than a preset third threshold, wherein the The three thresholds are greater than the first threshold; the first loop submodule is used to make the third calculation submodule and the second execution submodule run repeatedly until the similarity between any two subentities in the reserved subentities is less than or Equal to a preset third threshold; a third execution sub-module, configured to merge the reserved sub-entities as an alignment entity of the two knowledge entities.

To solve the above technical problems, embodiments of the present application also provide computer equipment. For details, please refer to FIG. 7, which is a block diagram of the basic structure of the computer device of this embodiment.

As shown in FIG. 7, a schematic diagram of the internal structure of the computer device. As shown in FIG. 7, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The database may store a sequence of control information. When the computer-readable instructions are executed by the processor, the processor may implement a A method of knowledge base alignment. The processor of the computer device is used to provide calculation and control capabilities, and support the operation of the entire computer device. The computer device may store computer readable instructions in the memory. When the computer readable instructions are executed by the processor, the processor may cause the processor to perform a knowledge base alignment method. The network interface of the computer device is used to connect and communicate with the terminal. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. The specific computer equipment may It includes more or fewer components than shown in the figure, or some components are combined, or have a different component arrangement.

In this embodiment, the processor is used to execute the specific content of the acquisition module 210, the processing module 220, the calculation module 230, and the execution module 240 in FIG. 6. The memory stores computer-readable instructions and various types of data required to execute the above modules. The network interface is used for data transmission between user terminals or servers. The memory in this embodiment stores the computer-readable instructions and data required to execute all submodules in the knowledge base alignment method, and the server can call the computer-readable instructions and data of the server to execute the functions of all submodules.

The computer device obtains the knowledge entity vector set and inputs the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned, and according to the clustering result, Select any two knowledge entities that belong to the same class, calculate the similarity between the two knowledge entities, and when the similarity is greater than the set first threshold, merge the two knowledge entities. The comparison of the similarity of two knowledge entities is limited to the same type of entity, which greatly reduces the amount of calculation. Among them, the calculation of similarity combines the attribute similarity of the entity and the vector similarity, making the calculation of similarity more reasonable and more effective. Discover and remove redundant information

The present application also provides a storage medium storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the knowledge base alignment described in any of the foregoing embodiments Method steps.

Those of ordinary skill in the art may understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium When executed, the computer-readable instructions may include the processes of the foregoing method embodiments. Wherein, the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

It should be understood that although the steps in the flowchart of the drawings are displayed in order according to the arrows, the steps are not necessarily executed in the order indicated by the arrows. Unless there is a clear description in this article, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least a part of the steps in the flow chart of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and the execution order is also It is not necessarily performed sequentially, but may be executed in turn or alternately with at least a part of other steps or sub-steps or stages of other steps.

The above is only part of the implementation of this application. It should be pointed out that for those of ordinary skill in the art, without departing from the principles of this application, several improvements and retouches can be made. These improvements and retouches also It should be regarded as the scope of protection of this application.

Claims

A knowledge base alignment method is characterized by the following steps:

Acquiring a set of knowledge entity vectors, where the set of knowledge entity vectors is a vectorized representation of knowledge entities in the knowledge base to be aligned;

Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

According to the clustering result, select any two knowledge entities belonging to the same class, and calculate the similarity between the two knowledge entities;

When the similarity is greater than the set first threshold, the two knowledge entities are merged.
The knowledge base alignment method according to claim 1, wherein before the step of acquiring a knowledge entity vector set, the method further comprises the following steps:

Obtain the knowledge entities in the knowledge base to be aligned;

Vectorizing the knowledge entity based on the IF-IDF algorithm to obtain the knowledge entity vector set.
The knowledge base alignment method according to claim 1, wherein the predetermined knowledge entity clustering model uses a DBSCAN density clustering algorithm.
The knowledge base alignment method according to claim 1, wherein the preset knowledge entity clustering model uses a convolutional neural network-based clustering model, and the convolutional neural network-based clustering model The training consists of the following steps:

Obtaining training samples marked with clustering judgment information, the clustering judgment information of the training samples being the category of sample knowledge entities;

Input the training sample into a convolutional neural network model to obtain model clustering reference information of the training sample;

Comparing the model clustering reference information of different samples in the training samples with the clustering judgment information through a loss function;

When the model clustering reference information is inconsistent with the clustering judgment information, the weights in the convolutional neural network model are updated repeatedly and iteratively, until the model clustering reference information is consistent with the clustering judgment information At the end.
The knowledge base alignment method according to claim 1, wherein any two knowledge entities belonging to the same class are selected according to the clustering result, and the similarity between the two knowledge entities is calculated The steps include the following steps:

Acquiring attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing the corresponding knowledge entity;

Calculating the attribute similarity and vector similarity of the two knowledge entities;

The weighted sum of attribute similarity and vector similarity of the two knowledge entities is calculated according to the following formula to obtain the similarity between the two knowledge entities, namely:

S=aX+bY

Where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weight of the attribute similarity and the vector similarity, respectively .
The method for aligning a knowledge base according to claim 1, wherein the step of merging the two knowledge entities when the similarity is greater than a set first threshold further includes the following steps :

When the similarity is greater than the set second threshold, wherein the second threshold is greater than the first threshold, any one of the two knowledge entities is deleted from the knowledge base to be aligned.
The knowledge base alignment method according to claim 1, wherein the step of merging the two knowledge entities when the similarity is greater than a set first threshold further includes the following steps :

a. Divide the two knowledge entities into several sub-entities;

b. select any two of the several sub-entities and calculate the similarity between the two sub-entities;

c. When the similarity between the two sub-entities is greater than a preset third threshold, delete any one of the two sub-entities, wherein the third threshold is greater than the first threshold;

d. Repeat steps b and c until the similarity between any two sub-entities in the retained sub-entities is less than or equal to the preset third threshold;

e. Merging the reserved sub-entities as an alignment entity of the two knowledge entities.
A knowledge base alignment device, characterized in that it includes:

An acquisition module, for acquiring a vector set of knowledge entities, wherein the vector set of knowledge entities is a vectorized representation of the knowledge entities in the knowledge base to be aligned;

A processing module, configured to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

A calculation module, used to select any two knowledge entities belonging to the same class according to the clustering result, and calculate the similarity between the two knowledge entities;

The execution module is configured to merge the two knowledge entities when the similarity is greater than the set first threshold.
The knowledge base alignment device according to claim 8, further comprising:

A first acquisition sub-module for acquiring knowledge entities in the knowledge base to be aligned;

The first processing submodule is used to vectorize the knowledge entity based on the IF-IDF algorithm to obtain the knowledge entity vector set.
The knowledge base alignment device according to claim 8, wherein the calculation module comprises:

A second obtaining submodule, configured to obtain attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing corresponding knowledge entities;

A first calculation sub-module for calculating the attribute similarity and vector similarity of the two knowledge entities;

The second calculation submodule is used to calculate the weighted sum of attribute similarity and vector similarity of the two knowledge entities according to the following formula, to obtain the similarity between the two knowledge entities, namely:

S=aX+bY

Where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weight of the attribute similarity and the vector similarity, respectively .
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor is implemented as follows when executing the computer-readable instructions Steps of knowledge base alignment method:

Acquiring a set of knowledge entity vectors, where the set of knowledge entity vectors is a vectorized representation of knowledge entities in the knowledge base to be aligned;

Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

According to the clustering result, select any two knowledge entities belonging to the same class, and calculate the similarity between the two knowledge entities;

When the similarity is greater than the set first threshold, the two knowledge entities are merged.
The computer device according to claim 11, further comprising the following steps before the step of acquiring a vector set of knowledge entities:

Obtain the knowledge entities in the knowledge base to be aligned;

Vectorizing the knowledge entity based on the IF-IDF algorithm to obtain the knowledge entity vector set.
The computer equipment according to claim 11, wherein the predetermined knowledge entity clustering model uses a DBSCAN density clustering algorithm.
The computer device according to claim 11, wherein the preset knowledge entity clustering model uses a convolutional neural network-based clustering model, and the training of the convolutional neural network-based clustering model includes The following steps:

Obtaining training samples marked with clustering judgment information, the clustering judgment information of the training samples being the category of sample knowledge entities;

Input the training sample into a convolutional neural network model to obtain model clustering reference information of the training sample;

Comparing the model clustering reference information of different samples in the training samples with the clustering judgment information through a loss function;

When the model clustering reference information is inconsistent with the clustering judgment information, the weights in the convolutional neural network model are updated repeatedly and iteratively, until the model clustering reference information is consistent with the clustering judgment information At the end.
The computer device according to claim 11, wherein in the step of calculating any similarity between the two knowledge entities according to the clustering result, selecting any two knowledge entities belonging to the same class It includes the following steps:

Acquiring attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing the corresponding knowledge entity;

Calculating the attribute similarity and vector similarity of the two knowledge entities;

The weighted sum of attribute similarity and vector similarity of the two knowledge entities is calculated according to the following formula to obtain the similarity between the two knowledge entities, namely:

S=aX+bY

Where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weight of the attribute similarity and the vector similarity, respectively .
A readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by a processor, the processing is performed as follows:

Acquiring a set of knowledge entity vectors, where the set of knowledge entity vectors is a vectorized representation of knowledge entities in the knowledge base to be aligned;

Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;

According to the clustering result, select any two knowledge entities belonging to the same class, and calculate the similarity between the two knowledge entities;

When the similarity is greater than the set first threshold, the two knowledge entities are merged.
The readable storage medium according to claim 16, characterized in that, before the step of acquiring a knowledge entity vector set, the method further comprises the following steps:

Obtain the knowledge entities in the knowledge base to be aligned;

Vectorizing the knowledge entity based on the IF-IDF algorithm to obtain the knowledge entity vector set.
The readable storage medium according to claim 16, wherein the predetermined knowledge entity clustering model uses a DBSCAN density clustering algorithm.
The readable storage medium according to claim 16, wherein the predetermined knowledge entity clustering model uses a convolutional neural network-based clustering model, and the convolutional neural network-based clustering model The training consists of the following steps:

Obtaining training samples marked with clustering judgment information, the clustering judgment information of the training samples being the category of sample knowledge entities;

Input the training sample into a convolutional neural network model to obtain model clustering reference information of the training sample;

Comparing the model clustering reference information of different samples in the training samples with the clustering judgment information through a loss function;

When the model clustering reference information is inconsistent with the clustering judgment information, the weights in the convolutional neural network model are updated repeatedly and iteratively, until the model clustering reference information is consistent with the clustering judgment information At the end.
The readable storage medium according to claim 16, characterized in that, according to the clustering result, any two knowledge entities belonging to the same class are selected, and the similarity between the two knowledge entities is calculated The steps include the following steps:

Acquiring attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing the corresponding knowledge entity;

Calculating the attribute similarity and vector similarity of the two knowledge entities;

The weighted sum of attribute similarity and vector similarity of the two knowledge entities is calculated according to the following formula to obtain the similarity between the two knowledge entities, namely:

S=aX+bY

Where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weight of the attribute similarity and the vector similarity, respectively .