CN110941722A

CN110941722A - Knowledge graph fusion method based on entity alignment

Info

Publication number: CN110941722A
Application number: CN201910967655.9A
Authority: CN
Inventors: 赵翔; 曾维新; 唐九阳; 徐浩; 谭真; 殷风景; 葛斌; 肖卫东
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-03-31
Anticipated expiration: 2039-10-12
Also published as: CN110941722B

Abstract

The invention discloses a knowledge graph fusion method based on entity alignment, which comprises the following steps: acquiring the data of two knowledge graphs; learning the structure vectors of the entities with a graph convolutional network, and representing the entity names as word vectors; calculating a composite distance between entities to represent their degree of similarity; performing entity alignment with an iterative training framework based on curriculum learning; and fusing the two knowledge graphs into one knowledge graph according to the entity alignment result. The method designs a basic entity alignment framework that integrates structural features and entity name features; it designs an iterative training method based on curriculum learning, in which the training data are augmented from easy to difficult; and it adopts a word mover's distance model to re-rank the preliminary alignment results, so that entity name information is fully exploited and the fusion of knowledge graphs is more accurate and comprehensive.

Description

Knowledge graph fusion method based on entity alignment

Technical Field

The invention belongs to the field of knowledge graph generation and fusion, and particularly relates to a knowledge graph fusion method based on entity alignment.

Background

In recent years, a large number of knowledge graphs (KGs) have emerged, such as YAGO, DBpedia, NELL, CN-DBpedia, and Zhishi. These large-scale knowledge graphs play an important role in intelligent services such as question-answering systems and personalized recommendation. In addition, to meet domain-specific needs, more and more domain knowledge graphs, such as medical knowledge graphs, are being built. During knowledge graph construction, a trade-off between coverage and accuracy is unavoidable, and no knowledge graph can be complete or entirely correct.

To improve the coverage and accuracy of a knowledge graph, one feasible approach is to introduce relevant knowledge from other knowledge graphs, because knowledge graphs constructed in different ways contain redundant and complementary knowledge. For example, a general-purpose knowledge graph extracted from web pages may contain only the name of a drug, while more information can be found in a medical knowledge graph constructed from medical data. To integrate knowledge from an external knowledge graph into the target knowledge graph, the most important step is to align the different knowledge graphs. The Entity Alignment (EA) task aims to find pairs of entities in different knowledge graphs that refer to the same thing. These entity pairs serve as hubs that link the different knowledge graphs and support subsequent tasks.

At present, mainstream entity alignment methods mainly judge whether two entities refer to the same thing by means of the structural features of the knowledge graphs. Such methods assume that entities expressing the same meaning in different knowledge graphs have similar neighborhood information. On artificially constructed data sets, this type of method achieves the best experimental results. However, a recent work has pointed out that the knowledge graphs in these manually constructed data sets are much denser than real-world knowledge graphs, and that entity alignment methods based on structural features perform far worse on knowledge graphs with a realistic distribution.

In fact, an analysis of the entity distribution in real-world knowledge graphs shows that more than half of the entities are connected to only one or two other entities. These entities, called long-tail entities, make up the majority of knowledge graph entities, so that the graph as a whole is highly sparse. This also matches the common understanding of real-world knowledge graphs: only a few entities are frequently used and have rich neighborhood information, while most entities are mentioned only rarely and carry little structural information. Therefore, current entity alignment methods based on structural information do not perform well on real-world data sets.

In addition, the lack of annotated data greatly limits the effectiveness of entity alignment. To map the representation vectors of different knowledge graphs into the same space, enough annotated data is needed as a link, but the number of known entity pairs is limited. To address this problem, some methods adopt Iterative Training (IT), selecting high-confidence entity pairs from the test set results and using them as training data for the next round; however, these methods tend to introduce erroneous samples and are inefficient. Moreover, on data sets with a real-world distribution, such iterative training frameworks can introduce only a small number of high-confidence entity pairs and do not bring an obvious improvement.

Disclosure of Invention

In view of this, the present invention provides a knowledge graph fusion method based on entity alignment, which overcomes the shortcomings of the prior art, and is used for identifying and aligning the same or similar entities from a plurality of knowledge graphs, thereby implementing knowledge fusion of the plurality of knowledge graphs and improving the coverage rate and accuracy rate of the knowledge graphs.

Based on the above purpose, a knowledge graph fusion method based on entity alignment comprises the following steps:

step 1, acquiring the data of two knowledge graphs;

step 2, learning the structure vectors of the entities by using a graph convolutional network, and representing the names of the entities as word vectors;

step 3, calculating the composite distance between entities to express the degree of similarity between them;

step 4, performing entity alignment with an iterative training framework based on curriculum learning;

step 5, fusing the two knowledge graphs into one knowledge graph according to the entity alignment result.
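Step 5 is not elaborated further in the text, so the following is only a minimal sketch of one way the fusion could be carried out once an alignment is available. The triple format, the `alignment` dictionary (mapping an entity of the second graph to its counterpart in the first), and the entity names are all hypothetical and chosen for illustration.

```python
def fuse_graphs(triples_1, triples_2, alignment):
    """Merge two sets of (head, relation, tail) triples into one graph.

    alignment maps an entity of the second graph to the entity of the first
    graph it was aligned with (hypothetical format, not from the source).
    """
    def canonical(entity):
        # Aligned entities are rewritten to their counterpart; others are kept.
        return alignment.get(entity, entity)

    fused = set(triples_1)
    for head, relation, tail in triples_2:
        fused.add((canonical(head), relation, canonical(tail)))
    return fused


g1 = {("aspirin", "treats", "headache")}
g2 = {
    ("acetylsalicylic_acid", "treats", "headache_g2"),
    ("acetylsalicylic_acid", "interacts_with", "warfarin"),
}
alignment = {"acetylsalicylic_acid": "aspirin", "headache_g2": "headache"}
fused = fuse_graphs(g1, g2, alignment)
```

In this toy run the duplicated "treats" fact collapses into a single triple, while the complementary "interacts_with" fact is attached to the aligned entity, which is the redundancy and complementarity that motivates the fusion.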

The two knowledge graphs are represented as G₁＝(E₁,R₁,T₁) and G₂＝(E₂,R₂,T₂), where E denotes the entities, R denotes the relations, and $T \subseteq E \times R \times E$ denotes the triples in the graph. The known entity pairs are represented as

$$S = \{(e_1, e_2) \mid e_1 \in E_1,\ e_2 \in E_2,\ e_1 = e_2\}$$

The entity alignment task aims to find new entity pairs by using the known entity pair information and to generate the final alignment result

$$\{(e_1, e_2) \mid e_1 \in E_1,\ e_2 \in E_2,\ e_1 = e_2\}$$

where the equal sign indicates that the two entities refer to the same real-world entity;

in step 4, entity alignment is performed with the iterative training framework based on curriculum learning. In this framework, the input of each round of iterative training is the knowledge graphs to be aligned and the aligned entity pairs, where the aligned entity pairs form the training set, and the output is the alignment result and the augmented training set; high-confidence entity pairs are obtained and added to the training data for the next round of training; once high-confidence entity pairs in the test set are added to the training set, they no longer appear in the test set of the next round, and the iterative training continues until the number of newly added entity pairs is lower than a given threshold θ₂;

A high-confidence entity pair is defined as follows: for each entity e₁ to be aligned in G₁, suppose its nearest entity in G₂ is e₂ and its second-nearest entity is e₂′, with distance difference Δ₁＝D(e₁,e₂′)-D(e₁,e₂). If, for e₂, the nearest entity in G₁ is exactly e₁ and the second-nearest entity is e₁′, with distance difference Δ₂＝D(e₂,e₁′)-D(e₂,e₁), and Δ₁≥θ₁ and Δ₂≥θ₁, then (e₁,e₂) is considered a high-confidence entity pair, where θ₁ is a preset distance difference threshold.

Specifically, in step 2, two two-layer graph convolutional networks are used, each processing the data of one knowledge graph and generating the corresponding entity structure vectors;

in step 3, for entities e₁∈G₁ and e₂∈G₂ of the two knowledge graphs, the structural distance in the structural space is D_s(e₁,e₂)＝||e₁-e₂||_l1/d_s, where e₁ and e₂ here denote the structure vectors of the entities and d_s is the dimension of the structural matrix; the text feature distance is D_t(e₁,e₂)＝||ne(e₁)-ne(e₂)||_l1/d_t. Suppose the name of entity e contains the words w₁,w₂,...,w_p; then the entity name vector may be represented as the average of these word vectors, i.e.

$$ne(e) = \frac{1}{p}\sum_{i=1}^{p}\mathbf{w}_i$$

where $\mathbf{w}_i$ is the word vector of w_i and d_t is the dimension of the name vector matrix;

the fusion formula of the composite distance in step 3 is as follows:

D(e₁,e₂)＝αD_s(e₁,e₂)+(1-α)D_t(e₁,e₂)

wherein α is a hyperparameter used to adjust the weights of two features.

Preferably, the text feature distance is calculated by a word mover's distance model, which is intended to measure the difference between two sentences; the word mover's distance is the minimum total distance that the embedded vectors of all words in one entity name need to travel to reach the embedded vectors of all words in the other entity name.

Specifically, the input of the graph convolutional network is the feature matrix of the entities $X \in \mathbb{R}^{N \times P}$ and the adjacency matrix A of the graph, and the output is a feature matrix with structural information $Z \in \mathbb{R}^{N \times F}$, where N is the number of nodes in the graph and P and F are the feature dimensions of the input and output matrices, respectively. Assume that the input of the l-th layer is the node feature matrix $H^{l} \in \mathbb{R}^{N \times d^{l}}$, where d^l is the dimension of the feature matrix of the l-th layer; for the first layer, H¹＝X and d¹＝P. The output of the l-th layer is

$$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right)$$

where $\hat{A} = A + I$, I is the identity matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, $W^{l} \in \mathbb{R}^{d^{l} \times d^{l+1}}$ is the parameter matrix of the l-th layer, and d^{l+1} is the dimension of the feature matrix of the next layer. The activation function σ is usually set to ReLU. For the last layer, H^{l+1}＝Z and d^{l+1}＝F.

Specifically, the initial feature matrix X is obtained by sampling from an L2-normalized truncated normal distribution and is updated through the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix Z is generated; the dimension of the feature matrix is always set to d_s, i.e. P＝F＝d^l＝d_s, and the two GCNs share the parameter matrices W¹ and W² of the two layers.

Specifically, the training objective is to minimize the following loss value:

$$L = \sum_{(e_1,e_2)\in S}\ \sum_{(e_1',e_2')\in S'_{(e_1,e_2)}} \left[\lVert \mathbf{e}_1-\mathbf{e}_2\rVert_{l1} - \lVert \mathbf{e}_1'-\mathbf{e}_2'\rVert_{l1} + \gamma\right]_+$$

where $[x]_+ = \max\{0, x\}$, $S'_{(e_1,e_2)}$ denotes the set of negative samples generated from the known entity pair (e₁,e₂) by replacing e₁ or e₂ with a random entity, $\mathbf{e}$ denotes the structure vector of entity e, and γ denotes the margin separating positive samples from negative samples. Model optimization is performed with stochastic gradient descent.

Specifically, the difficulty of a curriculum can be characterized by the degree of the entity node: entities with higher degrees have richer structural information and are easier to align, whereas long-tail entities with low degrees are relatively difficult to align. In the iterative training process, easy entity pairs are added first and difficult entity pairs are added later, so that the model is trained from easy to difficult.

In particular, assume that there are δ curricula from simple to difficult, c₁,…,c_δ, which represent a series of entity node degree values from large to small. Among the high-confidence entity pairs obtained in each round of iterative training, only those whose node degree is greater than c₁ are added to the training set, and iterative training is repeated under this condition until the number of entity pairs meeting the requirement is lower than a given threshold θ₂, at which point training at this curriculum difficulty stops;

in the next stage of training, the curriculum difficulty is adjusted: the condition is changed to selecting, from the high-confidence entity pairs, those whose degree is greater than c₂ and adding them to the training set, and iterative training is repeated at this curriculum difficulty until the number of newly added entity pairs meeting the requirement is lower than the given threshold θ₂, at which point training at this curriculum difficulty stops; finally, the above steps are repeated to traverse the remaining curriculum difficulties c₃,…,c_δ.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) A basic entity alignment framework that fuses structural features and entity name features is designed. Because entity names and structural information complement each other, and entity names are not affected by the degree of the entity node, this basic framework can greatly improve the alignment results of long-tail entities and thereby improve the overall alignment performance.

(2) On top of the basic entity alignment framework, an iterative training strategy based on Curriculum Learning (CL) is designed, which can markedly improve entity alignment while maintaining training efficiency. Inspired by the idea of curriculum learning, the method uses the entity node degree as the measure of difficulty, treats entities with higher degrees as simple curricula and long-tail entities as difficult curricula, and adds high-confidence entity pairs to the training set from simple to difficult. This optimizes the iterative training procedure, improves the accuracy of the structural feature representation, and makes it easier for model training to reach the optimum.

(3) A re-ranking model based on the Word Mover's Distance (WMD) is designed: on the entity ranking result generated by the first two components, the word mover's distance model is used to further exploit entity name information, which is combined with the structural information to optimize the entity alignment result.

Drawings

Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

As shown in fig. 1, a knowledge-graph fusion method based on entity alignment includes the following steps:

step 1, acquiring the data of two knowledge graphs;

For a better understanding of the present disclosure, the meanings of the symbols used are given below.

H^l: structural feature matrix of the l-th layer
N: number of nodes
X: initial structural feature matrix
d^l: dimension of the feature matrix of the l-th layer
Z: final structural feature matrix
S: known entity pairs
A: adjacency matrix
d_s: dimension of the structural matrix
W^l: parameter matrix of the l-th layer
D_s: distance between entities in the structural space
$\mathbf{e}$: structure vector of entity e
d_t: dimension of the name vector matrix
P: dimension of the initial structural feature matrix
F: dimension of the final structural feature matrix
$\mathbf{N}$: name vector matrix of the entities
D: distance between entities
G₁: knowledge graph 1 to be aligned
G₂: knowledge graph 2 to be aligned
e₁: entity in G₁
e₂: entity in G₂
Δ₁: difference between the distances from e₁ to its second-nearest and nearest entities
Δ₂: difference between the distances from e₂ to its second-nearest and nearest entities
θ₁: distance difference threshold
θ₂: threshold on the number of newly added entity pairs

The entity alignment problem is formally described as follows. Given two knowledge graphs G₁＝(E₁,R₁,T₁) and G₂＝(E₂,R₂,T₂), E denotes the entities, R denotes the relations, and $T \subseteq E \times R \times E$ denotes the triples in the graph. The known entity pairs are represented as

$$S = \{(e_1, e_2) \mid e_1 \in E_1,\ e_2 \in E_2,\ e_1 = e_2\}$$

where the equal sign indicates that the two entities refer to the same real-world entity. Given an entity, the process of finding its corresponding entity in the other knowledge graph can be regarded as a ranking problem. That is, in a given feature space, the similarity (distance) between the given entity and all entities in the other knowledge graph is calculated and ranked, and the entity with the highest similarity (smallest distance) is taken as the alignment result.

Take medical knowledge graphs as an example. To obtain more medical knowledge, several independent medical knowledge graphs can be fused, and to fuse them well, the entities in the medical knowledge graphs need to be identified, including drug names, disease names, and symptom names. These three types of entities are the most basic entities of a medical knowledge graph; aligning them satisfies the most basic requirements of the medical knowledge graph, and the extraction of other entities can be decided according to actual needs.

This embodiment uses a Graph Convolutional Network (GCN) to capture the neighborhood structure information of entities and to generate entity structure representation vectors. The GCN is a convolutional network that acts directly on graph-structured data and generates the structure vector of each node by capturing the structural information around it. The input of the GCN is the feature matrix of the entities $X \in \mathbb{R}^{N \times P}$ and the adjacency matrix A of the graph. The output is a feature matrix with structural information $Z \in \mathbb{R}^{N \times F}$. N is the number of nodes in the graph, while P and F are the feature dimensions of the input and output matrices, respectively.

GCN models typically contain multiple GCN layers. In particular, assume that the input of the l-th layer is the node feature matrix $H^{l} \in \mathbb{R}^{N \times d^{l}}$, where d^l is the dimension of the feature matrix of the l-th layer (for the first layer, H¹＝X and d¹＝P). The output of the l-th layer is

$$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right)$$

where $\hat{A} = A + I$, I is the identity matrix, and $\hat{D}$ is the diagonal degree matrix of $\hat{A}$. $W^{l} \in \mathbb{R}^{d^{l} \times d^{l+1}}$ is the parameter matrix of the l-th layer, and d^{l+1} is the dimension of the feature matrix of the next layer. The activation function σ is usually set to ReLU. For the last layer, H^{l+1}＝Z and d^{l+1}＝F.
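The layer-wise propagation described above can be sketched in plain Python. This is only an illustration under assumptions: it uses the standard GCN propagation rule with self-loops and symmetric normalization, applies ReLU on every layer, and replaces the trained parameter matrices with identity matrices on a hypothetical three-node path graph; the helper names `matmul` and `gcn_layer` are not from the source.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of the graph with self-loops (diagonal of D-hat).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, h), w)
    return [[max(0.0, value) for value in row] for row in out]


# Toy path graph 0 - 1 - 2 with two-dimensional features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for trained parameter matrices

h2 = gcn_layer(adj, x, identity)   # output of the first layer
z = gcn_layer(adj, h2, identity)   # output of the second layer
```

After two layers, each row of `z` mixes information from nodes up to two hops away, which is why low-degree entities end up with little structural signal.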

In this embodiment, two two-layer GCNs are constructed, each of which processes one knowledge graph and generates the corresponding entity vectors. The initial feature matrix X is obtained by sampling from an L2-normalized truncated normal distribution and is updated through the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix Z is generated. The dimension of the feature matrix is always set to d_s (P＝F＝d^l＝d_s), and the two GCNs share the parameter matrices W¹ and W² of the two layers.

The entity structure vectors of different knowledge graphs are not in the same space, so they need to be aligned into the same space by using the known entity pairs S. The specific training objective is to minimize the following loss value:

$$L = \sum_{(e_1,e_2)\in S}\ \sum_{(e_1',e_2')\in S'_{(e_1,e_2)}} \left[\lVert \mathbf{e}_1-\mathbf{e}_2\rVert_{l1} - \lVert \mathbf{e}_1'-\mathbf{e}_2'\rVert_{l1} + \gamma\right]_+$$

where $[x]_+ = \max\{0, x\}$ and $S'_{(e_1,e_2)}$ denotes the set of negative samples generated from the known entity pair (e₁,e₂) by replacing e₁ or e₂ with a random entity. $\mathbf{e}$ denotes the structure vector of entity e, and γ denotes the margin separating positive samples from negative samples. Model optimization is performed with stochastic gradient descent.
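As an illustration of this objective, the sketch below evaluates a margin-based ranking loss on toy vectors. It assumes the loss takes the usual hinge form with L1 distances, consistent with the definitions of [x]₊, the negative sample set, and γ given above; the function names and the toy values are hypothetical, and no gradient step is shown.

```python
def l1_distance(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def margin_loss(positives, negatives, gamma):
    """Sum of hinge terms [d(pos) - d(neg) + gamma]_+ over all negatives.

    positives: list of (vec1, vec2) for known entity pairs.
    negatives: for each positive, a list of corrupted (vec1', vec2') pairs.
    """
    total = 0.0
    for (e1, e2), corrupted in zip(positives, negatives):
        positive_distance = l1_distance(e1, e2)
        for n1, n2 in corrupted:
            total += max(0.0, positive_distance - l1_distance(n1, n2) + gamma)
    return total


positives = [([0.0, 0.0], [0.0, 1.0])]  # distance 1
negatives = [[
    ([0.0, 0.0], [3.0, 3.0]),  # distance 6: already separated by the margin
    ([0.0, 0.0], [1.0, 1.0]),  # distance 2: violates the margin
]]
loss = margin_loss(positives, negatives, gamma=3.0)
```

Only the second negative contributes here (1 - 2 + 3 = 2), which shows that the loss pushes negatives at least γ further away than the positive pair and ignores those that are already far enough.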

Given the final structural feature matrix Z, the distance between e₁∈G₁ and e₂∈G₂ in the structural space is:

D_s(e₁,e₂)＝||e₁-e₂||_l1/d_s

If only structural features are considered, the entity closest to the target entity e under the distance D_s is regarded as the corresponding entity of e.

Unlike the prior art, this embodiment proposes to use text features for alignment at the same time. Specifically, the textual form of the entity name is chosen because 1) entity names are commonly used to identify entities and are widely available; 2) comparing entity names gives a direct indication of whether two entities are the same; and 3) the name feature is not affected by the size of the training set and is therefore more stable.

Although conventional string comparison methods can be used to measure the similarity between two entity names, this embodiment uses the semantic similarity between entity names, because it remains applicable when the knowledge graphs differ greatly, for example when aligning multilingual knowledge graphs. In particular, the average word vector representation is simple and general and can express semantic information without a special corpus, so it is used as the entity name vector. Suppose the name of entity e contains the words w₁,w₂,...,w_p; then the entity name vector can be represented as the average of these word vectors, i.e.

$$ne(e) = \frac{1}{p}\sum_{i=1}^{p}\mathbf{w}_i$$

where $\mathbf{w}_i$ is the word vector of w_i. The name vectors of all entities can be represented as $\mathbf{N}$.

As with word vectors, similar entity names are close to each other in the vector space. The distance between e₁∈G₁ and e₂∈G₂ in the text feature space is D_t(e₁,e₂)＝||ne(e₁)-ne(e₂)||_l1/d_t. If only entity name features are considered, the entity closest to the target entity e under the distance D_t is regarded as the corresponding entity of e. For cross-language entity alignment, pre-trained cross-language word vectors can be used, which ensures that the name vectors of entities in different languages lie in the same space.
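The name vector and the text feature distance can be sketched as follows. The two-dimensional word vectors are toy values standing in for pre-trained embeddings, whitespace tokenization and lowercasing are simplifying assumptions, and mapping a name with no known words to the zero vector is likewise an assumption made only for this example.

```python
def name_vector(name, vectors):
    """Average the word vectors of the words in an entity name."""
    dim = len(next(iter(vectors.values())))
    found = [vectors[word] for word in name.lower().split() if word in vectors]
    if not found:
        # Assumption: a name with no known word maps to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def name_distance(name1, name2, vectors):
    """D_t: L1 distance between name vectors, divided by the dimension d_t."""
    v1 = name_vector(name1, vectors)
    v2 = name_vector(name2, vectors)
    return sum(abs(a - b) for a, b in zip(v1, v2)) / len(v1)


vectors = {"new": [1.0, 0.0], "york": [0.0, 1.0], "city": [1.0, 1.0]}
same = name_distance("New York", "york new", vectors)
different = name_distance("New York", "City", vectors)
```

Because averaging discards word order and individual word contributions, the two reordered names receive distance zero; this loss of detail is the limitation that the later re-ranking step is meant to address.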

Considering that structural and name features depict entities from two different aspects, structural and semantic, respectively, they can be further combined to provide a more comprehensive alignment cue. In particular, two entities e₁∈G₁And e₂∈G₂The distance between them is:

D(e₁,e₂)＝αD_s(e₁,e₂)+(1-α)D_t(e₁,e₂)

where α is the hyperparameter used to adjust the weights of the two features. In the space obtained after feature fusion, the entity closest to the target entity e under the distance D is regarded as the corresponding entity of e.
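A small sketch of the fusion step, assuming the structural and name distances have already been computed as matrices whose entry [i][j] is the distance between the i-th entity of G₁ and the j-th entity of G₂. The matrices are made-up toy values, and α＝0.3 is used only as an example setting.

```python
def composite_distance(d_s, d_t, alpha):
    """D = alpha * D_s + (1 - alpha) * D_t, applied entry by entry."""
    return [
        [alpha * s + (1 - alpha) * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(d_s, d_t)
    ]


def nearest(dist):
    """Index of the closest candidate for every row of a distance matrix."""
    return [min(range(len(row)), key=row.__getitem__) for row in dist]


d_s = [[0.2, 0.8], [0.6, 0.1]]  # toy structural distances
d_t = [[0.5, 0.1], [0.3, 0.4]]  # toy name distances
d = composite_distance(d_s, d_t, alpha=0.3)

by_structure = nearest(d_s)
by_name = nearest(d_t)
by_both = nearest(d)
```

With these values the structural and name features disagree for both entities, and the fused distance settles each case according to the weighting; lowering α shifts the decision further toward the name feature.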

The amount of labeled data is limited, so the vectors of different knowledge graphs cannot be effectively mapped into the same space, which limits the performance of entity alignment. Therefore, this embodiment proposes to add high-confidence entity alignment results to the training data of the next round, from simple to difficult, iteratively enlarging the training set and improving the entity alignment result. A basic iterative training framework is introduced first, followed by an explanation of how the idea of curriculum learning is applied to the iterative framework to optimize training.

The input of each round of iterative training is the knowledge graphs to be aligned and the aligned entity pairs (the training set), and the output is the alignment result and the augmented training set. One of the simplest augmentation methods is as follows: for each entity e₁ to be aligned in G₁, suppose its nearest entity in G₂ is e₂; if, for e₂, the nearest entity in G₁ is exactly e₁, then (e₁,e₂) can be considered a high-confidence entity pair and added to the training data. However, this process inevitably introduces some wrong entity pairs, which negatively affects subsequent training. Once wrong entity pairs have been added, it is difficult to re-evaluate their correctness or to remove them from the training data.

To this end, this embodiment proposes a simple method that greatly reduces the probability of introducing erroneous entity pairs. For each entity e₁ to be aligned in G₁, suppose its nearest entity in G₂ is e₂ and its second-nearest entity is e₂′, with distance difference Δ₁＝D(e₁,e₂′)-D(e₁,e₂). If, for e₂, the nearest entity in G₁ is exactly e₁ and the second-nearest entity is e₁′, with distance difference Δ₂＝D(e₂,e₁′)-D(e₂,e₁), and Δ₁≥θ₁ and Δ₂≥θ₁, then (e₁,e₂) can be considered a high-confidence entity pair and is added to the training data for the next round of training. This method applies a stricter selection criterion to high-confidence entity pairs: the two entities must be each other's nearest neighbor, and there must be a certain distance gap between the nearest entity and the second-nearest entity. This ensures, to some extent, the correctness of the newly added entity pairs. The iterative training continues until the number of newly added entity pairs is below a given threshold θ₂.
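The selection rule can be sketched as a function over a distance matrix, where `dist[i][j]` stands for D between the i-th entity of G₁ and the j-th entity of G₂. The matrix values are invented for the example, and treating a single-candidate case as having an infinite gap is an assumption.

```python
def best_and_gap(values):
    """Return the index of the smallest value and its gap to the second smallest."""
    order = sorted(range(len(values)), key=values.__getitem__)
    if len(order) == 1:
        return order[0], float("inf")  # assumption: a lone candidate has no rival
    return order[0], values[order[1]] - values[order[0]]


def high_confidence_pairs(dist, theta1):
    """Select mutual nearest neighbours whose gaps in both directions reach theta1."""
    pairs = []
    for i, row in enumerate(dist):
        j, gap_1 = best_and_gap(row)                # nearest e2 for e1, and Delta_1
        column = [other_row[j] for other_row in dist]
        i_back, gap_2 = best_and_gap(column)        # nearest e1 for e2, and Delta_2
        if i_back == i and gap_1 >= theta1 and gap_2 >= theta1:
            pairs.append((i, j))
    return pairs


dist = [
    [0.10, 0.50, 0.60],  # clear mutual match with candidate 0
    [0.45, 0.20, 0.22],  # nearest and second nearest are too close
    [0.70, 0.40, 0.30],  # candidate 2 prefers another entity
]
pairs = high_confidence_pairs(dist, theta1=0.05)
```

Only the first entity passes: the second is rejected because its gap of 0.02 is below θ₁, and the third because its nearest candidate is closer to a different entity.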

It should be noted that in the iterative training framework designed in this embodiment, when the high-confidence entity pairs in the test set are added to the training set, the high-confidence entity pairs will not appear in the next test set, i.e. the number of entities in the test set is continuously reduced. This can improve the alignment effect of the remaining entities in the test set to some extent, because the number of candidate entities is greatly reduced compared to the original. Experimental results show that the iterative training framework provided by the invention can bring better effect.

The main idea of curriculum learning is to imitate human learning: by learning from simple to difficult, the model finds a good local optimum more easily and training is accelerated. In the entity alignment task, the difficulty of a curriculum can be characterized by the degree of the entity node: entities with higher degrees have richer structural information and are easier to align, whereas long-tail entities with low degrees are relatively difficult to align. Therefore, in the iterative training process, easy entity pairs are added first and difficult entity pairs later, so that the model is trained from easy to difficult and training reaches the optimum more easily.

In particular, assume that there are δ curricula from simple to difficult, c₁,…,c_δ, which represent a series of entity node degree values from large to small. Among the high-confidence entity pairs obtained in each round of iterative training, only those whose node degree is greater than c₁ are added to the training set, and iterative training is repeated under this condition until the number of entity pairs meeting the requirement is lower than a given threshold θ₂, at which point training at this curriculum difficulty stops.

In the next stage of training, the curriculum difficulty is adjusted: the condition is changed to selecting, from the high-confidence entity pairs, those whose degree is greater than c₂ and adding them to the training set, and iterative training is repeated at this curriculum difficulty until the number of newly added entity pairs meeting the requirement is lower than the given threshold θ₂, at which point training at this curriculum difficulty stops. Finally, the above steps are repeated to traverse the remaining curriculum difficulties c₃,…,c_δ.
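The curriculum schedule can be sketched as a driver loop. The callbacks `train` and `propose` are hypothetical placeholders for model training and high-confidence pair selection, filtering on the degree of the G₁ entity alone is an assumption (the text does not say which entity's degree is checked), and the toy run uses stubs instead of a real model.

```python
def curriculum_iterative_training(seed_pairs, degree, curricula, theta2, train, propose):
    """Grow the training set from easy (high-degree) to hard (low-degree) pairs.

    train(training_pairs) -> model and propose(model, training_pairs) -> iterable
    of high-confidence (e1, e2) pairs are hypothetical caller-supplied callbacks.
    """
    if theta2 < 1:
        raise ValueError("theta2 must be at least 1, otherwise the loop never stops")
    training = set(seed_pairs)
    history = []  # (degree threshold, pairs added in that round)
    for threshold in curricula:  # e.g. [10, 6, 4, 2, 0], from easy to hard
        while True:
            model = train(training)
            new_pairs = {
                (e1, e2)
                for e1, e2 in propose(model, training)
                if (e1, e2) not in training and degree[e1] > threshold
            }
            training |= new_pairs
            history.append((threshold, sorted(new_pairs)))
            if len(new_pairs) < theta2:  # too few additions: next curriculum
                break
    return training, history


# Toy run with stub callbacks: the model is ignored and every pair is proposed.
degree = {"a": 12, "b": 7, "c": 3, "d": 1}
gold = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
training, history = curriculum_iterative_training(
    seed_pairs=[("s", "S")],
    degree=degree,
    curricula=[10, 6, 4, 2, 0],
    theta2=1,
    train=lambda pairs: None,
    propose=lambda model, pairs: gold,
)
added_in_order = [pair for _, pairs in history for pair in pairs]
```

In the toy run the pairs enter the training set in order of decreasing degree, so the high-degree entities shape the shared space before the long-tail ones are admitted.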

Iterative training based on curriculum learning generates more accurate entity representation vectors by optimizing the way high-confidence entity pairs are added, and thereby further improves alignment.

The iterative training framework based on curriculum learning already greatly improves the accuracy of entity alignment. On this basis, the embodiment further exploits entity name information: a word mover's distance model is adopted to re-rank the preliminary results and optimize the entity alignment result.

The word mover's distance model aims to measure the difference between two sentences; it is the minimum total distance that the embedded vectors of all words in one sentence need to travel to reach the embedded vectors of all words in the other sentence. Compared with the distance between average word vectors, the word mover's distance better reflects the influence of each word on the sentence as a whole and avoids the semantic loss caused by averaging. However, the model is time-consuming because word-level distances must be computed, and it is not suitable for large-scale data. For this reason, it is not used to calculate the distance between entity names from the start; instead, it is used to re-rank the preliminary results.

Specifically, after the iterative training based on curriculum learning is finished, for each entity to be aligned in the test set, the h entities closest to it in the other knowledge graph are retained and fed into the word mover's distance model, and the distances between the entities in the entity name space are recalculated. Finally, the updated entity name distance is combined with the structural distance through the composite distance formula D＝αD_s+(1-α)D_t to obtain the new distance between entities and the re-ranked alignment result.
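The re-ranking step can be sketched as follows. Note that this example does not solve the exact word mover's distance, which requires an optimal transport solver; it swaps in the relaxed word mover's distance, a known lower bound in which every word simply moves to its nearest word in the other name. The word vectors, candidate names, and structural distances are toy values, and combining the relaxed distance with D_s without rescaling is an assumption.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words1, words2, vectors):
    """Relaxed word mover's distance (lower bound on exact WMD), uniform word weights."""
    def one_way(source, target):
        # Each source word travels to its closest target word.
        return sum(
            min(euclidean(vectors[s], vectors[t]) for t in target) for s in source
        ) / len(source)
    return max(one_way(words1, words2), one_way(words2, words1))


def rerank(query_words, candidates, vectors, alpha):
    """Re-rank the h retained candidates by alpha * D_s + (1 - alpha) * name distance.

    candidates: list of (entity id, name words, structural distance D_s).
    """
    scored = [
        (alpha * d_s + (1 - alpha) * relaxed_wmd(query_words, words, vectors), entity)
        for entity, words, d_s in candidates
    ]
    return [entity for _, entity in sorted(scored)]


vectors = {"new": [1.0, 0.0], "york": [0.0, 1.0], "city": [1.0, 1.0], "old": [-1.0, 0.0]}
query = ["new", "york", "city"]
candidates = [("B", ["old", "york"], 0.4), ("A", ["york", "city", "new"], 0.5)]
reranked = rerank(query, candidates, vectors, alpha=0.3)
```

Candidate B is structurally closer, but its name requires moving "new" and "city" a long way, so candidate A moves to the top after re-ranking. Restricting the computation to h candidates keeps the word-level cost manageable.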

To verify the practicability and effectiveness of the method, test experiments were carried out. The basic experimental settings are introduced first, including the parameter settings, data sets, comparison methods, and evaluation metrics. The experimental results on cross-language and single-language entity alignment are then presented. Feature analysis is then performed to verify the validity of each module. Finally, a case analysis gives a clearer understanding of the framework of the invention.

Parameter setting and measurement index

For the entity structural features, d_s＝300, the number of training epochs is 300, and five negative examples are generated for each positive example. For the entity name features, the entity name vectors are generated from fastText pre-trained word vectors, and the cross-language word vectors are obtained through MUSE. The fastText vectors are trained with a CBOW model, with dimension 300 (i.e. d_t＝300), character n-gram length 5, window size 5, and 10 negative samples. Based on experiments on the validation set, the hyperparameter α is set to 0.3, θ₁＝0.05, and θ₂＝20. The curricula c₁,…,c_δ are set to 10, 6, 4, 2, 0, with δ＝5. In the word mover's distance model, h＝100.

Hits@k (k＝1, 10) and Mean Reciprocal Rank (MRR) are used as metrics. For each entity in the test set, the entities in the other knowledge graph are ranked in ascending order of their distance D from it. Hits@k is the proportion of test entities whose correct counterpart appears among the first k ranked entities. In particular, Hits@1 represents the accuracy of the alignment. MRR is the mean of the reciprocal ranks of the correct entities. Although Hits@1 is the most important metric, Hits@10 can be regarded as a complement to it: if a method fails to rank the correct entity first but ranks it within the first 10, it is at least better than a method that does not rank the correct entity within the first 10. MRR provides similar complementary information. Note that higher Hits@k and MRR values indicate better results, and that Hits@k is expressed as a percentage in the experiments.
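The two metrics can be computed from a distance matrix as in the sketch below, where `dist[i][j]` is the distance between test entity i and candidate j and `gold[i]` is the index of the correct candidate. The toy matrix, the choice of k values, and the optimistic handling of ties (only strictly closer candidates count against the correct one) are assumptions made for this example.

```python
def evaluate(dist, gold, ks=(1, 10)):
    """Return Hits@k (as percentages) and MRR for a distance matrix."""
    ranks = []
    for row, correct in zip(dist, gold):
        # Rank of the correct candidate: 1 + number of strictly closer candidates.
        ranks.append(1 + sum(1 for value in row if value < row[correct]))
    hits = {k: 100.0 * sum(rank <= k for rank in ranks) / len(ranks) for k in ks}
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    return hits, mrr


dist = [
    [0.1, 0.5, 0.9],  # correct candidate 0 is ranked 1st
    [0.4, 0.3, 0.2],  # correct candidate 1 is ranked 2nd
    [0.6, 0.2, 0.7],  # correct candidate 2 is ranked 3rd
]
gold = [0, 1, 2]
hits, mrr = evaluate(dist, gold, ks=(1, 2))
```

Here the ranks are 1, 2, and 3, so Hits@1 is one third of the entities, Hits@2 is two thirds, and MRR is the average of 1, 1/2, and 1/3.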

Data set and comparison method

This embodiment tests the proposed method on two cross-language entity alignment data sets, EN-FR and EN-DE, and two single-language entity alignment data sets, DBP-WD and DBP-YG. See Table 1 for detailed data set information.

Table 1 data set overview

In addition, the method is compared with the following methods.

MTransE: the first method to perform entity alignment with knowledge graph embeddings (TransE).

IPTransE: adopts an iterative training framework to improve alignment.

BootEA: designs an alignment-oriented knowledge graph embedding method and a bootstrapping strategy.

JAPE: uses attribute information to refine the structural information.

GCN-Align: generates entity vectors with a GCN and combines them with attribute vectors to align entities.

RSNs: adopts a recurrent neural network with residual learning to effectively capture long-distance relational dependencies within and between knowledge graphs.

GM-Align: constructs a local entity graph for each entity to capture more local information. Entity name information is used to initialize the entire framework.

Results of the experiment

Table 2 shows the experimental results. Among the first group of methods (MTransE, IPTransE, BootEA, RSNs), which use only structural information, BootEA and RSNs obtain the better results. This is because BootEA uses a knowledge graph embedding model designed for the entity alignment task, and its bootstrapping strategy further improves the alignment results, while RSNs overcome the limitation of relying only on adjacent structural information by mining long-range dependencies. However, the Hits@1 values of these methods do not exceed 50% on any dataset, revealing the weakness of using only structural features.

The second group of methods supplements the structural features with entity attribute features, but neither JAPE nor GCN-Align achieves better results than the first group, which can be attributed to the limited effectiveness of attribute information. In addition, the structural feature models used by these two methods are inferior to those of BootEA and RSNs.

Table 2 Entity alignment results

The third group of methods uses entity name information and greatly improves alignment over the first group, demonstrating the importance of entity name information, which is particularly helpful for long-tail entities. In addition, compared with GM-Align, the method of the invention achieves an improvement of nearly 20% in Hits@1, and all of its metrics exceed 90%, showing the effectiveness of the overall framework. Results on the monolingual datasets are better than the cross-lingual results, because entity names in a single language are more helpful for judging whether entities are equivalent.

It should be noted that GM-Align does not give alignment results for entities without valid entity name vectors, so it is assumed that GM-Align cannot align these entities. Since the specific ranking results for these entities are unknown, its Hits@10 and MRR values are not provided in Table 2.

Feature analysis

The effectiveness of each proposed component is then analyzed: the basic entity alignment model combining structural information and entity name information (Basic), the basic iterative training framework (Basic + IT), the curriculum learning-based iterative training framework (Basic + IT-CL), and the re-ranking model based on Word Mover's Distance (Basic + IT-CL + WMD). The experimental results are given in Table 3.

It is easy to see that the basic entity alignment model, which combines structural and entity name information, already outperforms methods such as RSNs and GM-Align. This not only reflects the importance of entity name features but also shows that the proposed feature fusion method is superior to previous models. Iterative training further improves all metrics, confirming both the positive effect of augmenting the training data on overall alignment and the effectiveness of the high-confidence entity pair selection method. The curriculum learning strategy raises Hits@1 by more than 2% on the EN-FR and EN-DE datasets, showing that it enables the iterative training model to achieve better results. Its effect on the monolingual datasets is not pronounced, because most entities in those datasets are added to the training data in the first few rounds, so changing the order in which they are added has little influence on the overall result.

Finally, the re-ranking model based on Word Mover's Distance significantly improves Hits@1, especially on the cross-lingual entity alignment datasets. This verifies that further mining entity name information does increase alignment accuracy. At this point, all metrics on all datasets exceed 90% (0.9), demonstrating the superior performance of the method.

Table 3 Feature analysis

This example fully demonstrates that the entity alignment framework proposed by the present invention can effectively combine different features and strategies to improve the accuracy of entity alignment.

The main technical effects of the invention are as follows:

(1) A basic entity alignment framework that fuses structural features and entity name features is designed. On this basis, an iterative training strategy based on curriculum learning is proposed, which changes the way high-confidence entity pairs are added so that the training process is easier to optimize;

(2) A Word Mover's Distance model is used to re-rank the preliminary alignment results, so as to fully mine entity name information and improve alignment accuracy.

To address the scarcity of structural information in real-world knowledge graph datasets, the invention combines entity name information, which is unaffected by the degree of the entity node, with structural information to build a basic entity alignment framework. In addition, noting that insufficient labeled data limits model performance, an iterative training method based on curriculum learning is designed that augments the training data from easy to hard and improves alignment accuracy. Finally, building on the previous two steps, a Word Mover's Distance model is used to further mine entity name information and re-rank the preliminary results to produce the final alignment result. The method of the invention achieves good results in the fusion of several widely used knowledge graphs.

The above embodiment is one implementation of the knowledge graph fusion method, but the implementation of the method is not limited to it; any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the invention shall be regarded as an equivalent substitution and falls within the scope of protection of the invention.

Claims

1. A knowledge graph fusion method based on entity alignment is characterized by comprising the following steps:

step 1, acquiring data of two knowledge maps;

representing triplets in a graph, known entity pairs are represented as

in step 4, entity alignment is performed with an iterative training framework based on curriculum learning, wherein the input of each training iteration is the knowledge graphs to be aligned and the aligned entity pairs, the aligned entity pairs serving as the training set, and the output is the alignment result and an augmented training set; high-confidence entity pairs are obtained and added to the training data for the next round of training; once a high-confidence entity pair from the test set has been added to the training set, it no longer appears in the next test set; the iterative training continues until the number of newly added entity pairs falls below a given threshold θ₂;

a high-confidence entity pair is defined as follows: for each entity e₁ to be aligned in G₁, suppose its nearest entity in G₂ is e₂ and its second-nearest entity is e₂′, with distance difference Δ₁ = D(e₁, e₂′) − D(e₁, e₂); for e₂, if its nearest entity in G₁ is exactly e₁ and its second-nearest entity is e₁′, with distance difference Δ₂ = D(e₂, e₁′) − D(e₂, e₁), and Δ₁ ≥ θ₁ and Δ₂ ≥ θ₁, then (e₁, e₂) is regarded as a high-confidence entity pair, where θ₁ is a preset distance difference threshold.
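By way of a non-limiting sketch, the high-confidence criterion of this claim can be expressed over a precomputed distance matrix as follows. The matrix layout (`D[i][j]` is the distance between entity i of G₁ and entity j of G₂), the function names and the toy values are assumptions made for illustration.

```python
def two_smallest(values):
    """Return (index of the smallest value, smallest value, second-smallest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[0], values[order[0]], values[order[1]]


def high_confidence_pairs(D, theta1):
    """Select pairs (i, j) that are mutual nearest neighbours and whose margin
    to the second-nearest entity is at least theta1 in both directions."""
    pairs = []
    for i, row in enumerate(D):
        # Nearest and second-nearest entity in G2 for entity i of G1.
        j, d_near, d_second = two_smallest(row)
        delta1 = d_second - d_near
        # Nearest and second-nearest entity in G1 for entity j of G2.
        column = [D[r][j] for r in range(len(D))]
        i_back, c_near, c_second = two_smallest(column)
        delta2 = c_second - c_near
        if i_back == i and delta1 >= theta1 and delta2 >= theta1:
            pairs.append((i, j))
    return pairs


# Toy distance matrix (hypothetical values).
D = [
    [0.10, 0.80, 0.90],
    [0.70, 0.20, 0.60],
    [0.90, 0.50, 0.55],
]
pairs = high_confidence_pairs(D, theta1=0.25)
```

Here entity 2 of G₁ is rejected because its two nearest candidates are almost equally distant and its nearest candidate prefers another entity, so only the first two pairs would be added to the training set.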

2. The knowledge graph fusion method according to claim 1, wherein in step 2 two-layer graph convolution networks are used to process the data of the two knowledge graphs respectively and to generate the corresponding entity structure vectors;

in step 3, for entities e₁ ∈ G₁ and e₂ ∈ G₂ of the two knowledge graphs, the structural distance in the structural space is D_s(e₁, e₂) = ||e₁ − e₂||_l1 / d_s, where d_s is the dimension of the structure matrix; the word feature distance is D_t(e₁, e₂) = ||ne(e₁) − ne(e₂)||_l1 / d_t; supposing that the name of an entity e contains the words w₁, w₂, ..., w_p, the entity name vector can be expressed as the average of the corresponding word vectors, i.e.

ne(e) = (1/p)(w₁ + w₂ + ... + w_p)

where, in this formula, w_i denotes the word vector of the word w_i, and d_t is the dimension of the name vector matrix;

the fusion formula of the comprehensive distance in the step 4 is as follows:

D(e₁, e₂) = αD_s(e₁, e₂) + (1 − α)D_t(e₁, e₂)

wherein α is a hyperparameter used to adjust the weights of two features.
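By way of a non-limiting sketch, the distances of this claim can be computed as follows: an L1 distance normalised by the vector dimension for each feature, a name vector obtained by averaging word vectors, and a weighted sum with weight α. The function names, the toy vectors and the dictionary `word_vectors` are assumptions made for illustration.

```python
def l1_distance(u, v):
    """L1 norm of the difference between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def name_vector(words, word_vectors):
    """Entity name vector: the average of the word vectors of the name's words."""
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    for w in words:
        for k, x in enumerate(word_vectors[w]):
            total[k] += x
    return [x / len(words) for x in total]


def composite_distance(struct1, struct2, name1, name2, alpha):
    """D = alpha * D_s + (1 - alpha) * D_t, each term normalised by its dimension."""
    d_s = l1_distance(struct1, struct2) / len(struct1)
    d_t = l1_distance(name1, name2) / len(name1)
    return alpha * d_s + (1 - alpha) * d_t


# Toy inputs (hypothetical): 2-dimensional word vectors, 4-dimensional structure vectors.
word_vectors = {"new": [1.0, 0.0], "york": [0.0, 1.0], "city": [1.0, 1.0]}
name1 = name_vector(["new", "york"], word_vectors)   # average of two word vectors
name2 = name_vector(["city"], word_vectors)
struct1 = [0.0, 1.0, 2.0, 3.0]
struct2 = [0.0, 1.0, 2.0, 1.0]

dist = composite_distance(struct1, struct2, name1, name2, alpha=0.5)
```

With these toy inputs both the structural and the name distance equal 0.5, so the composite distance is 0.5 for any α; real data would make the choice of α matter.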

3. The knowledge graph fusion method according to claim 2, wherein the word feature distance is calculated with a Word Mover's Distance model, which is used to measure the difference between sentences; the Word Mover's Distance is the minimum cumulative distance that the embedding vectors of all words in one entity name must travel to reach the embedding vectors of all words in another entity name.
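By way of a non-limiting sketch, the idea of moving word embeddings from one name to another can be illustrated with the relaxed Word Mover's Distance, a well-known lower bound in which every word simply moves to its nearest word in the other name. This is a swapped-in simplification: the exact Word Mover's Distance requires solving an optimal transport problem. Uniform word weights, Euclidean word distances, the function names and the toy vectors are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(source, target):
    """Average distance from each source word vector to its nearest target word vector."""
    return sum(min(euclidean(s, t) for t in target) for s in source) / len(source)


def relaxed_wmd(name_a, name_b):
    """Relaxed WMD with uniform word weights: the larger of the two one-sided costs.
    It never exceeds the exact Word Mover's Distance."""
    return max(one_sided_cost(name_a, name_b), one_sided_cost(name_b, name_a))


# Toy entity names (hypothetical), each given as a list of 2-dimensional word vectors.
name_a = [[0.0, 0.0], [1.0, 0.0]]
name_b = [[0.0, 0.0], [1.0, 1.0]]

score = relaxed_wmd(name_a, name_b)
```

In this toy case one word is shared and the other must travel a distance of 1, so the relaxed distance is 0.5; identical names give 0.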

4. The method of knowledge-graph fusion of claim 2 or 3 wherein the input to the graph convolution network is a feature matrix of an entity

wherein d^l denotes the dimension of the feature matrix of the l-th layer; for the first layer, H¹ = X and d¹ = P; the first layer output is

Wherein

I is the identity matrix,

is composed of

The diagonal matrix of (a) is,
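By way of a non-limiting sketch, one graph convolution layer can be written as follows, assuming the widely used propagation rule H^(l+1) = ReLU(D^(-1/2) (A + I) D^(-1/2) H^l W^l), where D is the diagonal degree matrix of A + I. The adjacency matrix A, the ReLU activation, the function names and the toy values are assumptions and are not taken from the text of this claim.

```python
import math


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalized_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2), with D the diagonal degree matrix of A + I."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(A_norm, H, W):
    """One layer: propagate features over the graph, apply the weights, then ReLU."""
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, x) for x in row] for row in Z]


# Toy graph (hypothetical): two entities connected by one edge.
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]       # input feature matrix, H^1 = X
W1 = [[1.0, -2.0], [1.0, 1.0]]     # layer weights

A_norm = normalized_adjacency(A)
H2 = gcn_layer(A_norm, X, W1)
```

Stacking two such calls with weight matrices of matching shape gives a two-layer network of the kind referred to in claim 2.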

5. The knowledge graph fusion method according to claim 4, wherein the initial feature matrix X is sampled from an L2-regularized truncated normal distribution and is updated through the training of each GCN layer, so as to fully capture the structural information in the knowledge graph and generate the output feature matrix Z; the dimension of the feature matrix is always set to d_s, i.e., P = F = d^l = d_s, and the two GCNs share the weight matrices W¹ and W² of the two layers.

6. The method of knowledge-graph fusion of claim 5 wherein the training objective is to minimize the loss values:

wherein [x]₊ = max{0, x},
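By way of a non-limiting sketch, a loss built on the hinge [x]₊ = max{0, x} is commonly a margin-based ranking loss that pushes the distance of each known aligned pair below the distance of corrupted pairs by a margin. The exact form below, the L1 distance, the margin `gamma`, the negative-pair lists and the function names are assumptions and are not taken from the text of this claim.

```python
def l1_distance(u, v):
    """L1 norm of the difference between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hinge(x):
    """[x]_+ = max{0, x}."""
    return max(0.0, x)


def margin_ranking_loss(positive_pairs, negative_pairs, gamma):
    """Sum of [d(positive) - d(negative) + gamma]_+ over every positive pair
    and each of its negative (corrupted) pairs."""
    loss = 0.0
    for (e1, e2), negatives in zip(positive_pairs, negative_pairs):
        d_pos = l1_distance(e1, e2)
        for n1, n2 in negatives:
            loss += hinge(d_pos - l1_distance(n1, n2) + gamma)
    return loss


# Toy data (hypothetical): one aligned pair with two corrupted pairs.
positive_pairs = [([0.0, 0.0], [0.0, 1.0])]        # distance 1
negative_pairs = [[([0.0, 0.0], [3.0, 3.0]),       # distance 6, already far enough
                   ([0.0, 0.0], [1.0, 0.0])]]      # distance 1, violates the margin

loss = margin_ranking_loss(positive_pairs, negative_pairs, gamma=2.0)
```

Only the second corrupted pair contributes to the loss, because the first is already farther away than the aligned pair by more than the margin.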

7. The knowledge graph fusion method according to claim 1 or 6, wherein the difficulty of a curriculum is characterized by the degree of the entity node: entities with higher degree have richer structural information and are easier to align, while long-tail entities with low degree are relatively difficult to align; in the iterative training process, easy entity pairs are added first and difficult entity pairs afterwards, so that the model is trained from easy to hard.
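By way of a non-limiting sketch, the easy-to-hard ordering can be organised as a loop over descending degree thresholds, where each stage keeps adding qualifying high-confidence pairs until too few new pairs appear. The callback `find_pairs` stands in for one round of training plus high-confidence pair selection; it, the use of the G₁ entity's degree as the pair's degree, the choice to still add the last small batch before leaving a stage, and all toy values are assumptions.

```python
def run_curriculum(find_pairs, degree, thresholds, theta2):
    """Add high-confidence pairs stage by stage, easiest (highest degree) first.

    find_pairs(train) -> iterable of candidate (e1, e2) pairs for the current round.
    degree[e1]        -> node degree of the G1 entity of a pair.
    thresholds        -> degree thresholds in descending order, one per stage.
    theta2            -> a stage ends when fewer than theta2 new pairs are added (theta2 >= 1).
    """
    train = set()
    history = []
    for c in thresholds:
        while True:
            candidates = find_pairs(train)
            new = sorted(p for p in candidates if p not in train and degree[p[0]] > c)
            train.update(new)
            if new:
                history.append((c, new))
            if len(new) < theta2:
                break
    return train, history


# Toy setting (hypothetical): four candidate pairs and the degrees of their G1 entities.
degree = {0: 10, 1: 6, 2: 2, 3: 1}
all_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]


def find_pairs(train):
    # Stand-in for retraining: always proposes the same candidate pairs.
    return all_pairs


train, history = run_curriculum(find_pairs, degree, thresholds=[5, 0], theta2=1)
```

In this toy run the two high-degree pairs enter the training set in the first stage and the two long-tail pairs only in the second.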

8. The knowledge graph fusion method according to claim 7, wherein δ curricula from easy to hard are assumed, with c₁, …, c_δ denoting a series of entity node degree values in descending order; among the high-confidence entity pairs obtained in each training iteration, only those whose node degree is greater than c₁ are added to the training set, and the iterative training is repeated under this condition until the number of qualifying entity pairs falls below a given threshold θ₂, at which point training at this curriculum difficulty stops;