CN110941722B - Knowledge graph fusion method based on entity alignment - Google Patents


Publication number
CN110941722B
CN110941722B (application CN201910967655.9A)
Authority
CN
China
Prior art keywords
entity
training
knowledge
alignment
matrix
Prior art date
Legal status
Active
Application number
CN201910967655.9A
Other languages
Chinese (zh)
Other versions
CN110941722A (en)
Inventor
赵翔
曾维新
唐九阳
徐浩
谭真
殷风景
葛斌
肖卫东
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201910967655.9A
Publication of CN110941722A
Application granted
Publication of CN110941722B

Classifications

    • G06F 16/367 — Information retrieval; creation of semantic tools, e.g. ontology or thesauri: ontology
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06N 3/045 — Neural networks; architecture; combinations of networks


Abstract

The invention discloses a knowledge graph fusion method based on entity alignment, which comprises the following steps: acquiring data of two knowledge graphs; learning the structure vectors of the entities with a graph convolutional network, and representing entity names as word vectors; calculating a combined distance between entities to represent their degree of similarity; performing entity alignment with an iterative training framework based on curriculum learning; and, according to the entity alignment result, fusing the two knowledge graphs into one knowledge graph. The method designs a basic entity alignment framework that integrates structural features with entity name features; it further designs a curriculum-learning-based iterative training method that augments the training data from easy to difficult, and adopts a word mover's distance model to re-rank the preliminary alignment results, fully mining entity name information so that the fusion of knowledge graphs is more accurate and comprehensive.

Description

Knowledge graph fusion method based on entity alignment
Technical Field
The invention belongs to the field of knowledge graph generation and fusion, and particularly relates to a knowledge graph fusion method based on entity alignment.
Background
In recent years, a large number of knowledge graphs (KGs) have emerged, such as YAGO, DBpedia, NELL, CN-DBpedia, and Zhishi.me. These large-scale knowledge graphs play an important role in intelligent services such as question-answering systems and personalized recommendation. In addition, to meet domain-specific needs, more and more domain knowledge graphs, such as medical knowledge graphs, are being built. In the knowledge graph construction process, a trade-off between coverage and accuracy is inevitable, and no knowledge graph can be complete or entirely correct.
To improve the coverage and accuracy of a knowledge graph, one possible method is to introduce relevant knowledge from other knowledge graphs, because redundant and complementary knowledge exists among knowledge graphs constructed in different ways. For example, a general knowledge graph extracted from web pages may contain only the name of a drug, while more information may be found in a medical knowledge graph built from medical data. To integrate knowledge from an external knowledge graph into a target knowledge graph, the most important step is to align the knowledge graphs. The entity alignment (EA) task aims to find pairs of entities in different knowledge graphs that express the same meaning; these entity pairs serve as hubs linking the knowledge graphs for subsequent tasks.
At present, mainstream entity alignment methods mainly judge whether two entities refer to the same thing by means of the structural features of the knowledge graph. Such methods assume that entities expressing the same meaning in different knowledge graphs have similar neighborhood information. On artificially constructed datasets, this type of method achieves the best experimental results. However, recent work has indicated that these manually constructed datasets are much denser than real-world knowledge graphs, and that structure-based entity alignment methods are far less effective on knowledge graphs with realistic degree distributions.
In fact, analysis of the entity distribution in real-world knowledge graphs shows that more than half of all entities are connected to only one or two other entities. These entities, called long-tail entities, make up most of the entities in a knowledge graph, so the graph as a whole is highly sparse. This matches intuition about real-world knowledge: only a few entities are frequently used and have rich adjacency information, while most entities are rarely mentioned and contain little structural information. Therefore, current entity alignment methods based on structural information do not perform well on real-world datasets.
In addition, the lack of annotated data greatly limits the effectiveness of entity alignment. To map the representation vectors of different knowledge graphs into the same space, enough annotated data is needed as a link, but the number of known entity pairs is limited. To address this, some methods adopt iterative training (IT), selecting high-confidence entity pairs from the test-set results to use in the next round of training, but they easily introduce erroneous samples and are inefficient. Moreover, on datasets with real-world distributions, these iterative training frameworks can introduce only a small number of high-confidence pairs and bring no obvious improvement.
Disclosure of Invention
In view of this, the present invention provides a knowledge graph fusion method based on entity alignment that overcomes the shortcomings of the prior art. It identifies and aligns identical or similar entities across multiple knowledge graphs, thereby fusing their knowledge and improving the coverage and accuracy of the knowledge graphs.
Based on the above purpose, a knowledge graph fusion method based on entity alignment comprises the following steps:
step 1, acquiring data of two knowledge graphs;
step 2, learning the structure vector of the entity by using a graph convolution network; representing the names of the entities as word vectors;
step 3, calculating the combined distance between entities to represent their degree of similarity;
step 4, performing entity alignment with an iterative training framework based on curriculum learning;
and step 5, fusing the two knowledge graphs into one knowledge graph according to the entity alignment result.
The two knowledge graphs are represented as G1 = (E1, R1, T1) and G2 = (E2, R2, T2), where E denotes the entity set, R the relation set, and T ⊆ E × R × E the set of triples in the graph. The known entity pairs are represented as S = {(e1, e2) | e1 ∈ E1, e2 ∈ E2, e1 = e2}. The entity alignment task aims to use the known entity pair information to find new entity pairs and generate the final alignment result P = {(e1, e2) | e1 ∈ E1, e2 ∈ E2, e1 = e2}, where the equality sign indicates that the two entities point to the same real-world entity;
In step 4, entity alignment is performed with the curriculum-learning-based iterative training framework. In this framework, the input of each round of iterative training is the knowledge graphs to be aligned and the aligned entity pairs (the training set), and the output is the alignment result and the augmented training set; high-confidence entity pairs are acquired and added to the training data for the next round. When a high-confidence entity pair from the test set is added to the training set, it no longer appears in the next round's test set, and iterative training continues until the number of newly added entity pairs falls below the given threshold θ2.
A high-confidence entity pair is defined as follows: for each entity e1 to be aligned in G1, let e2 be the nearest entity in G2 and e2′ the second nearest, with distance gap Δ1 = D(e1, e2′) − D(e1, e2). Conversely, for e2, if the nearest entity in G1 is exactly e1 and the second nearest is e1′, with distance gap Δ2 = D(e2, e1′) − D(e2, e1), and Δ1 ≥ θ1 and Δ2 ≥ θ1, then (e1, e2) is considered a high-confidence entity pair, where θ1 is a preset distance-gap threshold.
Specifically, in step 2, two two-layer graph convolutional networks are used to process the two knowledge graphs and generate the corresponding entity structure vectors;
For entities e1 ∈ G1 and e2 ∈ G2 in step 3, the structural distance in structure space is Ds(e1, e2) = ||e1 − e2||L1 / ds, where ds is the structure matrix dimension; the name feature distance is Dt(e1, e2) = ||ne(e1) − ne(e2)||L1 / dt. Suppose the name of entity e contains the words w1, w2, …, wp; then the entity name vector can be represented as the average of these word vectors, ne(e) = (1/p) Σi wi, where wi is the word vector of word wi and dt is the name vector matrix dimension;
The fusion formula for the combined distance in step 4 is:
D(e1, e2) = α·Ds(e1, e2) + (1 − α)·Dt(e1, e2)
where α is a hyperparameter used to adjust the weights of the two features.
Preferably, the name feature distance is computed with a word mover's distance model, which measures the difference between two word sequences: the word mover's distance is the minimum total distance that the embedded vectors of all words in one entity name must travel to reach the embedded vectors of all words in the other entity name.
Specifically, the input of the graph convolutional network is the entity feature matrix X ∈ R^(N×P) and the adjacency matrix A of the graph, and the output is a feature matrix with structural information Z ∈ R^(N×F), where N is the number of nodes in the graph and P and F are the input and output feature dimensions, respectively. Suppose the input of the l-th layer is the node feature matrix H^l ∈ R^(N×dl), where dl is the feature dimension of the l-th layer; for the first layer, H^1 = X and d1 = P. The output of the l-th layer is
H^(l+1) = σ(D̂^(−1/2) Â D̂^(−1/2) H^l W^l)
where Â = A + I, I is the identity matrix, D̂ is the diagonal degree matrix of Â, W^l ∈ R^(dl×d(l+1)) is the parameter matrix of the l-th layer, d(l+1) is the feature dimension of the next layer, and the activation function σ is typically set to ReLU. For the last layer, H^(l+1) = Z and d(l+1) = F.
Specifically, the initial feature matrix X is sampled from an L2-normalized truncated normal distribution and updated during the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix Z is generated; the feature dimension is kept at ds throughout, i.e. P = F = dl = ds, and the two GCNs share the layer parameter matrices W1 and W2.
Specifically, the training objective is to minimize the following loss:
L = Σ_{(e1,e2)∈S} Σ_{(e1′,e2′)∈S′(e1,e2)} [ ||e1 − e2||L1 + γ − ||e1′ − e2′||L1 ]+
where [x]+ = max{0, x}, S′(e1,e2) denotes the negative sample set generated from the known pair (e1, e2) by replacing e1 or e2 with a random entity, e denotes the structure vector of entity e, and γ is the margin separating positive from negative samples; the model is optimized with stochastic gradient descent.
Specifically, the difficulty of a course can be characterized by entity node degree: entities with higher degree have richer structural information and are easier to align, while low-degree long-tail entities are relatively difficult. During iterative training, easy entity pairs are added first and difficult ones later, so the model is trained from easy to hard.
Specifically, suppose there are δ courses from easy to hard, with c1, …, cδ denoting a series of entity node degree values in decreasing order. In each round of iterative training, only the high-confidence entity pairs whose node degree is greater than c1 are added to the training set, and iterative training loops under this condition until the number of qualifying new pairs falls below the given threshold θ2, at which point training at this course difficulty stops;
In the next stage, the course difficulty is adjusted: pairs with degree greater than c2 are selected from the high-confidence pairs and added to the training set, and loop iterative training continues at this difficulty until the number of qualifying new pairs falls below θ2, at which point this course stops. Finally, the process is repeated over the remaining course difficulties c3, …, cδ.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) A basic entity alignment framework that fuses structural features and entity name features is designed. Because entity names and structural information complement each other, and entity names are unaffected by node degree, this basic framework can greatly improve the alignment of long-tail entities and thus the overall alignment quality.
(2) On the basic entity alignment framework, an iterative training strategy based on curriculum learning (CL) is designed, which significantly improves entity alignment while preserving training efficiency. Inspired by curriculum learning, the entity node degree is used as the difficulty measure: high-degree entities are treated as easy courses and long-tail entities as difficult ones, and high-confidence entity pairs are added to the training set from easy to difficult. This optimizes the iterative training procedure, improves the accuracy of the structural feature representation, and makes it easier for model training to reach the optimum.
(3) A re-ranking model based on the Word Mover's Distance (WMD) is designed: on the entity ranking produced in the preceding steps, the word mover's distance model further mines entity name information and combines it with structural information to optimize the entity alignment result.
Drawings
Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
As shown in fig. 1, a knowledge-graph fusion method based on entity alignment includes the following steps:
step 1, acquiring data of two knowledge graphs;
step 2, learning the structure vector of the entity by using a graph convolution network; representing the names of the entities as word vectors;
step 3, calculating the combined distance between entities to represent their similarity;
step 4, performing entity alignment with an iterative training framework based on curriculum learning;
and step 5, fusing the two knowledge graphs into one knowledge graph according to the entity alignment result.
For a better understanding of the present disclosure, the meanings of the symbols are given below.
Hl: structural feature matrix of layer l; N: number of nodes; X: initial structural feature matrix; dl: feature dimension of layer l; Z: final structural feature matrix; S: known entity pairs; A: adjacency matrix; ds: structure matrix dimension; Wl: parameter matrix of layer l; Ds: entity distance in structure space; e: structure vector of entity e; dt: name vector matrix dimension; P: initial structural feature dimension; F: final structural feature dimension; N: name vector matrix of all entities; D: combined distance between entities; G1: knowledge graph 1 to be aligned; G2: knowledge graph 2 to be aligned; e1: an entity in G1; e2: an entity in G2; Δ1: distance gap between the two entities nearest to e1; Δ2: distance gap between the two entities nearest to e2; θ1: distance-gap threshold; θ2: threshold on the number of newly added entity pairs.
A formal description of the entity alignment problem: given two knowledge graphs G1 = (E1, R1, T1) and G2 = (E2, R2, T2), where E denotes the entity set, R the relation set, and T ⊆ E × R × E the triples in the graph, the known entity pairs are represented as S = {(e1, e2) | e1 ∈ E1, e2 ∈ E2, e1 = e2}. The entity alignment task aims to use the known entity pair information to find new entity pairs and generate the final alignment result P = {(e1, e2) | e1 ∈ E1, e2 ∈ E2, e1 = e2}, where the equality sign indicates that the two entities point to the same real-world entity. Given an entity, the process of finding its corresponding entity in the other knowledge graph can be regarded as a ranking problem: under a given feature space, the similarity (distance) between the given entity and every entity in the other knowledge graph is computed to produce a ranking, and the entity with the highest similarity (smallest distance) is taken as the alignment result.
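The ranking-query view above can be sketched in a few lines. This is an illustrative sketch (function and variable names are ours, not from the patent), assuming entity vectors are rows of a matrix and distances are normalized L1 distances:

```python
import numpy as np

def rank_candidates(query_vec, candidate_matrix):
    """Rank all candidate entities of the other knowledge graph by
    normalized L1 distance to the query entity, smallest first."""
    d = candidate_matrix.shape[1]
    dists = np.abs(candidate_matrix - query_vec).sum(axis=1) / d
    order = np.argsort(dists)   # order[0] is the best-matching candidate
    return order, dists
```

In this view, alignment of one entity is simply `order[0]`, and the full ranking is what the later re-ranking step operates on.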
Taking the medical knowledge graph as an example: to obtain more medical knowledge, several independent medical knowledge graphs can be fused, and to fuse them well, the entities in each graph need to be identified, including drug names, disease names, and symptom names. These three types are the most basic entities of a medical knowledge graph; aligning them meets its most basic requirements, and the extraction of other entities can be determined by actual needs.
This embodiment uses a graph convolutional network (GCN) to capture entity adjacency structure information and generate entity structure representation vectors. The GCN is a convolutional network that operates directly on graph-structured data, generating node structure vectors by capturing the structural information around each node. The input of the GCN is the entity feature matrix X ∈ R^(N×P) and the adjacency matrix A of the graph; the output is a feature matrix with structural information Z ∈ R^(N×F), where N is the number of nodes in the graph and P and F are the input and output feature dimensions, respectively.
A GCN model typically contains multiple GCN layers. Specifically, suppose the input of the l-th layer is the node feature matrix H^l ∈ R^(N×dl), where dl is the feature dimension of the l-th layer (for the first layer, H^1 = X and d1 = P). The output of the l-th layer is
H^(l+1) = σ(D̂^(−1/2) Â D̂^(−1/2) H^l W^l)
where Â = A + I, I is the identity matrix, D̂ is the diagonal degree matrix of Â, and W^l ∈ R^(dl×d(l+1)) is the parameter matrix of the l-th layer, with d(l+1) the feature dimension of the next layer. The activation function σ is typically set to ReLU. For the last layer, H^(l+1) = Z and d(l+1) = F.
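A minimal numpy sketch of one propagation layer as defined above; it illustrates the formula only and is not the patented training code:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN propagation step: ReLU(D^(-1/2) A_hat D^(-1/2) H W),
    where A_hat = A + I adds self-loops and D is A_hat's degree matrix."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^(-1/2)
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)  # ReLU
```

Stacking two such layers with shared parameter matrices W1 and W2, as described below, yields the structure vectors Z.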
In this embodiment, two two-layer GCNs are constructed, each processing one knowledge graph and generating the corresponding entity vectors. The initial feature matrix X is sampled from an L2-normalized truncated normal distribution and updated during the training of each GCN layer, so as to fully capture the structural information in the knowledge graph and produce the output feature matrix Z (P = F = dl = ds), and the two GCNs share the layer parameter matrices W1 and W2.
The entity structure vectors of different knowledge graphs do not lie in the same space, so the known entity pairs S must be used to align them into one space. The specific training objective is to minimize the following loss:
L = Σ_{(e1,e2)∈S} Σ_{(e1′,e2′)∈S′(e1,e2)} [ ||e1 − e2||L1 + γ − ||e1′ − e2′||L1 ]+
where [x]+ = max{0, x}, S′(e1,e2) denotes the negative sample set generated from the known pair (e1, e2) by replacing e1 or e2 with a random entity, e denotes the structure vector of entity e, and γ is the margin separating positive from negative samples. Model optimization uses stochastic gradient descent.
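The margin-based loss can be sketched as follows, under the simplifying assumption that each positive pair is matched with a single pre-generated negative pair (the embodiment samples several negatives per positive):

```python
import numpy as np

def margin_loss(pos_pairs, neg_pairs, emb1, emb2, gamma=1.0):
    """Margin ranking loss: sum of [d(pos) + gamma - d(neg)]_+ over
    (positive, negative) pair couples; d is the L1 distance between
    the structure vectors of the two knowledge graphs."""
    loss = 0.0
    for (i, j), (i2, j2) in zip(pos_pairs, neg_pairs):
        d_pos = np.abs(emb1[i] - emb2[j]).sum()
        d_neg = np.abs(emb1[i2] - emb2[j2]).sum()
        loss += max(0.0, d_pos + gamma - d_neg)   # [x]_+ = max{0, x}
    return loss
```

The loss is zero once every positive pair is closer than its negative by at least the margin γ, which is exactly the condition the formula above encodes.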
Given the final structural feature matrix Z, the distance between e1 ∈ G1 and e2 ∈ G2 in structure space is:
Ds(e1, e2) = ||e1 − e2||L1 / ds
If only structural features are considered, the entity with the smallest distance Ds to the target entity e is taken as the corresponding entity of e.
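The structural distance Ds is then a one-liner over rows of Z (a sketch; names are ours):

```python
import numpy as np

def structural_distance(z1, z2):
    """Ds(e1, e2) = ||z1 - z2||_L1 / d_s: normalized L1 distance of two
    structure vectors taken from the final feature matrix Z."""
    return np.abs(z1 - z2).sum() / z1.shape[0]
```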
Unlike the prior art, this embodiment additionally exploits textual features, specifically the entity name, because 1) entity names are widely available and commonly used to identify entities; 2) comparing entity names gives an intuitive judgment of whether two entities are the same; and 3) the name feature is unaffected by the size of the training set and is therefore more stable.
Although conventional string comparison methods could measure the similarity of two entity names, this embodiment uses the semantic similarity of entity names, since it remains applicable when the knowledge graphs differ greatly, e.g. in multilingual knowledge graph alignment. Specifically, the average word vector is used as the entity name vector because it is simple and general and can express semantic information without a specialized corpus. Suppose the name of entity e contains the words w1, w2, …, wp; then the entity name vector can be represented as the average of these word vectors, ne(e) = (1/p) Σi wi, where wi is the word vector of word wi. The name vectors of all entities form the matrix N.
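Computing ne(e) as the average word vector can be sketched as follows; the word-vector lookup dictionary stands in for the fastText vectors used in the experiments:

```python
import numpy as np

def name_vector(words, word_vecs):
    """Entity-name embedding ne(e) = (1/p) * sum_i w_i, the average of
    the word vectors of the p words in the entity name."""
    vecs = [word_vecs[w] for w in words]
    return np.mean(vecs, axis=0)
```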
As with word vectors, similar entity names lie close together in vector space. For e1 ∈ G1 and e2 ∈ G2, the distance in the text feature space is Dt(e1, e2) = ||ne(e1) − ne(e2)||L1 / dt. If only the name feature is considered, the entity with the smallest distance Dt to the target entity e is taken as the corresponding entity of e. For cross-lingual entity alignment, pre-trained cross-lingual word vectors can be used, ensuring that entity name vectors across languages lie in the same space.
Given that structural and name features delineate entities from two distinct aspects, structural and semantic, respectively, they can be further combined to provide a more comprehensive alignment cue. In particular, two entities e1∈G1And e2∈G2The distance between them is:
D(e1,e2)=αDs(e1,e2)+(1-α)Dt(e1,e2)
where α is the hyperparameter used to adjust the weights of the two features. In the fused feature space, the entity with the smallest distance D to the target entity e is regarded as the corresponding entity of e.
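The feature fusion reduces to a weighted sum; a trivial sketch using the α of the formula above (α = 0.3 is the value reported in the experiments section):

```python
def combined_distance(d_struct, d_text, alpha=0.3):
    """D(e1, e2) = alpha * Ds(e1, e2) + (1 - alpha) * Dt(e1, e2)."""
    return alpha * d_struct + (1 - alpha) * d_text
```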
The amount of labeled data is limited, so the vectors of different knowledge graphs cannot be mapped effectively into the same space, which limits the effect of entity alignment. This embodiment therefore proposes to add high-confidence alignment results to the next round of training data from easy to difficult, iteratively enlarging the training set and improving the alignment results. The basic iterative training framework is introduced first, followed by how the idea of curriculum learning is applied to it to optimize the training effect.
The input of each round of iterative training is the knowledge graphs to be aligned and the aligned entity pairs (the training set); the output is the alignment result and the augmented training set. The simplest augmentation method is: for each entity e1 to be aligned in G1, let e2 be its nearest entity in G2; if, conversely, the nearest entity to e2 in G1 is exactly e1, then (e1, e2) can be regarded as a high-confidence pair and added to the training data. However, this process inevitably introduces some erroneous entity pairs, which harm subsequent training; once a wrong pair has been added, it is difficult to re-examine its correctness or remove it from the training data.
To this end, this embodiment proposes a simple method that greatly reduces the probability of introducing erroneous entity pairs. For each entity e1 to be aligned in G1, let e2 be the nearest entity in G2 and e2′ the second nearest, with distance gap Δ1 = D(e1, e2′) − D(e1, e2); conversely, for e2, if the nearest entity in G1 is exactly e1 and the second nearest is e1′, with gap Δ2 = D(e2, e1′) − D(e2, e1), and Δ1 ≥ θ1 and Δ2 ≥ θ1, then (e1, e2) is regarded as a high-confidence pair and added to the training data for the next round. This criterion is stricter: the two entities must be mutual nearest neighbours, and each must be separated from its second-nearest candidate by a clear margin, which to some extent guarantees the correctness of newly added pairs. Iterative training continues until the number of newly added entity pairs falls below the given threshold θ2.
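The mutual-nearest-neighbour selection with the θ1 margin test can be sketched over a precomputed distance matrix (an illustrative sketch; the patent does not prescribe this data layout):

```python
import numpy as np

def high_confidence_pairs(D, theta1=0.05):
    """Select pairs (i, j) such that i and j are mutual nearest
    neighbours under the combined distance matrix D (shape |E1| x |E2|)
    AND the gap to the second-nearest candidate is >= theta1 on both sides."""
    pairs = []
    for i in range(D.shape[0]):
        row = np.argsort(D[i])
        j, j2 = row[0], row[1]           # nearest / second nearest in G2
        delta1 = D[i, j2] - D[i, j]
        col = np.argsort(D[:, j])
        i_back, i2 = col[0], col[1]      # nearest / second nearest in G1
        delta2 = D[i2, j] - D[i_back, j]
        if i_back == i and delta1 >= theta1 and delta2 >= theta1:
            pairs.append((i, int(j)))
    return pairs
```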
It should be noted that in the iterative training framework designed in this embodiment, when the high-confidence entity pairs in the test set are added to the training set, the high-confidence entity pairs will not appear in the next test set, i.e. the number of entities in the test set is continuously reduced. This can improve the alignment effect of the remaining entities in the test set to some extent, because the number of candidate entities is greatly reduced compared to the original. Experimental results show that the iterative training framework provided by the invention can bring better effect.
The main idea of curriculum learning is to imitate human learning: by learning from easy to hard, the model finds a good optimum more easily and training is accelerated. In the entity alignment task, course difficulty can be characterized by entity node degree: high-degree entities have richer structural information and are easier to align, while low-degree long-tail entities are relatively difficult. Therefore, during iterative training, easy entity pairs are added first and difficult ones later, so that the model is trained from easy to hard and more readily reaches the optimum.
Specifically, suppose there are δ courses from easy to hard, with c1, …, cδ denoting a series of entity node degree values in decreasing order. In each round of iterative training, only the high-confidence entity pairs whose node degree is greater than c1 are added to the training set, and iterative training loops under this condition until the number of qualifying new pairs falls below the given threshold θ2, at which point training at this course difficulty stops.
In the next stage, the course difficulty is adjusted: pairs with degree greater than c2 are selected from the high-confidence pairs and added to the training set, and loop iterative training continues at this difficulty until the number of qualifying new pairs falls below θ2, at which point this course stops. Finally, the process is repeated over the remaining course difficulties c3, …, cδ.
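The curriculum loop described above can be sketched as a wrapper around an opaque training round; `train_round` and `degree_of` are hypothetical placeholders for the embodiment's actual retraining procedure and degree lookup:

```python
def curriculum_iterative_training(courses, theta2, train_round, degree_of):
    """Curriculum-learning wrapper over iterative training.
    courses: descending degree thresholds c1..c_delta (easy -> hard);
    train_round(train_set): retrains and returns current high-confidence pairs;
    degree_of(pair): node degree used as the difficulty measure."""
    train_set = set()
    for c in courses:                          # traverse courses easy -> hard
        while True:
            candidates = train_round(train_set)
            new_pairs = {p for p in candidates
                         if degree_of(p) > c and p not in train_set}
            if len(new_pairs) < theta2:        # too few additions: next course
                break
            train_set |= new_pairs
    return train_set
```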
Curriculum-learning-based iterative training generates more accurate entity representation vectors by optimizing how high-confidence entity pairs are added, further improving the alignment effect.
The curriculum-learning-based iterative training framework greatly improves the accuracy of entity alignment. On this basis, entity name information is further mined: a word mover's distance model is adopted to re-rank the preliminary results and optimize the entity alignment effect.
The word mover's distance model measures the difference between sentences as the minimum total distance that the embedded vectors of all words in one sentence must travel to reach the embedded vectors of the words in the other. Compared with the distance between averaged word vectors, the word mover's distance better captures the contribution of each word to the whole sentence and avoids the semantic loss caused by averaging. However, because it requires word-level distance computations, the model is time-consuming and ill-suited to large-scale data. It is therefore not used to compute entity name distances from the outset; instead, it is used only to re-rank the preceding results.
Specifically, after curriculum-based iterative training finishes, for each entity to be aligned in the test set, the h closest entities in the other knowledge graph are retained and fed into the word mover's distance model, which recomputes the distances in the entity name space. The updated name distances are then combined with the fusion formula to obtain new inter-entity distances and a re-ranked alignment result.
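As an illustrative sketch of this re-ranking step: the `relaxed_wmd` below is the cheap lower bound of the exact word mover's distance (each word moves wholly to its nearest counterpart) rather than the full optimal-transport solution, and all names and vectors here are hypothetical.

```python
import numpy as np

def relaxed_wmd(vecs_a, vecs_b):
    """Relaxed word mover's distance between two entity names.

    vecs_a, vecs_b: arrays of shape (num_words, dim) holding the word
    embeddings of each name. Each word travels entirely to its nearest
    counterpart, giving a lower bound of the exact transport cost.
    """
    dist = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=-1)
    a_to_b = dist.min(axis=1).mean()     # every word of a to its closest in b
    b_to_a = dist.min(axis=0).mean()     # and vice versa
    return max(a_to_b, b_to_a)           # symmetric bound

def rerank(query, candidates, name_vecs):
    """Re-rank the h nearest candidates of `query` by name distance."""
    scored = [(c, relaxed_wmd(name_vecs[query], name_vecs[c]))
              for c in candidates]
    return [c for c, _ in sorted(scored, key=lambda s: s[1])]
```

In practice the re-ranked name distances would then be plugged back into the fusion formula, as the text describes.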
To verify the practicality and effectiveness of the method, test experiments were carried out. The basic experimental settings are introduced first, covering parameter settings, datasets, compared methods, and evaluation metrics. Experimental results on both cross-lingual and mono-lingual entity alignment are then presented, followed by a feature analysis verifying the contribution of each module. Finally, a case study gives a clearer picture of the proposed framework.
Parameter settings and evaluation metrics
For the entity structure features, ds = 300, training runs for 300 rounds, and five negative samples are generated for each positive sample. For the entity name features, entity name vectors are built from fastText pre-trained word vectors, with cross-lingual word vectors obtained through MUSE. The fastText vectors are trained with the CBOW model, with dimension 300 (i.e., dt = 300), character length 5, window size 5, and a negative sampling ratio of 10. The hyper-parameter α is set to 0.3 via validation-set experiments. For the curriculum-learning-based iterative training framework, θ1 = 0.05 and θ2 = 20; c1, …, cδ are 10, 6, 4, 2, 0, with δ = 5. In the word mover's distance model, h = 100.
Hits@k (k = 1, 10) and mean reciprocal rank (MRR) are used as metrics. For each entity in the test set, the entities in the other knowledge graph are ranked from low to high by their distance D to that entity. Hits@k reflects the proportion of cases in which the correct entity appears among the top k; in particular, Hits@1 represents alignment accuracy. MRR is the average of the reciprocal ranks of the correct entities. Although Hits@1 is the most important metric, Hits@10 complements it: if one method fails to rank the correct entity first but places it within the top 10, it is still preferable to a method that does not place it in the top 10 at all. MRR provides similar complementary information. Note that higher Hits@k and MRR indicate better results; Hits@k is reported as a percentage.
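Computed from the 1-based rank of each test entity's correct match, both metrics are straightforward; a minimal sketch:

```python
def hits_at_k(ranks, k):
    """Share of test entities whose correct match ranks in the top k (in %)."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean of the reciprocal ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, ranks [1, 2, 10, 20] give Hits@1 = 25.0, Hits@10 = 75.0, and MRR = 0.4125.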
Datasets and compared methods
The present embodiment tests the proposed method on two cross-lingual entity alignment datasets, EN-FR and EN-DE, and two mono-lingual entity alignment datasets, DBP-WD and DBP-YG. Detailed dataset information is given in Table 1.
Table 1 data set overview
In addition, the following methods are compared.
MTransE: the first method proposed for entity alignment using knowledge graph embedding (TransE).
IPTransE: adopts an iterative training framework to improve the alignment results.
BootEA: designs an alignment-oriented knowledge graph embedding method and a bootstrapping strategy.
JAPE: uses attribute information to refine structural information.
GCN-Align: generates entity vectors with a GCN and combines them with attribute vectors to align entities.
RSNs: uses a recurrent neural network with residual learning to effectively capture long-range relational dependencies within and across knowledge graphs.
GM-Align: builds a local entity graph for each entity to capture more local information, and uses entity name information to initialize the whole framework.
Experimental results
Table 2 shows the experimental results. Among the first group of methods (MTransE, IPTransE, BootEA, RSNs), which use only structural information, BootEA and RSNs obtain the best results. BootEA benefits from a knowledge graph representation designed specifically for the entity alignment task and from its bootstrapping strategy, while RSNs overcome the limitation of purely local neighborhood information by mining long-range dependencies, improving the overall alignment. However, the Hits@1 values of these methods do not exceed 50% on any dataset, revealing the shortcoming of using only structural features.
The second group of methods supplements structural features with entity attribute features, yet neither JAPE nor GCN-Align outperforms the first group, which can be attributed to the limited utility of attribute information; moreover, the structural models used by both methods are weaker than BootEA and RSNs.
Table 2 entity alignment results
The third group of methods exploits entity name information and greatly improves the alignment over the first group, confirming the importance of entity names, especially for long-tail entities. Moreover, compared with GM-Align, the proposed method achieves nearly 20% improvement on the Hits@1 metric, with all metrics above 90%, demonstrating the effectiveness of the overall framework. Results on the mono-lingual datasets are better than the cross-lingual ones because entity name information within a single language is more helpful for judging the equivalence of entities.
Note that GM-Align gives no alignment result for entities without a valid entity name vector; it is therefore assumed that GM-Align cannot align these entities. Since the exact rankings of these entities are unknown, the Hits@10 and MRR values of GM-Align are not reported in Table 2.
Feature analysis
The effectiveness of the proposed components is then analyzed, including the basic entity alignment model combining structure and entity name information (Basic), the basic iterative training framework (Basic+IT), the curriculum-learning-based iterative training framework (Basic+IT-CL), and the word mover's distance re-ranking model (Basic+IT-CL+WMD). Detailed experimental results are given in Table 3.
The basic entity alignment model combining structure and entity name information already outperforms RSNs, GM-Align, and the other baselines, which both underscores the importance of entity name features and shows that the proposed feature fusion is superior to previous models. Iterative training further improves all metrics, confirming the positive effect of augmenting the training data and the effectiveness of the high-confidence pair selection method. The curriculum learning strategy brings a Hits@1 improvement of over 2% on the EN-FR and EN-DE datasets, showing that it helps the iterative training model reach a better optimum. Its effect on the mono-lingual datasets is less pronounced because most entities there are added to the training data within the first few rounds, so changing the order of addition has little impact on the overall result.
Finally, the word mover's distance re-ranking model yields a significant improvement in Hits@1, especially on the cross-lingual datasets, verifying that further mining of entity name information does increase alignment accuracy. At this point all metrics on all datasets exceed 90% (0.9), demonstrating the superiority of the method.
Table 3 Feature analysis results
This example fully demonstrates that the proposed entity alignment framework can effectively combine different features and strategies to improve the accuracy of entity alignment.
The main technical effects of the invention are as follows:
(1) An entity alignment basic framework fusing structural features and entity name features is designed. On this basis, a curriculum-learning-based iterative training strategy is proposed that changes how high-confidence entity pairs are added, making the training process easier to optimize;
(2) A word mover's distance model re-ranks the preceding alignment results so as to fully mine entity name information and improve alignment accuracy.
To address the scarcity of structural information in real-world knowledge graph datasets, the invention combines entity name information, which is unaffected by node degree, with structural information to build a basic entity alignment framework. Noting that insufficient labeled data limits model performance, a curriculum-learning-based iterative training method is designed that augments the training data from easy to hard and improves alignment accuracy. Finally, building on these two steps, a word mover's distance model further mines entity name information and re-ranks the preceding results to produce the final alignment. The method achieves good results in fusion applications over several widely used knowledge graphs.
The above examples are one implementation of the knowledge graph fusion method, but the implementation is not limited to them; any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the invention is regarded as an equivalent substitution and falls within the scope of the invention.

Claims (3)

1. A knowledge graph fusion method based on entity alignment is characterized by comprising the following steps:
step 1, acquiring data of two knowledge graphs;
step 2, learning the structure vector of the entity by using a graph convolution network; representing the names of the entities as word vectors;
step 3, calculating the comprehensive distance between the entities to express the similarity degree between the entities;
step 4, performing entity alignment by adopting a curriculum-learning-based iterative training framework;
step 5, according to the entity alignment result, fusing the two knowledge graphs into one knowledge graph;
the two knowledge maps are represented as G1=(E1,R1,T1) And G2=(E2,R2,T2) Wherein E represents an entity, R represents a relationship,
Figure FDA0003643560070000011
representing triplets in a graph, known entity pairs are represented as
Figure FDA0003643560070000012
In the step 4, a curriculum-learning-based iterative training framework is adopted for entity alignment; in this framework, the input of each training iteration is the knowledge graphs to be aligned and the aligned entity pairs, the aligned entity pairs serving as the training set, and the output is an alignment result and an augmented training set; high-confidence entity pairs are obtained and added to the training data for the next round of training; once high-confidence entity pairs from the test set are added to the training set, they no longer appear in the next round's test set, and the iterative training continues until the number of newly added entity pairs falls below a given threshold θ2;
The high-confidence entity pairs are determined as follows: for each entity e1 of G1 to be aligned, suppose the nearest entity in G2 is e2 and the second nearest is e2′, with distance gap Δ1 = D(e1, e2′) − D(e1, e2); and for e2, if the nearest entity in G1 is exactly e1 and the second nearest is e1′, with distance gap Δ2 = D(e2, e1′) − D(e2, e1), and Δ1 ≥ θ1 and Δ2 ≥ θ1, then (e1, e2) is regarded as a high-confidence entity pair, where θ1 is a preset distance-gap threshold;
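A minimal sketch of this selection rule over a precomputed distance matrix (using entity indices in place of the actual entities; the matrix layout is an assumption):

```python
import numpy as np

def high_confidence_pairs(D, theta1):
    """Select high-confidence pairs from a distance matrix.

    D[i, j] is the combined distance D(e1, e2) between entity i of G1
    and entity j of G2; theta1 is the minimum gap required between the
    nearest and second-nearest entity on both sides.
    """
    pairs = []
    for i in range(D.shape[0]):
        j, j2 = np.argsort(D[i])[:2]
        delta1 = D[i, j2] - D[i, j]           # gap seen from e1's side
        i_best, i2 = np.argsort(D[:, j])[:2]
        delta2 = D[i2, j] - D[i_best, j]      # gap seen from e2's side
        # mutual nearest neighbours with a clear margin on both sides
        if i_best == i and delta1 >= theta1 and delta2 >= theta1:
            pairs.append((i, j))
    return pairs
```

Raising theta1 trades recall for precision: only pairs that are mutual nearest neighbours by a wide margin survive.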
in the step 2, two two-layer graph convolution networks are used to process the data of the two knowledge graphs and generate the corresponding entity structure vectors respectively;
in the step 3, for entities e1 ∈ G1 and e2 ∈ G2 of the two knowledge graphs, the structural distance in the structure space is Ds(e1, e2) = ‖e1 − e2‖L1/ds, where ds is the structure matrix dimension; the word feature distance is Dt(e1, e2) = ‖ne(e1) − ne(e2)‖L1/dt; assuming entity e includes the words w1, w2, …, wp in its name, the entity name vector is the average of these word vectors, i.e. ne(e) = (1/p)·(w1 + w2 + … + wp), where wi denotes the word vector of word wi and dt is the name vector matrix dimension;
the fusion formula of the comprehensive distance in the step 3 is as follows:
D(e1,e2)=αDs(e1,e2)+(1-α)Dt(e1,e2)
where α is a hyperparameter used to adjust the weights of the two features;
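Under the definitions above, the fused distance is a weighted sum of the two dimension-normalized L1 distances; a minimal sketch (the vectors are illustrative):

```python
import numpy as np

def name_vector(word_vecs):
    """Entity name vector ne(e): average of the name's word vectors."""
    return np.mean(word_vecs, axis=0)

def combined_distance(s1, s2, n1, n2, alpha):
    """D(e1, e2) = alpha * Ds(e1, e2) + (1 - alpha) * Dt(e1, e2).

    s1, s2: structure vectors (dimension ds); n1, n2: name vectors
    (dimension dt); each L1 distance is normalized by its dimension.
    """
    d_s = np.abs(s1 - s2).sum() / s1.shape[0]
    d_t = np.abs(n1 - n2).sum() / n1.shape[0]
    return alpha * d_s + (1 - alpha) * d_t
```

With the reported setting α = 0.3, the name feature carries more weight than the structure feature.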
the characteristic distance is calculated through a word moving distance model, the word moving distance model aims to measure the difference among different sentences, and the word moving distance is represented as the minimum distance value of embedded vectors of all words in an entity which need to move to reach embedded vectors of all words in another entity;
the input of the graph convolution network is a characteristic matrix of an entity
Figure FDA0003643560070000022
And an adjacency matrix A of the graph, and the output is a feature matrix with structure information
Figure FDA0003643560070000023
N represents the number of nodes in the graph, and P and F represent the dimensions of the input and output matrix features, respectively, assuming the input of the l-th layer as the feature matrix of the nodes
Figure FDA0003643560070000024
Wherein d islDimension representing the characteristic matrix of the l-th layer, for the first layer, H1=X,d1P; the first layer output is
Figure FDA0003643560070000025
Wherein
Figure FDA0003643560070000026
I is an identity matrix and is a matrix of the identity,
Figure FDA0003643560070000027
is composed of
Figure FDA0003643560070000028
The diagonal matrix of (a) is,
Figure FDA0003643560070000029
is a parameter matrix of the l-th layer, dl+1Is the dimension of the feature matrix of the next layer, the activation function σ is often set to ReLU, H for the last layerl+1=Z,dl+1=F;
The initial feature matrix X is sampled from an L2-regularized truncated normal distribution and is updated through the training of each GCN layer, so that structural information in the knowledge graph is fully captured and the output feature matrix Z is generated; the feature matrix dimension is always set to ds, i.e. P = F = dl = ds, and the two GCNs share the parameter matrices W^(1) and W^(2) of their two layers;
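The propagation rule above can be sketched with plain NumPy (dense matrices for clarity; a real implementation would use sparse operations):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU activation

def gcn_two_layer(A, X, W1, W2):
    """Two stacked layers, one such network per knowledge graph;
    the layer weights W1, W2 are shared between the two networks."""
    return gcn_layer(A, gcn_layer(A, X, W1), W2)
```

Sharing W1 and W2 across the two networks, as the claim states, places the structure vectors of both graphs in a comparable space.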
The training objective is to minimize the following loss:

L = Σ_((e1,e2)∈S) Σ_((e1′,e2′)∈S′) [‖e1 − e2‖ + γ − ‖e1′ − e2′‖]+

where [x]+ = max{0, x}, S denotes the set of known entity pairs, S′ denotes the negative sample set generated from a known entity pair (e1, e2) by replacing e1 or e2 with a random entity, e denotes the structure vector of entity e, and γ denotes the margin separating positive samples from negative samples; model optimization is performed by stochastic gradient descent.
2. The knowledge graph fusion method of claim 1, wherein the difficulty level of a curriculum is characterized by the degree of the entity node: entities with higher degrees have richer structural information and are easier to align, while low-degree long-tail entities are relatively difficult to align; during the iterative training, easy entity pairs are added first and difficult entity pairs afterwards, so that the model is trained from easy to hard.
3. The knowledge graph fusion method of claim 2, wherein δ curricula are assumed, ordered from simple to difficult, with c1, …, cδ denoting a series of entity node degree values from large to small; among the high-confidence entity pairs obtained in each round of iterative training, only those whose node degree is greater than c1 are added to the training set, and the loop iterative training continues under this condition until the number of qualifying entity pairs falls below a given threshold θ2, whereupon training at this curriculum difficulty stops;
in the next stage of training, the curriculum difficulty is adjusted, and the condition becomes selecting, from the high-confidence entity pairs, those with degree greater than c2 to add to the training set; the loop iterative training continues at this difficulty until the number of qualifying newly added entity pairs falls below θ2, whereupon training at this curriculum difficulty stops; finally, the above steps are repeated to traverse the remaining curriculum difficulties c3, …, cδ.
CN201910967655.9A 2019-10-12 2019-10-12 Knowledge graph fusion method based on entity alignment Active CN110941722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910967655.9A CN110941722B (en) 2019-10-12 2019-10-12 Knowledge graph fusion method based on entity alignment


Publications (2)

Publication Number Publication Date
CN110941722A CN110941722A (en) 2020-03-31
CN110941722B true CN110941722B (en) 2022-07-01

Family

ID=69905917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910967655.9A Active CN110941722B (en) 2019-10-12 2019-10-12 Knowledge graph fusion method based on entity alignment

Country Status (1)

Country Link
CN (1) CN110941722B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563192B (en) * 2020-04-28 2023-05-30 腾讯科技(深圳)有限公司 Entity alignment method, device, electronic equipment and storage medium
CN111931505A (en) * 2020-05-22 2020-11-13 北京理工大学 Cross-language entity alignment method based on subgraph embedding
CN112016601B (en) * 2020-08-17 2022-08-05 华东师范大学 Network model construction method based on knowledge graph enhanced small sample visual classification
CN112131395B (en) * 2020-08-26 2023-09-26 浙江工业大学 Iterative knowledge graph entity alignment method based on dynamic threshold
CN111813962B (en) * 2020-09-07 2020-12-18 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN112084347B (en) * 2020-09-15 2023-08-25 东北大学 Knowledge representation learning-based data retrieval method and system
CN112131404B (en) * 2020-09-19 2022-09-27 哈尔滨工程大学 Entity alignment method in four-risk one-gold domain knowledge graph
CN112445876B (en) * 2020-11-25 2023-12-26 中国科学院自动化研究所 Entity alignment method and system for fusing structure, attribute and relationship information
CN112559759A (en) * 2020-12-03 2021-03-26 云知声智能科技股份有限公司 Method and equipment for identifying error relation in knowledge graph
CN112287126B (en) * 2020-12-24 2021-03-19 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph
CN112784065B (en) * 2021-02-01 2023-07-14 东北大学 Unsupervised knowledge graph fusion method and device based on multi-order neighborhood attention network
CN113111657B (en) * 2021-03-04 2024-05-03 浙江工业大学 Cross-language knowledge graph alignment and fusion method, device and storage medium
CN112765370B (en) * 2021-03-29 2021-07-06 腾讯科技(深圳)有限公司 Entity alignment method and device of knowledge graph, computer equipment and storage medium
CN112818137B (en) * 2021-04-19 2022-04-08 中国科学院自动化研究所 Entity alignment-based multi-source heterogeneous knowledge graph collaborative reasoning method and device
CN113360673B (en) * 2021-06-21 2023-07-07 浙江师范大学 Entity alignment method, device and storage medium of multi-mode knowledge graph
CN113420161B (en) * 2021-06-24 2024-07-02 平安科技(深圳)有限公司 Node text fusion method and device, computer equipment and storage medium
CN113761221B (en) * 2021-06-30 2022-02-15 中国人民解放军32801部队 Knowledge graph entity alignment method based on graph neural network
CN113656596B (en) * 2021-08-18 2022-09-20 中国人民解放军国防科技大学 Multi-modal entity alignment method based on triple screening fusion
CN113407759B (en) * 2021-08-18 2021-11-30 中国人民解放军国防科技大学 Multi-modal entity alignment method based on adaptive feature fusion
CN114036307B (en) * 2021-09-17 2022-09-13 清华大学 Knowledge graph entity alignment method and device
CN114090783A (en) * 2021-10-15 2022-02-25 北京大学 Heterogeneous knowledge graph fusion method and system
CN114003735B (en) * 2021-12-24 2022-03-18 北京道达天际科技有限公司 Knowledge graph question and answer oriented entity disambiguation method based on intelligence document
CN114564597B (en) * 2022-03-03 2024-09-17 上海工程技术大学 Entity alignment method integrating multidimensional and multi-information
CN116628247B (en) * 2023-07-24 2023-10-20 北京数慧时空信息技术有限公司 Image recommendation method based on reinforcement learning and knowledge graph
CN118364428B (en) * 2024-06-18 2024-08-20 安徽思高智能科技有限公司 RPA-oriented multi-mode entity alignment automatic fusion method, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9773031B1 (en) * 2016-04-18 2017-09-26 Color Genomics, Inc. Duplication and deletion detection using transformation processing of depth vectors
CN107480191A (en) * 2017-07-12 2017-12-15 清华大学 A kind of entity alignment model of iteration
CN110188206A (en) * 2019-05-08 2019-08-30 北京邮电大学 Collaboration iterative joint entity alignment schemes and device based on translation model
CN110245131A (en) * 2019-06-05 2019-09-17 江苏瑞中数据股份有限公司 Entity alignment schemes, system and its storage medium in a kind of knowledge mapping


Also Published As

Publication number Publication date
CN110941722A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110941722B (en) Knowledge graph fusion method based on entity alignment
CN110955780B (en) Entity alignment method for knowledge graph
CN106650789B (en) Image description generation method based on depth LSTM network
US20230206127A1 (en) Knowledge graph fusion method based on iterative completion
Alayrac et al. Unsupervised learning from narrated instruction videos
CN106610930B (en) Foreign language writing methods automatic error correction method and system
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN107704456B (en) Identification control method and identification control device
CN114090783A (en) Heterogeneous knowledge graph fusion method and system
CN105068997B (en) The construction method and device of parallel corpora
CN107590139B (en) Knowledge graph representation learning method based on cyclic matrix translation
JP6715492B2 (en) Identification control method and identification control device
Padó et al. Who sides with whom? Towards computational construction of discourse networks for political debates
CN110941720A (en) Knowledge base-based specific personnel information error correction method
CN107832297B (en) Feature word granularity-oriented domain emotion dictionary construction method
CN114625882B (en) Network construction method for improving unique diversity of image text description
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN117094311B (en) Method for establishing error correction filter for Chinese grammar error correction
CN115034221B (en) Overlapping relation extraction system based on BiLSTM combined with global pointer
CN110516240A (en) A kind of Semantic Similarity Measurement model DSSM technology based on Transformer
CN115438197A (en) Method and system for complementing relationship of matter knowledge map based on double-layer heterogeneous graph
CN104572632B (en) A kind of method in the translation direction for determining the vocabulary with proper name translation
CN113220908A (en) Knowledge graph matching method and device
CN113420766B (en) Low-resource language OCR method fusing language information
CN109753966A (en) A kind of Text region training system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant