CN110955780B

CN110955780B - Entity alignment method for knowledge graph

Info

Publication number: CN110955780B
Application number: CN201910968049.9A
Authority: CN
Inventors: 赵翔; 曾维新; 唐九阳; 徐浩; 谭真; 殷风景; 葛斌; 肖卫东
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2022-10-14
Anticipated expiration: 2039-10-12
Also published as: CN110955780A

Abstract

The invention discloses an entity alignment method for knowledge graphs, comprising the following steps: acquiring data of two knowledge graphs; learning the structure vectors of the entities with a graph convolutional network, and representing the entity names as word vectors; calculating the structural distance and the word feature distance between entities; fusing the two distances into a comprehensive distance that represents the degree of similarity between entities; and performing entity alignment according to the computed similarity. The method provides a basic entity alignment framework that fuses structural features with entity name features, and re-ranks the preliminary alignment results with a word mover's distance model, so as to fully exploit entity name information and improve the accuracy and timeliness of entity alignment.

Description

Entity alignment method for knowledge graph

Technical Field

The invention belongs to the field of knowledge graph data processing, and particularly relates to an entity alignment method for a knowledge graph.

Background

In recent years, a large number of knowledge graphs (KGs), such as YAGO, DBpedia, NELL, CN-DBpedia, and Zhishi.me, have been constructed. These large-scale knowledge graphs play an important role in intelligent services such as question-answering systems and personalized recommendation. In addition, to meet domain-specific needs, more and more domain knowledge graphs, such as medical knowledge graphs, are being built. During knowledge graph construction, a trade-off between coverage and accuracy is unavoidable, and no knowledge graph can be complete or entirely correct.

To improve the coverage and accuracy of a knowledge graph, one feasible approach is to introduce relevant knowledge from other knowledge graphs, because knowledge graphs constructed in different ways contain redundant and complementary knowledge. For example, a general-purpose knowledge graph extracted from web pages may contain only the name of a drug, while more information about it may be found in a medical knowledge graph built from medical data. To integrate knowledge from an external knowledge graph into the target knowledge graph, the most important step is to align the different knowledge graphs. For this reason, the entity alignment (EA) task has been proposed and has received wide attention. The task is to find pairs of entities in different knowledge graphs that express the same meaning. These entity pairs serve as hubs that link the different knowledge graphs for subsequent tasks.

At present, mainstream entity alignment methods judge whether two entities refer to the same thing mainly by means of the structural features of the knowledge graphs. Such methods assume that entities expressing the same meaning in different knowledge graphs have similar adjacency information. On artificially constructed datasets, this type of method achieves the best experimental results. However, recent work has pointed out that the knowledge graphs in these artificially constructed datasets are denser than real-world knowledge graphs, and that entity alignment methods based on structural features are far less effective on knowledge graphs with a normal, real-world distribution.

In fact, an analysis of the distribution of entities in real-world knowledge graphs shows that more than half of the entities are connected to only one or two other entities. These entities are called long-tail entities and account for most of the entities in a knowledge graph, so that the graph as a whole is highly sparse. This also matches the common understanding of real-world knowledge graphs: only a few entities are frequently used and have rich adjacency information, while most entities are mentioned only rarely and carry little structural information. Therefore, current entity alignment methods based on structural information do not perform well on real-world datasets.

Disclosure of Invention

In view of the above, the present invention provides an entity alignment method for knowledge graphs, which overcomes the drawback of the prior art of performing entity alignment using only entity structure information, and makes combined use of entity structure information and entity name information for entity alignment, thereby improving the alignment efficiency.

Based on the above purpose, the present invention provides an entity alignment method for knowledge graph, which includes the following steps:

step 1, acquiring data of two knowledge graphs;

step 2, learning the structure vectors of the entities by using a graph convolutional network, and representing the names of the entities as word vectors;

step 3, calculating the structural distance and the word feature distance between entities;

step 4, fusing the two distances into a comprehensive distance that represents the degree of similarity between entities;

step 5, performing entity alignment according to the computed degree of similarity to obtain similar entity pairs.

The two knowledge graphs are represented as $G_1 = (E_1, R_1, T_1)$ and $G_2 = (E_2, R_2, T_2)$, where $E$ denotes the set of entities, $R$ the set of relations, and $T \subseteq E \times R \times E$ the set of triples in the graph. The known entity pairs are represented as

$S = \{(e_1, e_2) \mid e_1 = e_2,\ e_1 \in E_1,\ e_2 \in E_2\}$

The entity alignment task aims to find new entity pairs by using the information of the known entity pairs, and to generate the final alignment result

$\{(e_1, e_2) \mid e_1 = e_2,\ e_1 \in E_1,\ e_2 \in E_2\}$

where the equal sign indicates that the two entities refer to the same real-world entity;
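As a rough illustration of the notation above, the following Python sketch stores a knowledge graph as a set of (head, relation, tail) triples and derives the entity set E and relation set R from it. The class name `KnowledgeGraph`, the toy drug-related triples, and the seed pair are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """A knowledge graph G = (E, R, T), stored as its set of triples T."""
    triples: set = field(default_factory=set)  # T: (head, relation, tail)

    @property
    def entities(self):
        # E: every entity that appears as a head or a tail of some triple
        heads = {h for h, _, _ in self.triples}
        tails = {t for _, _, t in self.triples}
        return heads | tails

    @property
    def relations(self):
        # R: every relation that appears in some triple
        return {r for _, r, _ in self.triples}


# Toy data (hypothetical): two small graphs describing the same drug.
g1 = KnowledgeGraph({("aspirin", "treats", "headache"),
                     ("aspirin", "type", "drug")})
g2 = KnowledgeGraph({("acetylsalicylic_acid", "indication", "headache_disorder")})

# S: known (seed) entity pairs (e1, e2), with e1 from G1 and e2 from G2.
seed_pairs = {("headache", "headache_disorder")}

# Every seed pair must reference an entity of the corresponding graph.
assert all(e1 in g1.entities and e2 in g2.entities for e1, e2 in seed_pairs)
```

In this representation the alignment task amounts to extending `seed_pairs` with further pairs, for example ("aspirin", "acetylsalicylic_acid").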

in step 2, two two-layer graph convolutional networks are used, each processing one of the two knowledge graphs and generating the corresponding entity structure vectors;

in step 3, for entities $e_1 \in G_1$ and $e_2 \in G_2$ of the two knowledge graphs, the structural distance in the structural space is $D_s(e_1, e_2) = \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} / d_s$, where $d_s$ is the dimension of the structural matrix; the word feature distance is $D_t(e_1, e_2) = \lVert ne(e_1) - ne(e_2) \rVert_{l_1} / d_t$. Supposing that the name of entity $e$ contains the words $w_1, w_2, \ldots, w_p$, the entity name vector may be represented as the average of the corresponding word vectors, i.e.

$ne(e) = \frac{1}{p} \sum_{i=1}^{p} \mathbf{w}_i$

where $\mathbf{w}_i$ is the word vector of $w_i$, and $d_t$ is the dimension of the name vector matrix;

the fusion formula of the comprehensive distance in the step 4 is as follows:

$D(e_1, e_2) = \alpha D_s(e_1, e_2) + (1 - \alpha) D_t(e_1, e_2)$

where α is the hyperparameter used to adjust the weights of the two features.

Preferably, the word feature distance is calculated by a word mover's distance model, which measures the difference between sentences; the word mover's distance is the minimum cumulative distance that the embedded vectors of all words in one entity name need to travel to reach the embedded vectors of all words in another entity name.

Specifically, the input of the graph convolutional network is the feature matrix of the entities $X \in \mathbb{R}^{N \times P}$ and the adjacency matrix $A$ of the graph, and the output is a feature matrix incorporating structural information, $Z \in \mathbb{R}^{N \times F}$, where $N$ is the number of nodes in the graph, and $P$ and $F$ are the feature dimensions of the input and output matrices, respectively. Assume that the input of the $l$-th layer is the node feature matrix $H^{l} \in \mathbb{R}^{N \times d^{l}}$, where $d^{l}$ is the dimension of the feature matrix of the $l$-th layer; for the first layer, $H^{1} = X$ and $d^{1} = P$. The output of the $l$-th layer is

$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right)$

where $\hat{A} = A + I$, $I$ is the identity matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, $W^{l} \in \mathbb{R}^{d^{l} \times d^{l+1}}$ is the parameter matrix of the $l$-th layer, and $d^{l+1}$ is the dimension of the feature matrix of the next layer. The activation function $\sigma$ is often set to ReLU. For the last layer, $H^{l+1} = Z$ and $d^{l+1} = F$.

Specifically, the initial feature matrix $X$ is sampled from an L2-regularized truncated normal distribution and is updated through the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix $Z$ is generated; the dimension of the feature matrix is always set to $d_s$, i.e. $P = F = d^{l} = d_s$, and the two GCNs share the parameter matrices $W^{1}$ and $W^{2}$ of the two layers.

Specifically, the training objective is to minimize the following loss value:

$L = \sum_{(e_1, e_2) \in S} \sum_{(e_1', e_2') \in S'_{(e_1, e_2)}} \left[ \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} - \lVert \mathbf{e}_1' - \mathbf{e}_2' \rVert_{l_1} + \gamma \right]_+$

where $[x]_+ = \max\{0, x\}$, $S'_{(e_1, e_2)}$ denotes the set of negative samples generated from a known entity pair $(e_1, e_2)$ by replacing $e_1$ or $e_2$ with a random entity, $\mathbf{e}$ denotes the structure vector of entity $e$, and $\gamma$ denotes the margin separating positive samples from negative samples; the model is optimized by stochastic gradient descent.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) A basic entity alignment framework that fuses structural features with entity name features is designed. Entity names and structural information complement each other, and the framework can greatly improve the alignment results for long-tail entities, thereby improving the overall alignment performance.

(2) A word mover's distance model is adopted to re-rank the preliminary alignment results, so as to fully exploit entity name information and improve alignment accuracy.

Drawings

Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

As shown in fig. 1, an entity alignment method for a knowledge-graph includes the following steps:

step 1, acquiring data of two knowledge graphs;

For a better understanding of the present disclosure, the meanings of the symbols used are given below. $H^{l}$: structural feature matrix of the $l$-th layer; $N$: number of nodes; $X$: initial structural feature matrix; $d^{l}$: dimension of the feature matrix of the $l$-th layer; $Z$: final structural feature matrix; $S$: known entity pairs; $A$: adjacency matrix; $d_s$: dimension of the structural matrix; $W^{l}$: parameter matrix of the $l$-th layer; $D_s$: distance between entities in the structural space; $\mathbf{e}$: structure vector of entity $e$; $d_t$: dimension of the name vector matrix; $P$: dimension of the initial structural feature matrix; $F$: dimension of the final structural feature matrix; $\mathbf{N}$: name vector matrix of the entities; $D$: distance between entities; $G_1$: knowledge graph 1 to be aligned; $G_2$: knowledge graph 2 to be aligned; $e_1$: an entity in $G_1$; $e_2$: an entity in $G_2$; $\Delta_1$: difference in distance between $e_1$ and its two nearest entities; $\Delta_2$: difference in distance between $e_2$ and its two nearest entities; $\theta_1$: distance difference threshold; $\theta_2$: threshold on the number of newly added entity pairs.

A formal description of the entity alignment problem is as follows. Given two knowledge graphs $G_1 = (E_1, R_1, T_1)$ and $G_2 = (E_2, R_2, T_2)$, where $E$ denotes the set of entities, $R$ the set of relations, and $T \subseteq E \times R \times E$ the set of triples in the graph, the known entity pairs are represented as

$S = \{(e_1, e_2) \mid e_1 = e_2,\ e_1 \in E_1,\ e_2 \in E_2\}$

where the equal sign indicates that the two entities refer to the same real-world entity. Given an entity, finding its counterpart in the other knowledge graph can be treated as a ranking problem: in a given feature space, the degree of similarity (distance) between the given entity and every entity in the other knowledge graph is calculated and ranked, and the entity with the highest similarity (smallest distance) is taken as the alignment result.
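The ranking view of alignment can be sketched in a few lines of Python. The function name `rank_candidates`, the candidate identifiers, and the distance values below are made up for illustration; in the method itself the distances would come from the feature spaces described later.

```python
def rank_candidates(distances):
    """Order candidate entities from smallest to largest distance."""
    return sorted(distances, key=distances.get)


# Hypothetical distances from one entity of G1 to three entities of G2.
distances_from_e1 = {"b1": 0.42, "b2": 0.07, "b3": 0.31}

ranking = rank_candidates(distances_from_e1)
aligned_entity = ranking[0]  # smallest distance = highest similarity
print(ranking, aligned_entity)
```

Keeping the full ranking, rather than only the top entry, is what later allows a more expensive model to re-rank the best few candidates.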

Taking medical knowledge graphs as an example: to obtain more medical knowledge, several independent medical knowledge graphs can be fused. To fuse them well, the entities in the medical knowledge graphs, including drug names, disease names, and symptom names, need to be identified. These three types are the most basic entities of a medical knowledge graph; aligning them meets the most basic requirements of a medical knowledge graph, and whether other entities are extracted can be decided according to actual needs.

This embodiment captures the adjacency structure of the entities and generates entity structure representation vectors by using a graph convolutional network (GCN). A GCN is a convolutional network that operates directly on graph-structured data and generates node structure vectors by capturing the structural information around each node. The input of the GCN is the feature matrix of the entities $X \in \mathbb{R}^{N \times P}$ and the adjacency matrix $A$ of the graph. The output is a feature matrix incorporating structural information, $Z \in \mathbb{R}^{N \times F}$, where $N$ is the number of nodes in the graph, and $P$ and $F$ are the feature dimensions of the input and output matrices, respectively.
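The text does not spell out how the adjacency matrix A is derived from the triples. The sketch below shows one simple possibility, assuming a binary, symmetric matrix in which two entities are adjacent whenever some triple connects them. This construction, the function name `build_adjacency`, and the toy triples are assumptions made for illustration.

```python
def build_adjacency(entities, triples):
    """Build a binary, symmetric N x N adjacency matrix as nested lists."""
    index = {entity: i for i, entity in enumerate(entities)}
    n = len(entities)
    adjacency = [[0.0] * n for _ in range(n)]
    for head, _relation, tail in triples:
        i, j = index[head], index[tail]
        adjacency[i][j] = 1.0
        adjacency[j][i] = 1.0  # assumption: edges are treated as undirected
    return adjacency, index


# Toy graph (hypothetical): one hub entity connected to two neighbours.
entities = ["aspirin", "headache", "drug"]
triples = [("aspirin", "treats", "headache"), ("aspirin", "type", "drug")]

A, index = build_adjacency(entities, triples)
for row in A:
    print(row)
```

A long-tail entity such as "headache" has a single non-zero entry in its row, which is why structural information alone tends to be weak for such entities.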

GCN models typically contain multiple GCN layers. In particular, assume that the input of the $l$-th layer is the node feature matrix $H^{l} \in \mathbb{R}^{N \times d^{l}}$, where $d^{l}$ is the dimension of the feature matrix of the $l$-th layer (for the first layer, $H^{1} = X$ and $d^{1} = P$). The output of the $l$-th layer is

$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right)$

where $\hat{A} = A + I$, $I$ is the identity matrix, and $\hat{D}$ is the diagonal degree matrix of $\hat{A}$. $W^{l} \in \mathbb{R}^{d^{l} \times d^{l+1}}$ is the parameter matrix of the $l$-th layer, and $d^{l+1}$ is the dimension of the feature matrix of the next layer. The activation function $\sigma$ is often set to ReLU. For the last layer, $H^{l+1} = Z$ and $d^{l+1} = F$.
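The layer-wise propagation described above can be sketched in plain Python: add self-loops to the adjacency matrix, normalize it symmetrically by node degree, multiply by the layer input and the parameter matrix, and apply ReLU. The helper names and the tiny two-node graph are illustrative assumptions; a practical implementation would use a tensor library and learn W by training.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adjacency):
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_norm, h, w):
    """One propagation step: ReLU(a_norm @ h @ w)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, value) for value in row] for row in z]


# Tiny example: two connected nodes, 2-dimensional features.
A = [[0.0, 1.0], [1.0, 0.0]]
H = [[1.0, 0.0], [0.0, 1.0]]   # layer input
W = [[1.0, -1.0], [2.0, 0.0]]  # hypothetical, untrained parameter matrix

A_norm = normalize_adjacency(A)
H_next = gcn_layer(A_norm, H, W)
print(H_next)
```

Stacking two such calls, with the output of the first used as the input of the second, corresponds to a two-layer network.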

In this embodiment, two two-layer GCNs are constructed, each of which processes one knowledge graph and generates the corresponding entity structure vectors. The initial feature matrix $X$ is sampled from an L2-regularized truncated normal distribution and is updated through the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix $Z$ is generated. The dimension of the feature matrix is always set to $d_s$ ($P = F = d^{l} = d_s$), and the two GCNs share the parameter matrices $W^{1}$ and $W^{2}$ of the two layers.

The entity structure vectors of different knowledge graphs do not lie in the same space, so the known entity pairs $S$ are used to align them into the same space. The specific training objective is to minimize the following loss value:

$L = \sum_{(e_1, e_2) \in S} \sum_{(e_1', e_2') \in S'_{(e_1, e_2)}} \left[ \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} - \lVert \mathbf{e}_1' - \mathbf{e}_2' \rVert_{l_1} + \gamma \right]_+$

where $[x]_+ = \max\{0, x\}$, and $S'_{(e_1, e_2)}$ denotes the set of negative samples generated from a known entity pair $(e_1, e_2)$ by replacing $e_1$ or $e_2$ with a random entity. $\mathbf{e}$ denotes the structure vector of entity $e$. $\gamma$ denotes the margin separating positive samples from negative samples. The model is optimized by stochastic gradient descent.
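A minimal sketch of this margin-based objective is given below, assuming an L1 distance between structure vectors (consistent with the structural distance used elsewhere in the text). The function names, the toy embeddings, and the fixed negative samples are hypothetical; the gradient step itself is omitted.

```python
import random


def l1_distance(u, v):
    """L1 (Manhattan) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def corrupt(pair, entities1, entities2, rng):
    """Create a negative sample by replacing e1 or e2 with a random entity."""
    e1, e2 = pair
    if rng.random() < 0.5:
        return rng.choice([e for e in entities1 if e != e1]), e2
    return e1, rng.choice([e for e in entities2 if e != e2])


def margin_loss(seed_pairs, negatives, emb1, emb2, gamma):
    """Sum of [d(positive) - d(negative) + gamma]_+ over all negatives."""
    total = 0.0
    for pair in seed_pairs:
        e1, e2 = pair
        positive = l1_distance(emb1[e1], emb2[e2])
        for n1, n2 in negatives[pair]:
            negative = l1_distance(emb1[n1], emb2[n2])
            total += max(0.0, positive - negative + gamma)
    return total


# Toy structure vectors (hypothetical values).
emb1 = {"a1": [0.0, 0.0], "a2": [1.0, 1.0]}
emb2 = {"b1": [0.1, 0.0], "b2": [1.0, 0.9]}
seed_pairs = [("a1", "b1")]
negatives = {("a1", "b1"): [("a1", "b2"), ("a2", "b1")]}

print(margin_loss(seed_pairs, negatives, emb1, emb2, gamma=3.0))
print(corrupt(("a1", "b1"), list(emb1), list(emb2), random.Random(0)))
```

The loss reaches zero once every negative pair is at least gamma farther apart than its positive pair, so a smaller margin is satisfied sooner.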

Given the final structural feature matrix $Z$, the distance between $e_1 \in G_1$ and $e_2 \in G_2$ in the structural space is $D_s(e_1, e_2) = \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} / d_s$.

If only structural features are considered, the entity with the smallest distance $D_s$ to the target entity $e$ is regarded as the counterpart of $e$.
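The structural distance is an L1 distance divided by the vector dimension, which can be sketched as follows. The function names and the 4-dimensional vectors are illustrative assumptions.

```python
def structural_distance(v1, v2):
    """D_s = ||v1 - v2||_1 / d_s, where d_s is the vector dimension."""
    if len(v1) != len(v2):
        raise ValueError("structure vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(v1, v2)) / len(v1)


def nearest_by_structure(target_vector, candidate_vectors):
    """Return the candidate whose structure vector is closest to the target."""
    return min(candidate_vectors,
               key=lambda name: structural_distance(target_vector,
                                                    candidate_vectors[name]))


# Hypothetical structure vectors with d_s = 4.
target = [1.0, 0.0, 2.0, 1.0]
candidates = {"x": [1.0, 0.0, 2.0, 0.0], "y": [0.0, 0.0, 0.0, 0.0]}

print(structural_distance(target, candidates["x"]))
print(nearest_by_structure(target, candidates))
```

Dividing by the dimension keeps the structural distance on a scale comparable to the name-based distance, whose vectors may have a different dimension.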

Unlike the prior art, the present embodiment proposes to use text features for alignment as well. Specifically, the textual form of entity names is used, considering that: 1) entity names are commonly used to identify entities and are widely available; 2) comparing entity names gives a direct indication of whether two entities are the same; 3) name features are not affected by the size of the training set and are therefore more stable.

Although conventional string comparison methods can be used to measure the similarity between two entity names, this embodiment uses the semantic similarity of entity names, because it remains applicable when the knowledge bases differ greatly, as in the alignment of multilingual knowledge bases. Specifically, the averaged word vector representation is used as the entity name vector, because it is simple and general and can express semantic information without a special corpus. Supposing that the name of entity $e$ contains the words $w_1, w_2, \ldots, w_p$, the entity name vector may be represented as the average of the corresponding word vectors, i.e.

$ne(e) = \frac{1}{p} \sum_{i=1}^{p} \mathbf{w}_i$

where $\mathbf{w}_i$ is the word vector of $w_i$. The name vectors of all entities can be represented as $\mathbf{N}$.
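Averaging word vectors into a name vector can be sketched as below. The toy 2-dimensional embeddings are invented for illustration, and tokenizing by lowercasing and splitting on whitespace, as well as skipping words without an embedding, are assumptions that the text does not specify.

```python
def name_vector(name, word_vectors):
    """Average the word vectors of the words in an entity name."""
    words = name.lower().split()
    # Assumption: words without an embedding are skipped.
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError(f"no known words in entity name: {name!r}")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy word embeddings (hypothetical values, d_t = 2).
word_vectors = {
    "new": [1.0, 0.0],
    "york": [0.0, 1.0],
    "city": [1.0, 1.0],
}

print(name_vector("New York", word_vectors))
print(name_vector("New York City", word_vectors))
```

Because the average discards word order and word-level detail, two different names can end up with similar vectors, which is the semantic loss that a word-level model is meant to reduce.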

As with word vectors, similar entity names will be very close in the vector space. The distance between $e_1 \in G_1$ and $e_2 \in G_2$ in the text feature space is $D_t(e_1, e_2) = \lVert ne(e_1) - ne(e_2) \rVert_{l_1} / d_t$. If only entity name features are considered, the entity with the smallest distance $D_t$ to the target entity $e$ is regarded as the counterpart of $e$. For cross-lingual entity alignment, pre-trained cross-lingual word vectors can be used, thereby ensuring that the entity name vectors of different languages lie in the same space.

Considering that structural features and name features describe entities from two different aspects, structural and semantic respectively, they can be combined to provide a more comprehensive alignment signal. In particular, the distance between two entities $e_1 \in G_1$ and $e_2 \in G_2$ is:

$D(e_1, e_2) = \alpha D_s(e_1, e_2) + (1 - \alpha) D_t(e_1, e_2)$

where $\alpha$ is the hyperparameter used to adjust the weights of the two features. In the space after feature fusion, the entity with the smallest distance $D$ to the target entity $e$ is regarded as the counterpart of $e$.
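The effect of the weight α can be seen in a small sketch. The distances below are invented to mimic a long-tail entity whose structural signal is misleading while its name signal is clear; the function names are illustrative.

```python
def fused_distance(d_s, d_t, alpha):
    """D = alpha * D_s + (1 - alpha) * D_t."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_s + (1.0 - alpha) * d_t


def align(candidates, alpha):
    """candidates maps an entity id to (D_s, D_t); return the closest id."""
    def distance(entity):
        d_s, d_t = candidates[entity]
        return fused_distance(d_s, d_t, alpha)
    return min(candidates, key=distance)


# Hypothetical (D_s, D_t) pairs for two candidates in the other graph.
candidates = {"b1": (0.30, 0.05), "b2": (0.20, 0.60)}

print(align(candidates, alpha=1.0))  # structure only
print(align(candidates, alpha=0.5))  # structure and names weighted equally
```

With α = 1 only the structural distance counts and "b2" is chosen; with α = 0.5 the name distance tips the decision to "b1".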

The word mover's distance model measures the difference between sentences: it is the minimum cumulative distance that the embedded vectors of all words in one sentence need to travel to reach the embedded vectors of all words in another sentence. Compared with the distance between averaged word vectors, the word mover's distance better reflects the contribution of each word to the whole sentence and avoids the semantic loss caused by averaging. However, the model is time-consuming because it computes word-level distances, and it is not suitable for large-scale data. For this reason, it is not used to calculate the distances between entity names from the outset; instead, it is used to re-rank the preliminary results.
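Re-ranking a short candidate list with a word-level distance can be sketched as follows. Note that this sketch does not compute the exact word mover's distance, which requires solving an optimal transport problem; it substitutes the simpler relaxed word mover's distance, a known lower bound in which every word moves entirely to its nearest word in the other name. Uniform word weights, Euclidean word distances, the function names, and the toy embeddings are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a, words_b, word_vectors):
    """Relaxed word mover's distance (lower bound on the exact distance)."""
    def one_sided(source, target):
        # Each source word carries weight 1/len(source) to its nearest target.
        return sum(min(euclidean(word_vectors[s], word_vectors[t])
                       for t in target) for s in source) / len(source)
    return max(one_sided(words_a, words_b), one_sided(words_b, words_a))


def rerank(target_name, top_k_names, word_vectors):
    """Re-order preliminary candidates by word-level distance to the target."""
    target_words = target_name.split()
    return sorted(top_k_names,
                  key=lambda n: relaxed_wmd(target_words, n.split(), word_vectors))


# Toy word embeddings (hypothetical values).
word_vectors = {
    "new": [1.0, 0.0],
    "york": [0.0, 1.0],
    "city": [1.0, 1.0],
    "jersey": [0.0, -1.0],
}

preliminary = ["new jersey", "new york"]  # hypothetical preliminary ranking
print(rerank("new york city", preliminary, word_vectors))
```

Because only the top few candidates from the preliminary ranking are re-scored, the word-level cost stays small even when the knowledge graphs are large.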

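The main contributions of the present invention are as follows:

(1) A basic entity alignment framework that fuses structural features with entity name features is designed.

(2) A word mover's distance model is adopted to re-rank the preliminary alignment results, so as to fully exploit entity name information and improve alignment accuracy.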

To address the scarcity of structural information in real-world knowledge graph datasets, the invention combines entity name information, which is unaffected by the degree of an entity node, with structural information to construct a basic entity alignment framework. On this basis, the word mover's distance model is used to further exploit entity name information and re-rank the preliminary results, producing the final alignment result. The model achieves the best results on widely used entity alignment datasets.

The above examples are an implementation of the method for knowledge-graph fusion, but the implementation of the method is not limited by the examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be regarded as equivalent substitutions, and are included in the scope of the present invention.

Claims

1. A method for entity alignment of a knowledge graph, comprising the steps of:

step 1, acquiring data of two knowledge graphs;

step 2, learning the structure vectors of the entities by using a graph convolutional network, and representing the names of the entities as word vectors;

step 3, calculating the structural distance and the word feature distance between entities;

step 4, fusing the two distances into a comprehensive distance that represents the degree of similarity between entities;

step 5, performing entity identification alignment according to the calculation result of the similarity degree to obtain a similar entity pair;

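the two knowledge graphs are represented as $G_1 = (E_1, R_1, T_1)$ and $G_2 = (E_2, R_2, T_2)$, where $E$ denotes the set of entities, $R$ the set of relations, and $T \subseteq E \times R \times E$ the set of triples in the graph, and the known entity pairs are represented as $S = \{(e_1, e_2) \mid e_1 = e_2,\ e_1 \in E_1,\ e_2 \in E_2\}$, where the equal sign indicates that the two entities refer to the same real-world entity;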

in step 3, for entities $e_1 \in G_1$ and $e_2 \in G_2$ of the two knowledge graphs, the structural distance in the structural space is $D_s(e_1, e_2) = \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} / d_s$, where $d_s$ is the dimension of the structural matrix;

the word feature distance is based on the semantic similarity of the entity names: the averaged word vector is used as the entity name vector, and the distance between entity name vectors in the text feature space is calculated; specifically, the word feature distance is $D_t(e_1, e_2) = \lVert ne(e_1) - ne(e_2) \rVert_{l_1} / d_t$. Supposing that the name of entity $e$ contains the words $w_1, w_2, \ldots, w_p$, the entity name vector may be represented as the average of the corresponding word vectors, i.e.

$ne(e) = \frac{1}{p} \sum_{i=1}^{p} \mathbf{w}_i$

where $\mathbf{w}_i$ is the word vector of $w_i$, and $d_t$ is the dimension of the name vector matrix;

further, the word feature distance is calculated by a word mover's distance model, which measures the difference between sentences; the word mover's distance is the minimum cumulative distance that the embedded vectors of all words in one entity name need to travel to reach the embedded vectors of all words in another entity name;

the fusion formula of the comprehensive distance in the step 4 is as follows:

$D(e_1, e_2) = \alpha D_s(e_1, e_2) + (1 - \alpha) D_t(e_1, e_2)$

where α is the hyperparameter used to adjust the weights of the two features.

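2. The entity alignment method as claimed in claim 1, wherein the input of the graph convolutional network is the feature matrix of the entities $X \in \mathbb{R}^{N \times P}$ and the adjacency matrix $A$ of the graph, and the output is a feature matrix incorporating structural information, $Z \in \mathbb{R}^{N \times F}$, where $N$ is the number of nodes in the graph, and $P$ and $F$ are the feature dimensions of the input and output matrices, respectively; the input of the $l$-th layer is the node feature matrix $H^{l} \in \mathbb{R}^{N \times d^{l}}$, where $d^{l}$ is the dimension of the feature matrix of the $l$-th layer, and for the first layer, $H^{1} = X$ and $d^{1} = P$;

the output of the $l$-th layer is

$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right)$

where $\hat{A} = A + I$, $I$ is the identity matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, and $W^{l} \in \mathbb{R}^{d^{l} \times d^{l+1}}$ is the parameter matrix of the $l$-th layer.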

3. The entity alignment method according to claim 2, wherein the initial feature matrix $X$ is sampled from an L2-regularized truncated normal distribution and is updated through the training of each GCN layer, so that the structural information in the knowledge graph is fully captured and the output feature matrix $Z$ is generated; the dimension of the feature matrix is always set to $d_s$, i.e. $P = F = d^{l} = d_s$, and the two GCNs share the parameter matrices $W^{1}$ and $W^{2}$ of the two layers.

4. The entity alignment method of claim 3, wherein the training objective is to minimize the following loss value:

$L = \sum_{(e_1, e_2) \in S} \sum_{(e_1', e_2') \in S'_{(e_1, e_2)}} \left[ \lVert \mathbf{e}_1 - \mathbf{e}_2 \rVert_{l_1} - \lVert \mathbf{e}_1' - \mathbf{e}_2' \rVert_{l_1} + \gamma \right]_+$

where $[x]_+ = \max\{0, x\}$, $S'_{(e_1, e_2)}$ denotes the set of negative samples generated from a known entity pair $(e_1, e_2)$ by replacing $e_1$ or $e_2$ with a random entity, $\mathbf{e}$ denotes the structure vector of entity $e$, and $\gamma$ denotes the margin separating positive samples from negative samples; the model is optimized by stochastic gradient descent.