CN114090783A - Heterogeneous knowledge graph fusion method and system - Google Patents


Info

Publication number
CN114090783A
Authority
CN
China
Prior art keywords
entity
knowledge
attribute
learning
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111202752.2A
Other languages
Chinese (zh)
Inventor
杨恺
王亚沙
赵俊峰
单中原
邹佩聂
李瑞庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202111202752.2A priority Critical patent/CN114090783A/en
Publication of CN114090783A publication Critical patent/CN114090783A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a heterogeneous knowledge graph fusion method and system. It addresses two shortcomings of the prior art: structural information alone is often insufficient to distinguish different entities under the same concept, and limited training data constrains the accuracy of entity embedding learning in knowledge-graph-embedding-based methods. The proposed method fuses structural information with attribute information and makes full use of both kinds of information in a graph, entity structure and entity attributes: a structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin (Siamese) neural network model with a shared attention mechanism. In each iteration, the best matches found from the two kinds of information are labeled and added to the training set as new annotated data, so that the models for the two kinds of information assist each other and are iteratively enhanced, finally yielding a more accurate entity alignment result.

Description

Heterogeneous knowledge graph fusion method and system
Technical Field
The invention belongs to the technical field of knowledge graphs, and particularly relates to a heterogeneous knowledge graph fusion method and system.
Background
In recent years, knowledge graphs have played an increasingly important role in technologies such as information retrieval, recommendation systems, and machine reading comprehension. Different organizations build knowledge graphs according to their own needs and data sources, so graphs in the same domain can take different forms. For example, the entity "/resource/health medical health center" in the Weihai smart-city knowledge graph and the entity "health medical institution" in the Weihai medical knowledge graph both refer to a medical institution in Weihai, yet knowledge graphs built from different data create two distinct instance nodes. Many domain knowledge graphs are generated and evolve independently; although their entity naming conventions, representation forms, and entity relations may differ, the content they represent is consistent and complementary. Integrating different knowledge graphs into a larger, uniform, and consistent body of knowledge is therefore important for knowledge reasoning and question answering. To fuse different knowledge graphs, one of the primary tasks is to identify entities in different graphs that denote the same real-world object, commonly referred to as the entity alignment problem. Entity alignment matches entities across knowledge graphs from different sources and is the first step of knowledge graph fusion.
Current entity-alignment-based knowledge graph fusion techniques generally fall into the following categories:
the first category aligns entities mainly by computing the similarity of entity names or attribute names. For example, the patent with publication number CN113032516A (published 20210625) discloses a "knowledge graph fusion method based on approximate ontology matching", which computes entity similarity with an approximate ontology matching method and obtains the final entity alignment result by propagating similarity over the graph. When entity names differ greatly between the two graphs, this kind of method struggles to achieve an accurate alignment effect.
The second category learns semantic information for each entity with representation learning over the knowledge graph, computes the similarity of entity representations between the two graphs, and completes alignment by finding the most similar entity pairs. The most typical approach uses a knowledge graph embedding technique such as TransE to compute a vector representation of each entity in each graph, and uses pre-aligned "seed entities" to learn a transfer matrix between the graphs, providing a cross-lingual space transformation for the embedded representation of each entity. For example, the patent with publication number CN111191471A (published 20200522) discloses a "knowledge graph fusion method based on entity sequence coding", which learns entity representation vectors with the RotatE model and applies a bootstrapping semi-supervised learning method to supplement the "seed entity" set in each alignment iteration, enlarging the model training set to overcome seed-entity sparsity in entity alignment and obtain a better alignment result.
The third category improves the entity representation learning model itself to better suit entity alignment. For example, the patent with publication number CN110941722A (published 20200331) discloses "a knowledge graph fusion method based on entity alignment", which learns entity representation vectors with a GCN model and incorporates multi-hop neighbor information and paths, so that the alignment process takes structural information with a wider receptive range into account.
In general, these methods complete the entity alignment task with structural information alone and are mostly applied to cross-lingual entity alignment. Within a single-language knowledge graph, different entities under the same concept are hard to distinguish with structural information alone: such entities often share the same relations and may even connect to the same entities. Entities with similar structural information can, however, be distinguished by their attribute information, which in a knowledge graph comprises attribute categories and concrete attribute values. Existing entity alignment work based on attribute similarity does not scale to the large, complex knowledge graphs found in practice, and different attribute categories also differ in how important they are to entity alignment. In summary, for entity alignment based on attribute and structural information, many problems remain in how entity attribute information is exploited.
When aligning heterogeneous knowledge graph entities, existing methods suffer from several problems: structural information alone cannot distinguish different entities under the same concept; methods that consider only attribute categories and ignore attribute values are limited; and limited training data constrains the accuracy of entity embedding learning in knowledge-graph-embedding-based methods.
Disclosure of Invention
To address the defects of the prior art, the invention provides a heterogeneous knowledge graph entity alignment method that fuses structural and attribute information: the two kinds of information are exploited fully and learned collaboratively with mutual gain, entities with the same semantics are found across knowledge graphs of two domains, and the accuracy of heterogeneous knowledge graph entity alignment is improved.
To achieve the above purposes, the invention adopts the following technical scheme:
a heterogeneous knowledge graph fusion method comprises the following steps:
s1, arranging the data of the knowledge graph to be matched, and forming a knowledge graph training set by the knowledge graph independently generated by various data with different sources and different structures, wherein the knowledge graph training set comprises a structure training set and an entity attribute training set;
s2, learning entity expression vectors from knowledge map structure information and knowledge map attribute information required by matching learning entities in the deep learning model through a knowledge map expression learning technology;
s3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein the similarity calculation method is based on Euclidean distance or Cosine distance;
and S4, finding the best match among the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
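Steps S3 and S4 can be sketched as follows. This is a minimal numpy sketch under stated assumptions: entity representation vectors are stored as row matrices, cosine similarity is used, and the one-to-one matching of S4 is approximated with a mutual-nearest-neighbor rule (an illustrative choice, not mandated by the method).

```python
import numpy as np

def cosine_similarity(A, B):
    """Pairwise cosine similarity between rows of A (n x d) and B (m x d)."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

def best_matches(A, B):
    """One-to-one best matches taken as mutual nearest neighbors
    under cosine similarity."""
    S = cosine_similarity(A, B)
    a_to_b = S.argmax(axis=1)   # best candidate in B for each entity in A
    b_to_a = S.argmax(axis=0)   # best candidate in A for each entity in B
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]
```

The same routine would be run twice, once on the structure-based vectors and once on the attribute-based vectors, yielding the two entity matching sets.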
Further, step S1 is preceded by the following steps:
S11: generating an initial training set from the pre-aligned graph datasets;
S12: learning the embedded representation of the entity structure with a structure embedding learning model over a set number of iterations;
S13: aligning entities and adding the best-matching entity set with higher confidence to the training set;
S14: training the attribute embedding learning model with the supplemented training set over a set number of iterations, aligning entities, and further supplementing the training set used to continue training the structure embedding model.
Further, step S4 also comprises adding the best-matching entity set with higher confidence to the initial training set of step S11, enlarging that training set so that the structural information and the attribute information promote each other during iterative training.
Further, the supplementary best-matching entity set contains only positive samples, and the nearest-neighbor method is not used to sample negative samples.
Further, in step S2, for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information;
the attribute information of an entity comprises its attribute categories and corresponding attribute values; for entity attribute learning, a twin (Siamese) neural network model based on a shared attention mechanism is adopted to learn attribute-based entity representations.
Further, for entity structure information learning, based on a limit loss function, the method adopts the objective function O_TSR:

O_TSR = Σ_{τ∈T+} [f(τ) − γ1]+ + μ · Σ_{τ'∈T−} [γ2 − f(τ')]+

where [x]+ = max(x, 0), γ1 and γ2 are two hyperparameters limiting the scores of the positive and negative samples, μ > 0 is a hyperparameter balancing the negative samples, f is the triple scoring function, and the expectation is f(τ) ≤ γ1 and f(τ') ≥ γ2.
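As a hedged illustration, the limit-based objective can be written in a few lines of numpy; the function name and the default values of γ1, γ2 and μ are assumptions made for the example, and the scores are distances produced by the triple scoring function (lower is better).

```python
import numpy as np

def limit_based_loss(pos_scores, neg_scores, gamma1=0.2, gamma2=2.0, mu=1.0):
    """Limit-based objective: penalize positive triples scoring above gamma1
    and negative triples scoring below gamma2 (scores are distances)."""
    pos_term = np.maximum(pos_scores - gamma1, 0.0).sum()  # [f(tau) - gamma1]+
    neg_term = np.maximum(gamma2 - neg_scores, 0.0).sum()  # [gamma2 - f(tau')]+
    return pos_term + mu * neg_term
```

Unlike a pure margin ranking loss, this form reaches zero only when every positive score falls below γ1 and every negative score rises above γ2, which is the direct control over absolute scores that the method aims for.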
Further, the entity attribute information learning step comprises: establishing a twin network to process the attribute information of the two knowledge graphs; performing string processing on attribute values of different data structures and establishing a character-level attribute value representation learning model to learn information from the entity attribute values; and constructing a self-attention mechanism to learn the importance of different entity attributes, with a shared attention mechanism proposed so that the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight.
Further, the attention weight is calculated as follows:

αi = softmax(Qᵀ · Wa · qi)

where αi is the weight of the i-th attribute, Q is the attribute-category matrix, qi is the category embedding of the i-th attribute, and Wa is a weight matrix.
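The shared attention mechanism can be sketched as follows, a hedged numpy reading of the weight formula in which each attribute is scored from its category embedding qi against a pooled query derived from Q (the mean-pooling is an assumption made to keep the example concrete), and the resulting weights are shared between the category and value representations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def shared_attention_repr(Q, V, W_a):
    """Q: (n, d) attribute-category embeddings; V: (n, d) attribute-value
    embeddings; W_a: (d, d) weight matrix. One attention weight per attribute,
    computed from the category embeddings only, then shared between the
    category and value representations."""
    query = Q.mean(axis=0)                       # pooled attribute-category query
    scores = np.array([query @ W_a @ q for q in Q])
    alpha = softmax(scores)                      # one weight per attribute
    v_type = alpha @ Q                           # v_type  = sum_i alpha_i * q_i
    v_value = alpha @ V                          # v_value = sum_i alpha_i * v_i
    return alpha, np.concatenate([v_type, v_value])
```

Sharing `alpha` between the two sums is the key point: an attribute judged important by its category contributes equally strongly through its value.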
Further, a contrastive loss computed from the Euclidean or cosine distance is finally used as the objective function; the twin-network loss over entity attributes takes the standard contrastive form

L = Σ_i [ Yi · Di² + (1 − Yi) · max(0, m − Di)² ]

where Yi is the alignment label of the i-th sample, x1 and x2 are entities of the two knowledge graphs with representation vectors f(x1) and f(x2) in the twin network, m is a margin hyperparameter, and the distance between the two entity vectors is

Di = ‖f(x1) − f(x2)‖2.

The attribute-category vector is v_type = Σ_i αi·qi, and the attribute-value vector is v_value = Σ_i αi·vi.
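A minimal numpy sketch of the contrastive loss for the twin network follows; the Euclidean-distance variant is shown, the margin value is an assumed hyperparameter, and the loss is averaged over pairs.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss over paired entity representations f1, f2 (n x d)
    with labels y (1 = aligned, 0 = not aligned), averaged over pairs."""
    d = np.linalg.norm(f1 - f2, axis=1)              # Euclidean distance per pair
    pos = y * d ** 2                                 # pull aligned pairs together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2  # push others past the margin
    return (pos + neg).mean()
```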
Further, before a best-matching entity set with higher confidence is added to the initial training set of step S11, it is checked whether a conflict would arise. Suppose two different candidates y and y' both match entity x in two iterations; the entity match with the higher alignment similarity is selected by comparing sim(x, y) with sim(x, y'): if sim(x, y) > sim(x, y'), entity y is more likely to match entity x; otherwise y' is more likely to match entity x.
A heterogeneous knowledge graph fusion system comprises:
a data preprocessing module, which organizes the data of the knowledge graphs to be matched and forms, from knowledge graphs independently generated from data of different sources and different structures, a knowledge graph training set comprising a structure training set and an entity attribute training set;
a deep learning module, which learns, via knowledge graph representation learning within a deep learning model, the knowledge graph structure information and knowledge graph attribute information required for matching entities, and learns entity representation vectors from them;
a similarity calculation module, which calculates the similarity between entities of the two domain knowledge graphs from the entity representation vectors, using a similarity measure based on Euclidean distance or cosine distance;
and an entity alignment module, which finds the best matches between entities according to the calculated entity similarity, the best-matching entities forming the entity matching sets, namely a structure-based entity matching set and an attribute-based entity matching set.
Further, the data preprocessing module is also configured to generate an initial training set from the pre-aligned graph datasets;
the deep learning module is configured to learn the embedded representation of the entity structure with a structure embedding learning model over a set number of iterations;
the entity alignment module is configured to align entities and add the best-matching entity set with higher confidence to the training set;
and the deep learning module is also configured to train the attribute embedding learning model with the supplemented training set over a set number of iterations, align entities, and further supplement the training set used to continue training the structure embedding model.
Further, the best-matching entity set added to the training set contains only positive samples, and the nearest-neighbor method is not used to sample negative samples.
Further, for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information;
the attribute information of an entity comprises its attribute categories and corresponding attribute values; for entity attribute learning, a twin (Siamese) neural network model based on a shared attention mechanism is adopted to learn attribute-based entity representations.
Further, the entity attribute information learning comprises: establishing a twin network to process the attribute information of the two knowledge graphs; performing string processing on attribute values of different data structures and establishing a character-level attribute value representation learning model to learn information from the entity attribute values; and establishing a self-attention mechanism to learn the importance of different entity attributes, with a shared attention mechanism proposed so that the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight. Further, a sample sampling module maintains both the unlabeled data and the labeled data, and in each iteration samples valuable examples from the unlabeled data to be labeled by a domain expert.
The invention has the following effects: the heterogeneous knowledge graph fusion method and system make full use of two kinds of information in the knowledge graph, entity structure and entity attributes. A structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin neural network model with a shared attention mechanism. In each iteration, the best matches found from the two kinds of information are labeled and added to the training set as new annotated data, so that the models for the two kinds of information assist each other and are iteratively enhanced, finally yielding a more accurate entity alignment result.
Drawings
FIG. 1 is a flow chart of a heterogeneous knowledge graph fusion method according to the present invention;
FIG. 2 is a block diagram of a heterogeneous knowledge-graph fusion system according to the present invention;
fig. 3 is a schematic diagram of an attribute representation learning method based on a shared attention mechanism in the heterogeneous knowledge-graph fusion system according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The first embodiment is as follows:
as shown in fig. 1, a heterogeneous knowledge graph fusion method includes the following steps:
and S1, arranging the knowledge maps to be matched, and forming the input of the method by using the knowledge maps independently generated by various data with different sources and different structures. Various incidence relations, namely structure information, exist among the entities, and meanwhile, the entities contain various attribute information, so that a knowledge graph structure training set and an entity attribute training set are generated.
And S2, entity representation learning, wherein the information required by entity matching is learned from the deep learning model through a knowledge graph representation learning technology. Learning is mainly divided into two categories of information: knowledge graph structure information and knowledge graph attribute information from which entity representation vectors are learned.
And S3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein a similarity calculation method based on Euclidean distance or Cosine distance is generally adopted.
And S4, aligning entity matching, and finding the best matching between the entities according to the calculated entity similarity. Since entity alignment is often a one-to-one match, this step finds the best matching set of all entities to be matched, i.e. the entity matching set based on the structure information and the entity matching set based on the attribute information, according to the similarity.
An entity alignment task usually requires a small labeled dataset annotated in advance, marking whether two entities to be matched from two different source domain knowledge graphs are the same entity with consistent semantics. The question is how to learn as much of the structural patterns and attribute information as possible from the limited annotated data, so as to find the remaining unaligned entities. Existing methods tend to map the entity representations obtained from structural and attribute information into the same space; if one kind of information in a domain knowledge graph is scarce or missing, learning of the other kind is seriously affected. The invention therefore adopts a semi-supervised co-training framework, establishes separate entity representation learning models for the relatively independent structural and attribute information, and lets the two kinds of information promote and iteratively enhance each other by supplementing the training set, improving the final entity alignment effect.
First, an initial training set is generated from the pre-aligned original graph datasets. Then the structure embedding learning model is run for a set number of iterations, learning the embedded entity representations and aligning entities, and the entity pairs with higher confidence are added to the training set. Next, the attribute embedding model is trained with the supplemented training set, likewise for a set number of iterations, aligning entities and further supplementing the training set used to continue training the structure embedding model. The two models are thus trained alternately and iteratively, with each model's results supplementing the training set, so that the two kinds of information complement and reinforce each other in entity alignment.
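The alternating training loop described above can be sketched as follows. A hedged sketch: the model interface (`fit` / `align`) and the confidence threshold are assumptions made for illustration, not the patent's API.

```python
def co_train(train_set, models, rounds=3, threshold=0.8):
    """Alternate over the embedding models (e.g. [structure_model,
    attribute_model]); each trains on the shared training set, aligns
    entities, and feeds its high-confidence matches back as new positive
    samples for the other model."""
    for _ in range(rounds):
        for model in models:
            model.fit(train_set)
            for pair, score in model.align():
                if score >= threshold and pair not in train_set:
                    train_set.append(pair)   # positive samples only
    return train_set
```

Each model thus sees a training set that grows with the other model's confident alignments, which is the mutual-promotion effect the co-training framework relies on.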
In step S2, entity representation learning learns the information required for entity matching from the model with machine learning techniques; it covers two types of information, knowledge graph structure information and attribute information, where:
for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information. The TransR model captures the diverse relations in the knowledge graph to represent its implicit structural semantics, eliminating the heterogeneity of symbols across knowledge graphs.
For entity attribute learning, the attribute information of an entity describes the entity itself, including its attribute categories and corresponding attribute values. Exploiting attribute information poses significant challenges: different domain knowledge graphs may use different attribute categories, the same attribute may carry different names, and attribute values may differ in data structure and granularity. The invention proposes a twin (Siamese) neural network model based on a shared attention mechanism to learn attribute-based entity representations. A twin network is established to process the attribute information of the two knowledge graphs; attribute values of different data structures are processed as strings, and a character-level attribute value representation learning model learns information from the entity attribute values. A self-attention mechanism learns the importance of different entity attributes, and a shared attention mechanism makes the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight, improving the accuracy of attribute-based entity representation learning.
This model effectively learns entity representation vectors for attribute categories and attribute values, uses attention weights to measure the importance of different attributes to the entity alignment task, and keeps the importance of an attribute category consistent with that of its attribute values during learning, thereby obtaining a more accurate entity attribute representation.
In step S3, entity similarity is calculated between the entities of the two domain knowledge graphs from the entity representations produced by the representation learning layer, using a Euclidean-distance or cosine-distance similarity measure applied to the corresponding entities of the knowledge graphs to be fused.
Entity matching in step S4 finds the best matches between entities according to the calculated entity similarity. Since entity alignment is usually a one-to-one match, the best-matching set of all entities to be matched, i.e. the structure-based entity matching set and the attribute-based entity matching set, is found here according to similarity. These entity matching sets serve as provisional alignment results and are added back into the training set to enlarge it, so that the structural information and the attribute information promote each other during iterative training and finally converge to a stable alignment result.
The matching results from the entity structure information and the attribute information are then integrated to obtain the final entity alignment result. During each entity alignment iteration, the labeled aligned entities are added to the original training set. Supplementing the training set with aligned entities is still a method of generating new samples from "seed entities", i.e. graph entities aligned in advance. In this process, the alignment results of different iterations may contradict each other; the supplemented training set is then edited by comparing the two contradictory results and keeping the aligned entity pair with the higher similarity. Because complete correctness of the supplementary training set cannot be guaranteed, it contains only positive samples, and the nearest-neighbor method is not used to sample negative samples. The computation over the domain knowledge graphs is repeated until their fusion is complete, enabling continuous evolution and expansion of the domain knowledge graph.
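The conflict-editing rule (keep the aligned entity pair with the higher similarity when two iterations disagree) can be sketched as follows; `sim` is assumed to be any callable returning the alignment similarity of a candidate pair.

```python
def resolve_conflicts(matches, sim):
    """matches: list of (x, y) candidate alignments, possibly containing
    several candidates for the same entity x. For each x, keep only the
    candidate with the highest alignment similarity sim(x, y)."""
    best = {}
    for x, y in matches:
        if x not in best or sim(x, y) > sim(x, best[x]):
            best[x] = y
    return best
```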
Example two
As shown in fig. 2, the heterogeneous knowledge graph fusion system of the invention comprises a representation learning layer, an entity similarity calculation layer, and an entity matching layer. The domain knowledge graphs to be fused serve as input; each contains knowledge entities, regarded as entities to be matched, with various association relations (structural information) among them, and each entity carries various attribute information.
Structural information for the knowledge-graph: given entity triplet<h,r,t>TransR requires an embedded representation of the head entity h and the tail entity t, by MrProjecting to the relation r to obtain hrAnd trI.e. hr=hMr,tr=tMrVector space should be close to target hr+r≈tr. Such embedded representation learning models aim to preserve structural information of knowledge model entities, i.e. entities sharing similar neighbor structures in the knowledge model should have similar representations in the embedding space. Thus, TransR minimizes a threshold-based objective function JTSR
J_TSR = Σ_{τ∈T+} Σ_{τ'∈T-} max(0, γ + f(τ) - f(τ'))    (1)

where τ denotes a positive sample triple (h, r, t), τ' denotes a negative sample generated from a positive sample, T+ and T- denote the positive and negative sample data sets, and f(τ) = ||h_r + r - t_r||^2 is the scoring function defined on the triple. The TransR model optimizes this margin-based loss function so that positive sample triples score lower than negative sample triples. However, such a loss function cannot ensure that the scores of the positive triples are absolutely low. For the entity alignment task, positive sample triples with lower absolute scores help reduce the "semantic drift" phenomenon of embedding into a single space and better capture the semantics of the different knowledge models. Therefore, based on a limit loss function, the method proposes a new objective function, denoted O_TSR:
O_TSR = Σ_{τ∈T+} max(0, f(τ) - γ1) + μ Σ_{τ'∈T-} max(0, γ2 - f(τ'))    (2)
γ1, γ2 are two hyperparameters limiting the scores of the positive and negative samples, and μ > 0 is a hyperparameter balancing the negative samples. The proposed objective function has two expectations: positive sample triples should score low while negative sample triples should score high, i.e. f(τ) ≤ γ1 and f(τ') ≥ γ2. In this way, the method can directly control the absolute scores of the positive and negative triples as needed, while still retaining the character of a margin-based ranking loss.
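The limit-based objective described above can be sketched in a few lines. The following is an illustrative numpy implementation, not code from the patent; the default values of γ1, γ2, μ are arbitrary placeholders.

```python
import numpy as np

def limit_based_loss(pos_scores, neg_scores, gamma1=1.0, gamma2=2.0, mu=0.5):
    """Limit-based objective: hinge positive triple scores below gamma1
    and negative triple scores above gamma2, with mu balancing the
    negative-sample term."""
    pos_term = np.maximum(0.0, pos_scores - gamma1).sum()
    neg_term = np.maximum(0.0, gamma2 - neg_scores).sum()
    return pos_term + mu * neg_term
```

When every positive scores below γ1 and every negative scores above γ2, the loss is exactly zero, which is the pair of expectations f(τ) ≤ γ1 and f(τ') ≥ γ2 stated above.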
As shown in fig. 2, for knowledge-graph attribute information: due to the fact that the knowledge model is complex and contains noisy attribute information, the invention provides an attribute representation learning technology based on a shared attention mechanism, a twin neural Network (Simese Network) is used for modeling entity embedded representations of two models, and an entity embedded vector of each model is obtained by splicing an attribute category vector and an attribute value vector. The method uses GRU (gated Current Unit) to encode attribute value character information, which can output an attribute value representation given an attribute value character vector.
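The character-level GRU encoding can be sketched with the standard GRU cell equations. This is an illustrative sketch, not the patent's implementation; the weight matrices and dimensions are assumed, and the final hidden state is taken as the attribute-value representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(char_vecs, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a single-layer GRU over a sequence of character vectors and
    return the final hidden state as the attribute-value representation."""
    h = np.zeros(Uz.shape[0])
    for x in char_vecs:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h
```

In practice a library GRU (e.g. a recurrent layer of a deep learning framework) would replace this loop; the sketch only shows how a character sequence collapses to one fixed-size vector.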
The attribute-based entity representation vector contains an attribute-category representation and an attribute-value representation. Given two entities x1 and x2, each entity has a set of attribute categories {q1, q2, ..., qM} and a corresponding set of attribute values {v1, v2, ..., vM}. An entity may have M attributes, each of different importance to the entity alignment process. Therefore, the method learns a character-level representation vector for each character and uses a self-attention mechanism to learn the weights of the different attributes, the entity attribute-category representation vector, and the attribute-value representation vector. At the same time, a shared attention mechanism is proposed so that the attribute category and attribute value of the same attribute share one attention weight, computed as follows:
α_i = softmax(Q^T W_a q_i)    (3)
where α_i is the weight of the i-th attribute, Q is the attribute-category matrix, q_i is the category of the i-th attribute, and W_a is a weight matrix. The attribute categories and the corresponding attribute values of the same entity should share the same attention; the attention-sharing mechanism learns the attribute-category attention from the attribute-category vectors and shares it with the corresponding attribute-value vectors. The attribute-category vector and the attribute-value vector are represented as follows:
v_type = Σ_i α_i q_i;  v_value = Σ_i α_i v_i    (4)
The attribute-based entity vector representation is obtained by concatenating the attribute-category representation and the attribute-value representation, i.e. the entity vector x = [v_type ; v_value].
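The shared-attention construction of equations (3)-(4) can be sketched as follows. This is an illustrative numpy sketch, not the patent's code; since the exact role of Q^T in equation (3) is ambiguous in the text, the mean category vector is used here as the query, which is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attribute_entity_vector(Q, V, Wa):
    """Shared attention: weights alpha are computed from the attribute-
    category vectors (rows of Q) and shared with the corresponding
    attribute-value vectors (rows of V); the entity vector concatenates
    the weighted category and value summaries."""
    q_bar = Q.mean(axis=0)                      # assumed stand-in for Q^T query
    scores = np.array([q_bar @ Wa @ q for q in Q])
    alpha = softmax(scores)                     # one weight per attribute
    v_type = alpha @ Q                          # sum_i alpha_i q_i
    v_value = alpha @ V                         # sum_i alpha_i v_i (shared alpha)
    return np.concatenate([v_type, v_value])
```

The key point the sketch shows is that a single weight vector `alpha` multiplies both the category matrix and the value matrix, which is the attention-sharing mechanism described above.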
The two domain knowledge models to be aligned use the same attribute learning model, i.e. a twin-network mechanism; finally, the Euclidean or Cosine distance is used to compute a contrastive loss (Contrastive Loss) as the objective function. For entities x1, x2 of the two knowledge models, the twin network computes the entity representation vectors x1 and x2.
The method defines the distance between the two entity vectors as d_i = ||x1 - x2||_2.
The "twin network" loss function of the entity attributes adopts a contrastive function:

L = Σ_i [ Y_i d_i^2 + (1 - Y_i) max(0, m - d_i)^2 ]    (5)

where m > 0 is the margin.
where Y_i is the alignment label of the i-th sample.
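The contrastive loss over a batch of entity-vector pairs can be sketched directly. This is an illustrative numpy version under the standard contrastive-loss form (aligned pairs pulled together, unaligned pairs pushed apart up to a margin); the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(x1, x2, y, margin=1.0):
    """Contrastive loss over batched entity-vector pairs: y=1 marks an
    aligned pair (distance penalized), y=0 an unaligned pair (penalized
    only when closer than the margin)."""
    d = np.linalg.norm(x1 - x2, axis=1)                      # Euclidean distance
    loss = y * d**2 + (1 - y) * np.maximum(0.0, margin - d)**2
    return loss.mean()
```

An aligned pair with identical vectors and an unaligned pair separated by more than the margin both contribute zero loss, which is the behavior the objective is built to reward.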
The aligned entity pairs are used as the training set of the model to learn the representations of the entity attributes and the attention weights, and the attribute-based entity vectors of the entity pairs to be matched are learned respectively, in preparation for computing the similarity between entities.
As shown in fig. 3, the detailed block diagram of the heterogeneous knowledge-graph fusion system according to the present invention includes a training set generation module, a learning module, an entity alignment module, and a training set supplement module. The training set generation module generates an initial training set from the data of the original knowledge graphs. The learning module trains the structure embedding learning model for a set number of iterations to learn the embedded representations of the entities. The entity alignment module aligns the entities, labels the entity alignment results, and supplements the entity pairs of higher confidence into the training set. The learning module then trains the attribute embedding learning model with the supplemented training set for a set number of iterations, aligns the entities, and further supplements the training set, which is in turn used to continue training the structure embedding learning model. In this way the two models are trained alternately and iteratively, each supplementing the training set with its results, so that the two kinds of information complement and mutually reinforce each other in entity alignment.
The training set generation module generates a knowledge-graph structure training set and an entity attribute training set from the original knowledge-graph data sets. The positive samples of the training set contain not only the triples of the two knowledge graphs; new triples are also generated from the pre-aligned "seed entity pairs" by an entity-swapping method, bridging the entities and attributes of the two knowledge graphs so that the model can learn a unified embedded representation space over both. Given the aligned entity pairs (x, y) ∈ P, the entity-swapping method generates a new training set:
T_swap = { (y, r, t) | (x, r, t) ∈ T1+ } ∪ { (h, r, y) | (h, r, x) ∈ T1+ } ∪ { (x, r, t) | (y, r, t) ∈ T2+ } ∪ { (h, r, x) | (h, r, y) ∈ T2+ }    (6)

where T1+ and T2+ denote the positive sample data sets of knowledge graphs G1 and G2. Overall, the total positive sample set generated is T+ = T1+ ∪ T2+ ∪ T_swap.
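The entity-swapping generation of new positive triples can be sketched as follows. This is an illustrative sketch of the swap idea, not the patent's code; the data structures (dicts of seed pairs, sets of string triples) are assumptions.

```python
def swap_entity_triples(seed_pairs, triples1, triples2):
    """Generate new positive triples by swapping aligned seed pairs
    (x, y): occurrences of x in KG1 triples are replaced by the
    counterpart y, and vice versa, bridging the two graphs into one
    embedding space."""
    p12 = dict(seed_pairs)                  # x -> y
    p21 = {y: x for x, y in seed_pairs}     # y -> x
    new = set()
    for h, r, t in triples1:
        if h in p12 or t in p12:
            new.add((p12.get(h, h), r, p12.get(t, t)))
    for h, r, t in triples2:
        if h in p21 or t in p21:
            new.add((p21.get(h, h), r, p21.get(t, t)))
    return new
```

For example, with the seed pair ("a", "A"), the KG1 triple (a, r, b) yields the bridging triple (A, r, b), and the KG2 triple (A, s, B) yields (a, s, B).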
The negative samples T- of the training set are generated using a nearest-neighbor negative sampling technique. This is because the collaborative learning framework supplements the labeled alignment results in an incremental manner, and the entity alignment results obtained from each iteration generally do not have high confidence. Erroneous information therefore easily enters the labeling process, misleads the training of the embedded representation model, and reduces the accuracy of subsequent entity alignment results; this phenomenon is the semantic drift problem. An effective negative sampling method is thus needed to improve the accuracy of entity representation learning. The traditional negative sampling method draws negative triples T- uniformly at random: given a triple (h, r, t) ∈ T+, it replaces h or t with an arbitrary entity. However, a sample produced in this way is relatively easy to distinguish from its positive sample. The method instead relies on the knowledge-model embedded representation model for entity alignment being able to distinguish two similar triples (a positive sample and a negative sample).
Given an entity x to be replaced, unlike random sampling from all entities, the method selects entities of higher similarity to x as the sampling candidate set. Specifically, it searches the neighbors of entity x using the Euclidean distance between embedding vectors and selects the nearest neighbor entities in the embedding space as the candidate set for negative sampling, where the size of the candidate set is ⌈sN⌉, s being a scale factor and N the number of entities in the knowledge model. This negative sampling scheme guarantees that entities of low similarity to x are not sampled; at the same time, the sampled negative samples retain features similar to the positive samples (such as attribute types and entity relations), which benefits the learning of the entity embedding vectors.
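The candidate-set construction for nearest-neighbor negative sampling can be sketched as follows. This is an illustrative numpy sketch; the interpretation of the candidate-set size as ⌈sN⌉ and the exhaustive distance computation (rather than an approximate nearest-neighbor index) are assumptions.

```python
import math
import numpy as np

def nearest_neighbor_candidates(x_id, embeddings, scale=0.1):
    """Select the ceil(scale * N) nearest neighbors of entity x in the
    embedding space (Euclidean distance) as the candidate pool for
    negative sampling; replacing x with a random candidate yields a
    'hard' negative that stays similar to the positive triple."""
    N = embeddings.shape[0]
    k = math.ceil(scale * N)
    dists = np.linalg.norm(embeddings - embeddings[x_id], axis=1)
    dists[x_id] = np.inf                 # exclude x itself
    return np.argsort(dists)[:k]
```

A negative triple is then formed by swapping x for one of the returned candidates, so the negative remains close to the positive in the embedding space.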
The entity alignment module computes, from the trained entity embedding vectors, the distance between every pair of entities of the two knowledge models to obtain their similarity; the distance may be the Euclidean or the Cosine distance. Under a threshold γ, an entity x ∈ G1 may correspond to many y ∈ G2. However, in this task entity alignment considers only the 1-to-1 case, so the constraints Σ_{x'∈X} φ_t(x', y) ≤ 1 and Σ_{y'∈Y} φ_t(x, y') ≤ 1 are required, where φ_t(·,·) indicates whether two entities are labeled as aligned in iteration t, 1 denoting aligned and 0 not aligned. With these two constraints, the problem is transformed into a maximum weighted bipartite matching problem, which is solved with a bipartite matching algorithm. Newly labeled aligned entities inevitably conflict across different iterations; to improve labeling quality and satisfy the alignment constraint, a labeled aligned entity may be relabeled or unlabeled in subsequent iterations. The project uses a simple but effective editing technique to achieve this:
before adding a newly marked alignment entity, it is checked whether a conflict results. Considering the case where an entity generates conflicting labels in different iterations, the method may wish to select an entity matching x that provides a higher degree of alignment similarity, assuming that there are two different candidate labels y and y' both matching entity x in the two iterations. Formally, the method computes the following likelihood differences:
Δ = sim(x, y) - sim(x, y')    (7)

If Δ ≥ 0, entity y is the better match for entity x; otherwise y' is the more appropriate match for x. This keeps the matching accuracy high and stable over the continuing iterations.
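The conflict-editing rule can be sketched in one small function. This is an illustrative sketch; the similarity function `sim` is a hypothetical callable standing in for whatever alignment-similarity measure (Euclidean or Cosine based) the system uses.

```python
def resolve_conflict(sim, x, y, y_new):
    """Conflict editing between iterations: entity x was previously
    labeled as aligned with y, and a later iteration proposes y_new.
    Keep whichever candidate has the higher alignment similarity."""
    delta = sim(x, y) - sim(x, y_new)
    return y if delta >= 0 else y_new
```

Applying this check before every insertion of a newly labeled pair keeps the training set consistent with the 1-to-1 alignment constraint.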
As can be seen from the above embodiments, the heterogeneous knowledge-graph fusion method and system disclosed by the present invention solve the problems of prior-art knowledge-graph embedding methods, in which single structural information has difficulty distinguishing different entities under the same concept and limited training data limits the accuracy of entity embedded representation learning. The invention provides a heterogeneous knowledge-graph fusion method fusing structural information and attribute information, making full use of the two kinds of information in a graph: the entity structure and the entity attributes. A structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin neural network model with a shared attention mechanism. The best matches found in each iteration over the two kinds of information are labeled and supplemented into the training set as new labeled data, so that the models of the two kinds of information assist each other and are iteratively enhanced, finally yielding an entity alignment result of higher accuracy.
The method and system of the present invention are not limited to the embodiments described in the detailed description, and those skilled in the art can derive other embodiments according to the technical solutions of the present invention, which also belong to the technical innovation scope of the present invention.

Claims (14)

1. A heterogeneous knowledge graph fusion method comprises the following steps:
s1, organizing the data of the knowledge graphs to be matched, and forming a knowledge graph training set from the knowledge graphs independently generated from various data of different sources and different structures, the knowledge graph training set comprising a structure training set and an entity attribute training set;
s2, learning, through a knowledge graph representation learning technique in a deep learning model, entity representation vectors from the knowledge graph structure information and the knowledge graph attribute information required for matching entities;
s3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein the similarity calculation method is based on Euclidean distance or Cosine distance;
and S4, finding the best match among the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
2. The heterogeneous knowledge-graph fusion method of claim 1, characterized in that step S1 is preceded by the steps of:
s11: generating an initial training set from the aligned graph data sets;
s12: training the structure embedding learning model for a set number of iterations to learn the embedded representations of the entity structures;
s13: aligning the entities and supplementing the best-matching entity set of higher reliability into the training set;
s14: training the attribute embedding learning model with the supplemented training set for a set number of iterations, aligning the entities and further supplementing the training set, which is used to continue training the structure embedding model.
3. The heterogeneous knowledge-graph fusion method of claim 2, characterized in that: step S4 further comprises supplementing the set of best-matching entities of higher reliability into the initial training set of step S11, expanding the initial training set so that the structure information and the attribute information promote each other in the iterative training.
4. The heterogeneous knowledge-graph fusion method of claim 3, wherein: the complementary set of best matching entities contains only positive samples and does not use the nearest neighbor approach to sample negative samples.
5. The heterogeneous knowledge-graph fusion method of claim 1, which is characterized in that: in step S2, learning the structure information, learning entity triples by using knowledge representation learning technology, and representing the entities of the domain knowledge graph as vectors containing the structure information;
the attribute information of the entity comprises attribute categories and corresponding attribute values of the entity, and the twin neural network model based on the shared attention mechanism is adopted for learning the attribute-based entity representation aiming at the attribute learning of the entity.
6. The heterogeneous knowledge-graph fusion method of claim 5, wherein: for the entity structure information learning, based on a limit loss function, the method adopts an objective function O_TSR:

O_TSR = Σ_{τ∈T+} max(0, f(τ) - γ1) + μ Σ_{τ'∈T-} max(0, γ2 - f(τ'))

wherein γ1, γ2 are two hyperparameters limiting the scores of the positive and negative samples, μ > 0 is a hyperparameter balancing the negative samples, and f(τ) ≤ γ1 and f(τ') ≥ γ2.
7. The heterogeneous knowledge-graph fusion method of claim 3, wherein: the entity attribute information learning comprises: in the process of processing the entity attribute information, establishing a twin network to process the attribute information of the two knowledge graphs; processing the character strings of attribute values of different data structures, establishing a character-level attribute-value representation learning model, and learning the information of the entity attribute values; and constructing a self-attention mechanism to learn the importance of different entity attributes, a shared attention mechanism being proposed so that the embedded representation of an attribute category shares the same attention weight with the embedded representation of the corresponding attribute value.
8. The heterogeneous knowledge-graph fusion method of claim 7, wherein: the attention weight is calculated as follows:

α_i = softmax(Q^T W_a q_i)

wherein α_i is the weight of the i-th attribute, Q is the attribute-category matrix, q_i is the category of the i-th attribute, and W_a is a weight matrix.
9. The heterogeneous knowledge-graph fusion method of claim 8, wherein: the Euclidean or Cosine distance is finally used to compute a contrastive loss as the objective function, and the "twin network" loss function of the entity attributes adopts a contrastive function:

L = Σ_i [ Y_i d_i^2 + (1 - Y_i) max(0, m - d_i)^2 ]

wherein Y_i is the alignment label of the i-th sample; for entities x1, x2 of the two knowledge models, the twin network computes entity representation vectors x1 and x2, the distance between the two entity vectors being d_i = ||x1 - x2||_2; the attribute-category vector is v_type = Σ_i α_i q_i and the attribute-value vector is v_value = Σ_i α_i v_i.
10. The heterogeneous knowledge-graph fusion method of claim 9, wherein: before the best-matching entity set of higher reliability is supplemented into the initial training set of step S11, it is checked whether a conflict is caused; assuming that two different candidate labels y and y' both match entity x in two iterations, the entity matching x with the higher alignment similarity is selected by computing:

Δ = sim(x, y) - sim(x, y')

If Δ ≥ 0, entity y is the better match for entity x; otherwise y' is the better match for entity x.
11. A heterogeneous knowledge-graph fusion system is characterized in that: the system comprises
The data preprocessing module is used for sorting the data of the knowledge graph to be matched and forming a knowledge graph training set comprising a structure training set and an entity attribute training set from the knowledge graphs independently generated by the data with different sources and different structures;
the deep learning module is used for learning, through a knowledge graph representation learning technique in a deep learning model, entity representation vectors from the knowledge graph structure information and the knowledge graph attribute information required for matching entities;
the similarity calculation module is used for calculating the similarity between the entities in the two domain knowledge maps according to the entity expression vector, and the similarity calculation method is based on Euclidean distance or Cosine distance;
and the entity alignment module is used for finding the best matching between the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
12. The heterogeneous knowledge-graph fusion system of claim 11, wherein: the data preprocessing module is further used for generating an initial training set from the aligned graph data sets;
the deep learning module is used for training the structure embedding learning model for a set number of iterations to learn the embedded representations of the entity structures;
the entity alignment module is used for aligning the entities and supplementing the best-matching entity set of higher reliability into the training set;
the deep learning module is further used for training the attribute embedding learning model with the supplemented training set for a set number of iterations, aligning the entities and further supplementing the training set, which is used to continue training the structure embedding model.
13. The heterogeneous knowledge-graph fusion system of claim 12 wherein: the set of best matching entities supplemented into the training set contains only positive samples and does not use the nearest neighbor method to sample negative samples.
14. The heterogeneous knowledge-graph fusion system of claim 11 wherein: aiming at the structural information learning, the entity triples are learned by using a knowledge representation learning technology, and the entities of the domain knowledge graph are represented as vectors containing structural information;
the attribute information of the entity comprises attribute categories and corresponding attribute values of the entity, and the twin neural network model based on the shared attention mechanism is adopted for learning the attribute-based entity representation aiming at the attribute learning of the entity.
CN202111202752.2A 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system Pending CN114090783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111202752.2A CN114090783A (en) 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system


Publications (1)

Publication Number Publication Date
CN114090783A true CN114090783A (en) 2022-02-25

Family

ID=80297057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111202752.2A Pending CN114090783A (en) 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system

Country Status (1)

Country Link
CN (1) CN114090783A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708540A (en) * 2022-04-24 2022-07-05 上海人工智能创新中心 Data processing method, terminal and storage medium
CN115099606A (en) * 2022-06-21 2022-09-23 厦门亿力吉奥信息科技有限公司 Training method and terminal for power grid dispatching model
CN115774788A (en) * 2022-11-21 2023-03-10 电子科技大学 Negative sampling method for knowledge graph embedded model
CN115905561A (en) * 2022-11-14 2023-04-04 华中农业大学 Body alignment method and device, electronic equipment and storage medium
CN115982374A (en) * 2022-12-02 2023-04-18 河海大学 Dam emergency response knowledge base linkage multi-view learning entity alignment method and system
CN116069956A (en) * 2023-03-29 2023-05-05 之江实验室 Drug knowledge graph entity alignment method and device based on mixed attention mechanism
CN116384494A (en) * 2023-06-05 2023-07-04 安徽思高智能科技有限公司 RPA flow recommendation method and system based on multi-modal twin neural network
CN116628247A (en) * 2023-07-24 2023-08-22 北京数慧时空信息技术有限公司 Image recommendation method based on reinforcement learning and knowledge graph

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472065A (en) * 2019-07-25 2019-11-19 电子科技大学 Across linguistry map entity alignment schemes based on the twin network of GCN
CN110941722A (en) * 2019-10-12 2020-03-31 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KAI YANG ET AL.: "COTSAE: CO-Training of Structure and Attribute Embeddings for Entity Alignment", The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 3 April 2020, pages 3026-3029 *
ZEQUN SUN ET AL.: "Bootstrapping Entity Alignment with Knowledge Graph Embedding", Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), 31 December 2018, pages 4396-4399 *
ZHU Jizhao; QIAO Jianzhong; LIN Shukuan: "Entity alignment algorithm for representation learning of knowledge graphs", Journal of Northeastern University (Natural Science), no. 11, 15 November 2018 *
DU Wenqian; LI Bicheng; WANG Rui: "Knowledge graph representation learning method fusing entity descriptions and types", Journal of Chinese Information Processing, no. 07, 15 July 2020 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination