CN114090783A - Heterogeneous knowledge graph fusion method and system - Google Patents


Info

Publication number
CN114090783A
Authority
CN
China
Prior art keywords
entity
knowledge
attribute
learning
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111202752.2A
Other languages
Chinese (zh)
Inventor
杨恺
王亚沙
赵俊峰
单中原
邹佩聂
李瑞庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202111202752.2A priority Critical patent/CN114090783A/en
Publication of CN114090783A publication Critical patent/CN114090783A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a heterogeneous knowledge graph fusion method and system. It addresses two shortcomings of the prior art: structural information alone is often insufficient to distinguish different entities under the same concept, and limited training data constrains the accuracy of entity embedding learning in knowledge-graph-embedding-based methods. The proposed method fuses structural information with attribute information and makes full use of both kinds of information in a graph, entity structure and entity attributes: a structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin (Siamese) neural network model with a shared attention mechanism. In each iteration, the best matches found from the two kinds of information are labeled and added to the training set as new annotated data, so that the models for the two kinds of information assist each other and are iteratively enhanced, finally yielding a more accurate entity alignment result.

Description

Heterogeneous knowledge graph fusion method and system
Technical Field
The invention belongs to the technical field of knowledge graphs, and particularly relates to a heterogeneous knowledge graph fusion method and system.
Background
In recent years, knowledge graphs have played an increasingly important role in technologies such as information retrieval, recommendation systems, and machine reading comprehension. Different organizations build knowledge graphs according to their own needs and data sources, so graphs in the same domain can take different forms. For example, the entity "/resource/health medical health center" in the Weihai smart-city knowledge graph and the entity "health medical institution" in the Weihai medical knowledge graph both refer to a medical institution in Weihai, yet knowledge graphs built from different data create two distinct instance nodes. Many domain knowledge graphs are generated and evolve independently; although their entity naming conventions, representation forms, and entity relations may differ, the content they represent is consistent and complementary. Integrating different knowledge graphs into a larger, uniform, and consistent body of knowledge is therefore important for knowledge reasoning and question answering. To fuse different knowledge graphs, one of the primary tasks is to identify entities in different graphs that denote the same real-world object, commonly referred to as the entity alignment problem. Entity alignment matches entities across knowledge graphs from different sources and is the first step of knowledge graph fusion.
Current entity-alignment-based knowledge graph fusion techniques generally fall into the following categories:
the first category aligns entities mainly by computing the similarity of entity names or attribute names. For example, the patent with publication number CN113032516A (published 20210625) discloses a "knowledge graph fusion method based on approximate ontology matching", which computes entity similarity with an approximate ontology matching method and obtains the final entity alignment result by propagating similarity over the graph. When entity names differ greatly between the two graphs, this kind of method struggles to achieve an accurate alignment effect.
The second category learns semantic information for each entity with representation learning over the knowledge graph, computes the similarity of entity representations between the two graphs, and completes alignment by finding the most similar entity pairs. The most typical approach uses a knowledge graph embedding technique such as TransE to compute a vector representation of each entity in each graph, and uses pre-aligned "seed entities" to learn a transfer matrix between the graphs, providing a cross-lingual space transformation for the embedded representation of each entity. For example, the patent with publication number CN111191471A (published 20200522) discloses a "knowledge graph fusion method based on entity sequence coding", which learns entity representation vectors with the RotatE model and applies a bootstrapping semi-supervised learning method to supplement the "seed entity" set in each alignment iteration, enlarging the model training set to overcome seed-entity sparsity in entity alignment and obtain a better alignment result.
The third category improves the entity representation learning model itself to better suit entity alignment. For example, the patent with publication number CN110941722A (published 20200331) discloses "a knowledge graph fusion method based on entity alignment", which learns entity representation vectors with a GCN model and incorporates multi-hop neighbor information and paths, so that the alignment process takes structural information with a wider receptive range into account.
In general, these methods complete the entity alignment task with structural information alone and are mostly applied to cross-lingual entity alignment. Within a single-language knowledge graph, different entities under the same concept are hard to distinguish with structural information alone: such entities often share the same relations and may even connect to the same entities. Entities with similar structural information can, however, be distinguished by their attribute information, which in a knowledge graph comprises attribute categories and concrete attribute values. Existing entity alignment work based on attribute similarity does not scale to the large, complex knowledge graphs found in practice, and different attribute categories also differ in how important they are to entity alignment. In summary, for entity alignment based on attribute and structural information, many problems remain in how entity attribute information is exploited.
When aligning heterogeneous knowledge graph entities, existing methods suffer from several problems: structural information alone cannot distinguish different entities under the same concept; methods that consider only attribute categories and ignore attribute values are limited; and limited training data constrains the accuracy of entity embedding learning in knowledge-graph-embedding-based methods.
Disclosure of Invention
To address the defects of the prior art, the invention provides a heterogeneous knowledge graph entity alignment method that fuses structural and attribute information: the two kinds of information are exploited fully and learned collaboratively with mutual gain, entities with the same semantics are found across knowledge graphs of two domains, and the accuracy of heterogeneous knowledge graph entity alignment is improved.
To achieve the above purposes, the invention adopts the following technical scheme:
a heterogeneous knowledge graph fusion method comprises the following steps:
s1, arranging the data of the knowledge graph to be matched, and forming a knowledge graph training set by the knowledge graph independently generated by various data with different sources and different structures, wherein the knowledge graph training set comprises a structure training set and an entity attribute training set;
s2, learning entity expression vectors from knowledge map structure information and knowledge map attribute information required by matching learning entities in the deep learning model through a knowledge map expression learning technology;
s3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein the similarity calculation method is based on Euclidean distance or Cosine distance;
and S4, finding the best match among the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
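Steps S3 and S4 can be sketched as follows. This is a minimal numpy sketch under stated assumptions: entity representation vectors are stored as row matrices, cosine similarity is used, and the one-to-one matching of S4 is approximated with a mutual-nearest-neighbor rule (an illustrative choice, not mandated by the method).

```python
import numpy as np

def cosine_similarity(A, B):
    """Pairwise cosine similarity between rows of A (n x d) and B (m x d)."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

def best_matches(A, B):
    """One-to-one best matches taken as mutual nearest neighbors
    under cosine similarity."""
    S = cosine_similarity(A, B)
    a_to_b = S.argmax(axis=1)   # best candidate in B for each entity in A
    b_to_a = S.argmax(axis=0)   # best candidate in A for each entity in B
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]
```

The same routine would be run twice, once on the structure-based vectors and once on the attribute-based vectors, yielding the two entity matching sets.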
Further, step S1 is preceded by the following steps:
S11: generating an initial training set from the pre-aligned graph datasets;
S12: learning the embedded representation of the entity structure with a structure embedding learning model over a set number of iterations;
S13: aligning entities and adding the best-matching entity set with higher confidence to the training set;
S14: training the attribute embedding learning model with the supplemented training set over a set number of iterations, aligning entities, and further supplementing the training set used to continue training the structure embedding model.
Further, step S4 also comprises adding the best-matching entity set with higher confidence to the initial training set of step S11, enlarging that training set so that the structural information and the attribute information promote each other during iterative training.
Further, the supplementary best-matching entity set contains only positive samples, and the nearest-neighbor method is not used to sample negative samples.
Further, in step S2, for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information;
the attribute information of an entity comprises its attribute categories and corresponding attribute values; for entity attribute learning, a twin (Siamese) neural network model based on a shared attention mechanism is adopted to learn attribute-based entity representations.
Further, for entity structure information learning, based on a limit loss function, the method adopts the objective function O_TSR:

O_TSR = Σ_{τ∈T+} [f(τ) − γ1]+ + μ · Σ_{τ'∈T−} [γ2 − f(τ')]+

where [x]+ = max(x, 0), γ1 and γ2 are two hyperparameters limiting the scores of the positive and negative samples, μ > 0 is a hyperparameter balancing the negative samples, f is the triple scoring function, and the expectation is f(τ) ≤ γ1 and f(τ') ≥ γ2.
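As a hedged illustration, the limit-based objective can be written in a few lines of numpy; the function name and the default values of γ1, γ2 and μ are assumptions made for the example, and the scores are distances produced by the triple scoring function (lower is better).

```python
import numpy as np

def limit_based_loss(pos_scores, neg_scores, gamma1=0.2, gamma2=2.0, mu=1.0):
    """Limit-based objective: penalize positive triples scoring above gamma1
    and negative triples scoring below gamma2 (scores are distances)."""
    pos_term = np.maximum(pos_scores - gamma1, 0.0).sum()  # [f(tau) - gamma1]+
    neg_term = np.maximum(gamma2 - neg_scores, 0.0).sum()  # [gamma2 - f(tau')]+
    return pos_term + mu * neg_term
```

Unlike a pure margin ranking loss, this form reaches zero only when every positive score falls below γ1 and every negative score rises above γ2, which is the direct control over absolute scores that the method aims for.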
Further, the entity attribute information learning step comprises: establishing a twin network to process the attribute information of the two knowledge graphs; performing string processing on attribute values of different data structures and establishing a character-level attribute value representation learning model to learn information from the entity attribute values; and constructing a self-attention mechanism to learn the importance of different entity attributes, with a shared attention mechanism proposed so that the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight.
Further, the attention weight is calculated as follows:

αi = softmax(Qᵀ · Wa · qi)

where αi is the weight of the i-th attribute, Q is the attribute-category matrix, qi is the category embedding of the i-th attribute, and Wa is a weight matrix.
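The shared attention mechanism can be sketched as follows, a hedged numpy reading of the weight formula in which each attribute is scored from its category embedding qi against a pooled query derived from Q (the mean-pooling is an assumption made to keep the example concrete), and the resulting weights are shared between the category and value representations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def shared_attention_repr(Q, V, W_a):
    """Q: (n, d) attribute-category embeddings; V: (n, d) attribute-value
    embeddings; W_a: (d, d) weight matrix. One attention weight per attribute,
    computed from the category embeddings only, then shared between the
    category and value representations."""
    query = Q.mean(axis=0)                       # pooled attribute-category query
    scores = np.array([query @ W_a @ q for q in Q])
    alpha = softmax(scores)                      # one weight per attribute
    v_type = alpha @ Q                           # v_type  = sum_i alpha_i * q_i
    v_value = alpha @ V                          # v_value = sum_i alpha_i * v_i
    return alpha, np.concatenate([v_type, v_value])
```

Sharing `alpha` between the two sums is the key point: an attribute judged important by its category contributes equally strongly through its value.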
Further, a contrastive loss computed from the Euclidean or cosine distance is finally used as the objective function; the twin-network loss over entity attributes takes the standard contrastive form

L = Σ_i [ Yi · Di² + (1 − Yi) · max(0, m − Di)² ]

where Yi is the alignment label of the i-th sample, x1 and x2 are entities of the two knowledge graphs with representation vectors f(x1) and f(x2) in the twin network, m is a margin hyperparameter, and the distance between the two entity vectors is

Di = ‖f(x1) − f(x2)‖2.

The attribute-category vector is v_type = Σ_i αi·qi, and the attribute-value vector is v_value = Σ_i αi·vi.
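A minimal numpy sketch of the contrastive loss for the twin network follows; the Euclidean-distance variant is shown, the margin value is an assumed hyperparameter, and the loss is averaged over pairs.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss over paired entity representations f1, f2 (n x d)
    with labels y (1 = aligned, 0 = not aligned), averaged over pairs."""
    d = np.linalg.norm(f1 - f2, axis=1)              # Euclidean distance per pair
    pos = y * d ** 2                                 # pull aligned pairs together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2  # push others past the margin
    return (pos + neg).mean()
```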
Further, before a best-matching entity set with higher confidence is added to the initial training set of step S11, it is checked whether a conflict would arise. Suppose two different candidates y and y' both match entity x in two iterations; the entity match with the higher alignment similarity is selected by comparing sim(x, y) with sim(x, y'): if sim(x, y) > sim(x, y'), entity y is more likely to match entity x; otherwise y' is more likely to match entity x.
A heterogeneous knowledge graph fusion system comprises:
a data preprocessing module, which organizes the data of the knowledge graphs to be matched and forms, from knowledge graphs independently generated from data of different sources and different structures, a knowledge graph training set comprising a structure training set and an entity attribute training set;
a deep learning module, which learns, via knowledge graph representation learning within a deep learning model, the knowledge graph structure information and knowledge graph attribute information required for matching entities, and learns entity representation vectors from them;
a similarity calculation module, which calculates the similarity between entities of the two domain knowledge graphs from the entity representation vectors, using a similarity measure based on Euclidean distance or cosine distance;
and an entity alignment module, which finds the best matches between entities according to the calculated entity similarity, the best-matching entities forming the entity matching sets, namely a structure-based entity matching set and an attribute-based entity matching set.
Further, the data preprocessing module is also configured to generate an initial training set from the pre-aligned graph datasets;
the deep learning module is configured to learn the embedded representation of the entity structure with a structure embedding learning model over a set number of iterations;
the entity alignment module is configured to align entities and add the best-matching entity set with higher confidence to the training set;
and the deep learning module is also configured to train the attribute embedding learning model with the supplemented training set over a set number of iterations, align entities, and further supplement the training set used to continue training the structure embedding model.
Further, the best-matching entity set added to the training set contains only positive samples, and the nearest-neighbor method is not used to sample negative samples.
Further, for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information;
the attribute information of an entity comprises its attribute categories and corresponding attribute values; for entity attribute learning, a twin (Siamese) neural network model based on a shared attention mechanism is adopted to learn attribute-based entity representations.
Further, the entity attribute information learning comprises: establishing a twin network to process the attribute information of the two knowledge graphs; performing string processing on attribute values of different data structures and establishing a character-level attribute value representation learning model to learn information from the entity attribute values; and establishing a self-attention mechanism to learn the importance of different entity attributes, with a shared attention mechanism proposed so that the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight. Further, a sample sampling module maintains both the unlabeled data and the labeled data, and in each iteration samples valuable examples from the unlabeled data to be labeled by a domain expert.
The invention has the following effects: the heterogeneous knowledge graph fusion method and system make full use of two kinds of information in the knowledge graph, entity structure and entity attributes. A structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin neural network model with a shared attention mechanism. In each iteration, the best matches found from the two kinds of information are labeled and added to the training set as new annotated data, so that the models for the two kinds of information assist each other and are iteratively enhanced, finally yielding a more accurate entity alignment result.
Drawings
FIG. 1 is a flow chart of a heterogeneous knowledge graph fusion method according to the present invention;
FIG. 2 is a block diagram of a heterogeneous knowledge-graph fusion system according to the present invention;
fig. 3 is a schematic diagram of an attribute representation learning method based on a shared attention mechanism in the heterogeneous knowledge-graph fusion system according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The first embodiment is as follows:
as shown in fig. 1, a heterogeneous knowledge graph fusion method includes the following steps:
and S1, arranging the knowledge maps to be matched, and forming the input of the method by using the knowledge maps independently generated by various data with different sources and different structures. Various incidence relations, namely structure information, exist among the entities, and meanwhile, the entities contain various attribute information, so that a knowledge graph structure training set and an entity attribute training set are generated.
And S2, entity representation learning, wherein the information required by entity matching is learned from the deep learning model through a knowledge graph representation learning technology. Learning is mainly divided into two categories of information: knowledge graph structure information and knowledge graph attribute information from which entity representation vectors are learned.
And S3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein a similarity calculation method based on Euclidean distance or Cosine distance is generally adopted.
And S4, aligning entity matching, and finding the best matching between the entities according to the calculated entity similarity. Since entity alignment is often a one-to-one match, this step finds the best matching set of all entities to be matched, i.e. the entity matching set based on the structure information and the entity matching set based on the attribute information, according to the similarity.
An entity alignment task usually requires a small labeled dataset annotated in advance, marking whether two entities to be matched from two different source domain knowledge graphs are the same entity with consistent semantics. The question is how to learn as much of the structural patterns and attribute information as possible from the limited annotated data, so as to find the remaining unaligned entities. Existing methods tend to map the entity representations obtained from structural and attribute information into the same space; if one kind of information in a domain knowledge graph is scarce or missing, learning of the other kind is seriously affected. The invention therefore adopts a semi-supervised co-training framework, establishes separate entity representation learning models for the relatively independent structural and attribute information, and lets the two kinds of information promote and iteratively enhance each other by supplementing the training set, improving the final entity alignment effect.
First, an initial training set is generated from the pre-aligned original graph datasets. Then the structure embedding learning model is run for a set number of iterations, learning the embedded entity representations and aligning entities, and the entity pairs with higher confidence are added to the training set. Next, the attribute embedding model is trained with the supplemented training set, likewise for a set number of iterations, aligning entities and further supplementing the training set used to continue training the structure embedding model. The two models are thus trained alternately and iteratively, with each model's results supplementing the training set, so that the two kinds of information complement and reinforce each other in entity alignment.
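The alternating training loop described above can be sketched as follows. A hedged sketch: the model interface (`fit` / `align`) and the confidence threshold are assumptions made for illustration, not the patent's API.

```python
def co_train(train_set, models, rounds=3, threshold=0.8):
    """Alternate over the embedding models (e.g. [structure_model,
    attribute_model]); each trains on the shared training set, aligns
    entities, and feeds its high-confidence matches back as new positive
    samples for the other model."""
    for _ in range(rounds):
        for model in models:
            model.fit(train_set)
            for pair, score in model.align():
                if score >= threshold and pair not in train_set:
                    train_set.append(pair)   # positive samples only
    return train_set
```

Each model thus sees a training set that grows with the other model's confident alignments, which is the mutual-promotion effect the co-training framework relies on.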
In step S2, entity representation learning learns the information required for entity matching from the model with machine learning techniques; it covers two types of information, knowledge graph structure information and attribute information, where:
for structural information learning, entity triples are learned with a knowledge representation learning technique (such as the TransR model), representing the entities of the domain knowledge graph as vectors containing structural information. The TransR model captures the diverse relations in the knowledge graph to represent its implicit structural semantics, eliminating the heterogeneity of symbols across knowledge graphs.
For entity attribute learning, the attribute information of an entity describes the entity itself, including its attribute categories and corresponding attribute values. Exploiting attribute information poses significant challenges: different domain knowledge graphs may use different attribute categories, the same attribute may carry different names, and attribute values may differ in data structure and granularity. The invention proposes a twin (Siamese) neural network model based on a shared attention mechanism to learn attribute-based entity representations. A twin network is established to process the attribute information of the two knowledge graphs; attribute values of different data structures are processed as strings, and a character-level attribute value representation learning model learns information from the entity attribute values. A self-attention mechanism learns the importance of different entity attributes, and a shared attention mechanism makes the embedded representation of an attribute category and the embedded representation of the corresponding attribute value share the same attention weight, improving the accuracy of attribute-based entity representation learning.
This model effectively learns entity representation vectors for attribute categories and attribute values, uses attention weights to measure the importance of different attributes to the entity alignment task, and keeps the importance of an attribute category consistent with that of its attribute values during learning, thereby obtaining a more accurate entity attribute representation.
In step S3, entity similarity is calculated between the entities of the two domain knowledge graphs from the entity representations produced by the representation learning layer, using a Euclidean-distance or cosine-distance similarity measure applied to the corresponding entities of the knowledge graphs to be fused.
Entity matching in step S4 finds the best matches between entities according to the calculated entity similarity. Since entity alignment is usually a one-to-one match, the best-matching set of all entities to be matched, i.e. the structure-based entity matching set and the attribute-based entity matching set, is found here according to similarity. These entity matching sets serve as provisional alignment results and are added back into the training set to enlarge it, so that the structural information and the attribute information promote each other during iterative training and finally converge to a stable alignment result.
The matching results from the entity structure information and the attribute information are then integrated to obtain the final entity alignment result. During each entity alignment iteration, the labeled aligned entities are added to the original training set. Supplementing the training set with aligned entities is still a method of generating new samples from "seed entities", i.e. graph entities aligned in advance. In this process, the alignment results of different iterations may contradict each other; the supplemented training set is then edited by comparing the two contradictory results and keeping the aligned entity pair with the higher similarity. Because complete correctness of the supplementary training set cannot be guaranteed, it contains only positive samples, and the nearest-neighbor method is not used to sample negative samples. The computation over the domain knowledge graphs is repeated until their fusion is complete, enabling continuous evolution and expansion of the domain knowledge graph.
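The conflict-editing rule (keep the aligned entity pair with the higher similarity when two iterations disagree) can be sketched as follows; `sim` is assumed to be any callable returning the alignment similarity of a candidate pair.

```python
def resolve_conflicts(matches, sim):
    """matches: list of (x, y) candidate alignments, possibly containing
    several candidates for the same entity x. For each x, keep only the
    candidate with the highest alignment similarity sim(x, y)."""
    best = {}
    for x, y in matches:
        if x not in best or sim(x, y) > sim(x, best[x]):
            best[x] = y
    return best
```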
Example two
As shown in fig. 2, the heterogeneous knowledge graph fusion system of the invention comprises a representation learning layer, an entity similarity calculation layer, and an entity matching layer. The domain knowledge graphs to be fused serve as input; each contains knowledge entities, regarded as entities to be matched, with various association relations (structural information) among them, and each entity carries various attribute information.
Structural information for the knowledge-graph: given entity triplet<h,r,t>TransR requires an embedded representation of the head entity h and the tail entity t, by MrProjecting to the relation r to obtain hrAnd trI.e. hr=hMr,tr=tMrVector space should be close to target hr+r≈tr. Such embedded representation learning models aim to preserve structural information of knowledge model entities, i.e. entities sharing similar neighbor structures in the knowledge model should have similar representations in the embedding space. Thus, TransR minimizes a threshold-based objective function JTSR
J_TSR = Σ_{τ∈T+} Σ_{τ'∈T-} max(0, γ + f(τ) - f(τ'))    (1)

where τ denotes a positive sample triple (h, r, t), τ' denotes a negative sample generated from a positive sample, T+ and T- denote the positive and negative sample data sets, and f(τ) = ||h_r + r - t_r||^2 is the scoring function defined on the triple. The TransR model optimizes this margin-based loss function so that positive sample triples score lower than negative sample triples. However, such a loss function cannot ensure that the scores of the positive triples are absolutely low. For the entity alignment task, positive sample triples with lower absolute scores help reduce the "semantic drift" phenomenon of embedding into a single space and better capture the semantics of the different knowledge models. Therefore, based on a limit loss function, the method proposes a new objective function, denoted O_TSR:
O_TSR = Σ_{τ∈T+} max(0, f(τ) - γ1) + μ Σ_{τ'∈T-} max(0, γ2 - f(τ'))    (2)
γ1, γ2 are two hyperparameters limiting the scores of the positive and negative samples, and μ > 0 is a hyperparameter balancing the negative samples. The proposed objective function has two expectations: positive sample triples should score low while negative sample triples should score high, i.e. f(τ) ≤ γ1 and f(τ') ≥ γ2. In this way, the method can directly control the absolute scores of the positive and negative triples as needed, while still retaining the character of a margin-based ranking loss.
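The limit-based objective described above can be sketched in a few lines. The following is an illustrative numpy implementation, not code from the patent; the default values of γ1, γ2, μ are arbitrary placeholders.

```python
import numpy as np

def limit_based_loss(pos_scores, neg_scores, gamma1=1.0, gamma2=2.0, mu=0.5):
    """Limit-based objective: hinge positive triple scores below gamma1
    and negative triple scores above gamma2, with mu balancing the
    negative-sample term."""
    pos_term = np.maximum(0.0, pos_scores - gamma1).sum()
    neg_term = np.maximum(0.0, gamma2 - neg_scores).sum()
    return pos_term + mu * neg_term
```

When every positive scores below γ1 and every negative scores above γ2, the loss is exactly zero, which is the pair of expectations f(τ) ≤ γ1 and f(τ') ≥ γ2 stated above.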
As shown in fig. 2, for knowledge-graph attribute information: due to the fact that the knowledge model is complex and contains noisy attribute information, the invention provides an attribute representation learning technology based on a shared attention mechanism, a twin neural Network (Simese Network) is used for modeling entity embedded representations of two models, and an entity embedded vector of each model is obtained by splicing an attribute category vector and an attribute value vector. The method uses GRU (gated Current Unit) to encode attribute value character information, which can output an attribute value representation given an attribute value character vector.
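The character-level GRU encoding can be sketched with the standard GRU cell equations. This is an illustrative sketch, not the patent's implementation; the weight matrices and dimensions are assumed, and the final hidden state is taken as the attribute-value representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(char_vecs, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a single-layer GRU over a sequence of character vectors and
    return the final hidden state as the attribute-value representation."""
    h = np.zeros(Uz.shape[0])
    for x in char_vecs:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h
```

In practice a library GRU (e.g. a recurrent layer of a deep learning framework) would replace this loop; the sketch only shows how a character sequence collapses to one fixed-size vector.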
The attribute-based entity representation vector contains an attribute-category representation and an attribute-value representation. Given two entities x1 and x2, each entity has a set of attribute categories {q1, q2, ..., qM} and a corresponding set of attribute values {v1, v2, ..., vM}. An entity may have M attributes, each of different importance to the entity alignment process. Therefore, the method learns a character-level representation vector for each character and uses a self-attention mechanism to learn the weights of the different attributes, the entity attribute-category representation vector, and the attribute-value representation vector. At the same time, a shared attention mechanism is proposed so that the attribute category and attribute value of the same attribute share one attention weight, computed as follows:
α_i = softmax(Q^T W_a q_i)    (3)
where α_i is the weight of the i-th attribute, Q is the attribute-category matrix, q_i is the category of the i-th attribute, and W_a is a weight matrix. The attribute categories and the corresponding attribute values of the same entity should share the same attention; the attention-sharing mechanism learns the attribute-category attention from the attribute-category vectors and shares it with the corresponding attribute-value vectors. The attribute-category vector and the attribute-value vector are represented as follows:
v_type = Σ_i α_i q_i;  v_value = Σ_i α_i v_i    (4)
The attribute-based entity vector representation is obtained by concatenating the attribute-category representation and the attribute-value representation, i.e. the entity vector x = [v_type ; v_value].
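The shared-attention construction of equations (3)-(4) can be sketched as follows. This is an illustrative numpy sketch, not the patent's code; since the exact role of Q^T in equation (3) is ambiguous in the text, the mean category vector is used here as the query, which is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attribute_entity_vector(Q, V, Wa):
    """Shared attention: weights alpha are computed from the attribute-
    category vectors (rows of Q) and shared with the corresponding
    attribute-value vectors (rows of V); the entity vector concatenates
    the weighted category and value summaries."""
    q_bar = Q.mean(axis=0)                      # assumed stand-in for Q^T query
    scores = np.array([q_bar @ Wa @ q for q in Q])
    alpha = softmax(scores)                     # one weight per attribute
    v_type = alpha @ Q                          # sum_i alpha_i q_i
    v_value = alpha @ V                         # sum_i alpha_i v_i (shared alpha)
    return np.concatenate([v_type, v_value])
```

The key point the sketch shows is that a single weight vector `alpha` multiplies both the category matrix and the value matrix, which is the attention-sharing mechanism described above.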
The two domain knowledge models to be aligned use the same attribute learning model, i.e. a twin-network mechanism; finally, the Euclidean or Cosine distance is used to compute a contrastive loss (Contrastive Loss) as the objective function. For entities x1, x2 of the two knowledge models, the twin network computes the entity representation vectors x1 and x2.
The method defines the distance between the two entity vectors as d_i = ||x1 - x2||_2.
The "twin network" loss function of the entity attributes adopts a contrastive function:

L = Σ_i [ Y_i d_i^2 + (1 - Y_i) max(0, m - d_i)^2 ]    (5)

where m > 0 is the margin.
where Y_i is the alignment label of the i-th sample.
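The contrastive loss over a batch of entity-vector pairs can be sketched directly. This is an illustrative numpy version under the standard contrastive-loss form (aligned pairs pulled together, unaligned pairs pushed apart up to a margin); the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(x1, x2, y, margin=1.0):
    """Contrastive loss over batched entity-vector pairs: y=1 marks an
    aligned pair (distance penalized), y=0 an unaligned pair (penalized
    only when closer than the margin)."""
    d = np.linalg.norm(x1 - x2, axis=1)                      # Euclidean distance
    loss = y * d**2 + (1 - y) * np.maximum(0.0, margin - d)**2
    return loss.mean()
```

An aligned pair with identical vectors and an unaligned pair separated by more than the margin both contribute zero loss, which is the behavior the objective is built to reward.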
The aligned entity pairs are used as the training set of the model to learn the representations of the entity attributes and the attention weights, and the attribute-based entity vectors of the entity pairs to be matched are learned respectively, in preparation for computing the similarity between entities.
As shown in fig. 3, the detailed block diagram of the heterogeneous knowledge-graph fusion system according to the present invention includes a training set generation module, a learning module, an entity alignment module, and a training set supplement module. The training set generation module generates an initial training set from the data of the original knowledge graphs. The learning module trains the structure embedding learning model for a set number of iterations to learn the embedded representations of the entities. The entity alignment module aligns the entities, labels the entity alignment results, and supplements the entity pairs of higher confidence into the training set. The learning module then trains the attribute embedding learning model with the supplemented training set for a set number of iterations, aligns the entities, and further supplements the training set, which is in turn used to continue training the structure embedding learning model. In this way the two models are trained alternately and iteratively, each supplementing the training set with its results, so that the two kinds of information complement and mutually reinforce each other in entity alignment.
The training set generation module generates a knowledge-graph structure training set and an entity attribute training set from the original knowledge-graph data sets. The positive samples of the training set contain not only the triples of the two knowledge graphs; new triples are also generated from the pre-aligned "seed entity pairs" by an entity-swapping method, bridging the entities and attributes of the two knowledge graphs so that the model can learn a unified embedded representation space over both. Given the aligned entity pairs (x, y) ∈ P, the entity-swapping method generates a new training set:
T_swap = { (y, r, t) | (x, r, t) ∈ T1+ } ∪ { (h, r, y) | (h, r, x) ∈ T1+ } ∪ { (x, r, t) | (y, r, t) ∈ T2+ } ∪ { (h, r, x) | (h, r, y) ∈ T2+ }    (6)

where T1+ and T2+ denote the positive sample data sets of knowledge graphs G1 and G2. Overall, the total positive sample set generated is T+ = T1+ ∪ T2+ ∪ T_swap.
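The entity-swapping generation of new positive triples can be sketched as follows. This is an illustrative sketch of the swap idea, not the patent's code; the data structures (dicts of seed pairs, sets of string triples) are assumptions.

```python
def swap_entity_triples(seed_pairs, triples1, triples2):
    """Generate new positive triples by swapping aligned seed pairs
    (x, y): occurrences of x in KG1 triples are replaced by the
    counterpart y, and vice versa, bridging the two graphs into one
    embedding space."""
    p12 = dict(seed_pairs)                  # x -> y
    p21 = {y: x for x, y in seed_pairs}     # y -> x
    new = set()
    for h, r, t in triples1:
        if h in p12 or t in p12:
            new.add((p12.get(h, h), r, p12.get(t, t)))
    for h, r, t in triples2:
        if h in p21 or t in p21:
            new.add((p21.get(h, h), r, p21.get(t, t)))
    return new
```

For example, with the seed pair ("a", "A"), the KG1 triple (a, r, b) yields the bridging triple (A, r, b), and the KG2 triple (A, s, B) yields (a, s, B).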
The negative samples T- of the training set are generated using a nearest-neighbor negative sampling technique. This is because the collaborative learning framework supplements the labeled alignment results in an incremental manner, and the entity alignment results obtained from each iteration generally do not have high confidence. Erroneous information therefore easily enters the labeling process, misleads the training of the embedded representation model, and reduces the accuracy of subsequent entity alignment results; this phenomenon is the semantic drift problem. An effective negative sampling method is thus needed to improve the accuracy of entity representation learning. The traditional negative sampling method draws negative triples T- uniformly at random: given a triple (h, r, t) ∈ T+, it replaces h or t with an arbitrary entity. However, a sample produced in this way is relatively easy to distinguish from its positive sample. The method instead relies on the knowledge-model embedded representation model for entity alignment being able to distinguish two similar triples (a positive sample and a negative sample).
Given an entity x to be replaced, unlike random sampling from all entities, the method selects entities of higher similarity to x as the sampling candidate set. Specifically, it searches the neighbors of entity x using the Euclidean distance between embedding vectors and selects the nearest neighbor entities in the embedding space as the candidate set for negative sampling, where the size of the candidate set is ⌈sN⌉, s being a scale factor and N the number of entities in the knowledge model. This negative sampling scheme guarantees that entities of low similarity to x are not sampled; at the same time, the sampled negative samples retain features similar to the positive samples (such as attribute types and entity relations), which benefits the learning of the entity embedding vectors.
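The candidate-set construction for nearest-neighbor negative sampling can be sketched as follows. This is an illustrative numpy sketch; the interpretation of the candidate-set size as ⌈sN⌉ and the exhaustive distance computation (rather than an approximate nearest-neighbor index) are assumptions.

```python
import math
import numpy as np

def nearest_neighbor_candidates(x_id, embeddings, scale=0.1):
    """Select the ceil(scale * N) nearest neighbors of entity x in the
    embedding space (Euclidean distance) as the candidate pool for
    negative sampling; replacing x with a random candidate yields a
    'hard' negative that stays similar to the positive triple."""
    N = embeddings.shape[0]
    k = math.ceil(scale * N)
    dists = np.linalg.norm(embeddings - embeddings[x_id], axis=1)
    dists[x_id] = np.inf                 # exclude x itself
    return np.argsort(dists)[:k]
```

A negative triple is then formed by swapping x for one of the returned candidates, so the negative remains close to the positive in the embedding space.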
The entity alignment module computes, from the trained entity embedding vectors, the distance between every pair of entities of the two knowledge models to obtain their similarity; the distance may be the Euclidean or the Cosine distance. Under a threshold γ, an entity x ∈ G1 may correspond to many y ∈ G2. However, in this task entity alignment considers only the 1-to-1 case, so the constraints Σ_{x'∈X} φ_t(x', y) ≤ 1 and Σ_{y'∈Y} φ_t(x, y') ≤ 1 are required, where φ_t(·,·) indicates whether two entities are labeled as aligned in iteration t, 1 denoting aligned and 0 not aligned. With these two constraints, the problem is transformed into a maximum weighted bipartite matching problem, which is solved with a bipartite matching algorithm. Newly labeled aligned entities inevitably conflict across different iterations; to improve labeling quality and satisfy the alignment constraint, a labeled aligned entity may be relabeled or unlabeled in subsequent iterations. The project uses a simple but effective editing technique to achieve this:
before adding a newly marked alignment entity, it is checked whether a conflict results. Considering the case where an entity generates conflicting labels in different iterations, the method may wish to select an entity matching x that provides a higher degree of alignment similarity, assuming that there are two different candidate labels y and y' both matching entity x in the two iterations. Formally, the method computes the following likelihood differences:
Δ = sim(x, y) - sim(x, y')    (7)

If Δ ≥ 0, entity y is the better match for entity x; otherwise y' is the more appropriate match for x. This keeps the matching accuracy high and stable over the continuing iterations.
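The conflict-editing rule can be sketched in one small function. This is an illustrative sketch; the similarity function `sim` is a hypothetical callable standing in for whatever alignment-similarity measure (Euclidean or Cosine based) the system uses.

```python
def resolve_conflict(sim, x, y, y_new):
    """Conflict editing between iterations: entity x was previously
    labeled as aligned with y, and a later iteration proposes y_new.
    Keep whichever candidate has the higher alignment similarity."""
    delta = sim(x, y) - sim(x, y_new)
    return y if delta >= 0 else y_new
```

Applying this check before every insertion of a newly labeled pair keeps the training set consistent with the 1-to-1 alignment constraint.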
As can be seen from the above embodiments, the heterogeneous knowledge-graph fusion method and system disclosed by the present invention solve the problems of prior-art knowledge-graph embedding methods, in which single structural information has difficulty distinguishing different entities under the same concept and limited training data limits the accuracy of entity embedded representation learning. The invention provides a heterogeneous knowledge-graph fusion method fusing structural information and attribute information, making full use of the two kinds of information in a graph: the entity structure and the entity attributes. A structure-based entity representation vector is obtained through a knowledge representation learning model, and an attribute-based entity representation is learned through a twin neural network model with a shared attention mechanism. The best matches found in each iteration over the two kinds of information are labeled and supplemented into the training set as new labeled data, so that the models of the two kinds of information assist each other and are iteratively enhanced, finally yielding an entity alignment result of higher accuracy.
The method and system of the present invention are not limited to the embodiments described in the detailed description, and those skilled in the art can derive other embodiments according to the technical solutions of the present invention, which also belong to the technical innovation scope of the present invention.

Claims (14)

1. A heterogeneous knowledge graph fusion method comprises the following steps:
s1, organizing the data of the knowledge graphs to be matched, and forming a knowledge graph training set from the knowledge graphs independently generated from various data of different sources and different structures, the knowledge graph training set comprising a structure training set and an entity attribute training set;
s2, learning, through a knowledge graph representation learning technique in a deep learning model, entity representation vectors from the knowledge graph structure information and the knowledge graph attribute information required for matching entities;
s3, calculating the similarity between the entities in the two domain knowledge graphs according to the entity expression vectors, wherein the similarity calculation method is based on Euclidean distance or Cosine distance;
and S4, finding the best match among the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
2. The heterogeneous knowledge-graph fusion method of claim 1, characterized in that step S1 is preceded by the steps of:
s11: generating an initial training set from the aligned graph data sets;
s12: training the structure embedding learning model for a set number of iterations to learn the embedded representations of the entity structures;
s13: aligning the entities and supplementing the best-matching entity set of higher reliability into the training set;
s14: training the attribute embedding learning model with the supplemented training set for a set number of iterations, aligning the entities and further supplementing the training set, which is used to continue training the structure embedding model.
3. The heterogeneous knowledge-graph fusion method of claim 2, characterized in that: step S4 further comprises supplementing the set of best-matching entities of higher reliability into the initial training set of step S11, expanding the initial training set so that the structure information and the attribute information promote each other in the iterative training.
4. The heterogeneous knowledge-graph fusion method of claim 3, wherein: the complementary set of best matching entities contains only positive samples and does not use the nearest neighbor approach to sample negative samples.
5. The heterogeneous knowledge-graph fusion method of claim 1, which is characterized in that: in step S2, learning the structure information, learning entity triples by using knowledge representation learning technology, and representing the entities of the domain knowledge graph as vectors containing the structure information;
the attribute information of the entity comprises attribute categories and corresponding attribute values of the entity, and the twin neural network model based on the shared attention mechanism is adopted for learning the attribute-based entity representation aiming at the attribute learning of the entity.
6. The heterogeneous knowledge-graph fusion method of claim 5, wherein: for the entity structure information learning, based on a limit loss function, the method adopts an objective function O_TSR:

O_TSR = Σ_{τ∈T+} max(0, f(τ) - γ1) + μ Σ_{τ'∈T-} max(0, γ2 - f(τ'))

wherein γ1, γ2 are two hyperparameters limiting the scores of the positive and negative samples, μ > 0 is a hyperparameter balancing the negative samples, and f(τ) ≤ γ1 and f(τ') ≥ γ2.
7. The heterogeneous knowledge-graph fusion method of claim 3, wherein: the entity attribute information learning comprises: in the process of processing the entity attribute information, establishing a twin network to process the attribute information of the two knowledge graphs; processing the character strings of attribute values of different data structures, establishing a character-level attribute-value representation learning model, and learning the information of the entity attribute values; and constructing a self-attention mechanism to learn the importance of different entity attributes, a shared attention mechanism being proposed so that the embedded representation of an attribute category shares the same attention weight with the embedded representation of the corresponding attribute value.
8. The heterogeneous knowledge-graph fusion method of claim 7, wherein: the attention weight is calculated as follows:

α_i = softmax(Q^T W_a q_i)

wherein α_i is the weight of the i-th attribute, Q is the attribute-category matrix, q_i is the category of the i-th attribute, and W_a is a weight matrix.
9. The heterogeneous knowledge-graph fusion method of claim 8, wherein: the Euclidean or Cosine distance is finally used to compute a contrastive loss as the objective function, and the "twin network" loss function of the entity attributes adopts a contrastive function:

L = Σ_i [ Y_i d_i^2 + (1 - Y_i) max(0, m - d_i)^2 ]

wherein Y_i is the alignment label of the i-th sample; for entities x1, x2 of the two knowledge models, the twin network computes entity representation vectors x1 and x2, the distance between the two entity vectors being d_i = ||x1 - x2||_2; the attribute-category vector is v_type = Σ_i α_i q_i and the attribute-value vector is v_value = Σ_i α_i v_i.
10. The heterogeneous knowledge-graph fusion method of claim 9, wherein: before the best-matching entity set of higher reliability is supplemented into the initial training set of step S11, it is checked whether a conflict is caused; assuming that two different candidate labels y and y' both match entity x in two iterations, the entity matching x with the higher alignment similarity is selected by computing:

Δ = sim(x, y) - sim(x, y')

If Δ ≥ 0, entity y is the better match for entity x; otherwise y' is the better match for entity x.
11. A heterogeneous knowledge-graph fusion system is characterized in that: the system comprises
The data preprocessing module is used for sorting the data of the knowledge graph to be matched and forming a knowledge graph training set comprising a structure training set and an entity attribute training set from the knowledge graphs independently generated by the data with different sources and different structures;
the deep learning module is used for learning, through a knowledge graph representation learning technique in a deep learning model, entity representation vectors from the knowledge graph structure information and the knowledge graph attribute information required for matching entities;
the similarity calculation module is used for calculating the similarity between the entities in the two domain knowledge maps according to the entity expression vector, and the similarity calculation method is based on Euclidean distance or Cosine distance;
and the entity alignment module is used for finding the best matching between the entities according to the calculated entity similarity, wherein the best matching entities are entity matching sets, namely an entity matching set based on the structure information and an entity matching set based on the attribute information.
12. The heterogeneous knowledge-graph fusion system of claim 11, wherein: the data preprocessing module is further used for generating an initial training set from the aligned graph data sets;
the deep learning module is used for training the structure embedding learning model for a set number of iterations to learn the embedded representations of the entity structures;
the entity alignment module is used for aligning the entities and supplementing the best-matching entity set of higher reliability into the training set;
the deep learning module is further used for training the attribute embedding learning model with the supplemented training set for a set number of iterations, aligning the entities and further supplementing the training set, which is used to continue training the structure embedding model.
13. The heterogeneous knowledge-graph fusion system of claim 12 wherein: the set of best matching entities supplemented into the training set contains only positive samples and does not use the nearest neighbor method to sample negative samples.
14. The heterogeneous knowledge-graph fusion system of claim 11 wherein: aiming at the structural information learning, the entity triples are learned by using a knowledge representation learning technology, and the entities of the domain knowledge graph are represented as vectors containing structural information;
the attribute information of the entity comprises attribute categories and corresponding attribute values of the entity, and the twin neural network model based on the shared attention mechanism is adopted for learning the attribute-based entity representation aiming at the attribute learning of the entity.
CN202111202752.2A 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system Pending CN114090783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111202752.2A CN114090783A (en) 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system


Publications (1)

Publication Number Publication Date
CN114090783A true CN114090783A (en) 2022-02-25

Family

ID=80297057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111202752.2A Pending CN114090783A (en) 2021-10-15 2021-10-15 Heterogeneous knowledge graph fusion method and system

Country Status (1)

Country Link
CN (1) CN114090783A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708540A (en) * 2022-04-24 2022-07-05 上海人工智能创新中心 Data processing method, terminal and storage medium
CN115099606A (en) * 2022-06-21 2022-09-23 厦门亿力吉奥信息科技有限公司 Training method and terminal for power grid dispatching model
CN115774788A (en) * 2022-11-21 2023-03-10 电子科技大学 Negative sampling method for knowledge graph embedded model
CN115905561A (en) * 2022-11-14 2023-04-04 华中农业大学 Body alignment method and device, electronic equipment and storage medium
CN115982374A (en) * 2022-12-02 2023-04-18 河海大学 Dam emergency response knowledge base linkage multi-view learning entity alignment method and system
CN116069956A (en) * 2023-03-29 2023-05-05 之江实验室 Drug knowledge graph entity alignment method and device based on mixed attention mechanism
CN116384494A (en) * 2023-06-05 2023-07-04 安徽思高智能科技有限公司 RPA flow recommendation method and system based on multi-modal twin neural network
CN116628247A (en) * 2023-07-24 2023-08-22 北京数慧时空信息技术有限公司 Image recommendation method based on reinforcement learning and knowledge graph

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472065A (en) * 2019-07-25 2019-11-19 电子科技大学 Across linguistry map entity alignment schemes based on the twin network of GCN
CN110941722A (en) * 2019-10-12 2020-03-31 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KAI YANG ET AL.: "COTSAE: CO-Training of Structure and Attribute Embeddings for Entity Alignment", The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 3 April 2020, pages 3026-3029 *
ZEQUN SUN ET AL.: "Bootstrapping Entity Alignment with Knowledge Graph Embedding", Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), 31 December 2018, pages 4396-4399 *
ZHU Jizhao; QIAO Jianzhong; LIN Shukuan: "Entity alignment algorithm for representation learning of knowledge graphs", Journal of Northeastern University (Natural Science), no. 11, 15 November 2018 *
DU Wenqian; LI Bicheng; WANG Rui: "Knowledge graph representation learning method fusing entity descriptions and types", Journal of Chinese Information Processing, no. 07, 15 July 2020 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination