CN116108351A

CN116108351A - Cross-language knowledge graph-oriented weak supervision entity alignment optimization method and system

Info

Publication number: CN116108351A
Application number: CN202310055330.XA
Authority: CN
Inventors: 贾萌萌; 刘琰; 刘粉林; 郭晓宇; 吴文昊
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2023-01-16
Filing date: 2023-01-16
Publication date: 2023-05-12

Abstract

The invention discloses a weak supervision entity alignment optimization method and system oriented to cross-language knowledge graphs. Experiments on DBP15K show that the method effectively reduces the amount of labeled data required: relying on only 10% labeled data, it achieves accuracy similar to that of the Dual-AMN method, and its Hit@1 value improves by at least 3% over the baseline methods.

Description

Cross-language knowledge graph-oriented weak supervision entity alignment optimization method and system

Technical Field

The invention belongs to the technical field of cross-language knowledge graph entity alignment, and particularly relates to a cross-language knowledge graph oriented weak supervision entity alignment optimization method and system.

Background

In recent years, research on multilingual knowledge graph embedding has made progress on tasks such as latent semantic representation of entities and cross-language knowledge reasoning, which has promoted many knowledge-driven cross-language applications. However, because the degree of entity alignment between knowledge graphs in different languages is low, the accuracy of downstream tasks such as cross-language reasoning, cross-language knowledge graph completion, and cross-language link prediction is often unsatisfactory. Real-world systems inevitably contain entities in many low-resource languages, and knowledge graphs in different languages both overlap and differ in fields such as e-commerce, question answering, and artificial intelligence. Aligning and integrating entities across these knowledge graphs can greatly enrich the domain information of each graph, so mining more aligned entities between cross-language knowledge graphs is very important for many downstream tasks and practical applications.

A cross-language knowledge graph stores real-world knowledge in the form of triples (e.g., <entity, relationship, entity>) expressed in two or more languages, where the relationship connects the two entities. At present, knowledge graphs are widely applied in fields such as information retrieval and artificial intelligence. Because different knowledge graphs often contain complementary information that plays an important role in improving knowledge graph quality, many researchers focus on knowledge graph alignment. Entity alignment, a key step of knowledge graph alignment, aims to identify equivalent entities in different knowledge graphs, and is therefore of great significance for knowledge fusion and for constructing higher-quality knowledge graphs.

Most existing entity alignment methods can be classified as supervised, unsupervised, or semi-supervised. Supervised entity alignment maps several knowledge graphs into a unified vector space, relies on a large amount of labeled data for training, and finally obtains aligned entities by calculating the distances between entities. Labeled data is difficult to obtain, and manually collecting enough labeled seeds is expensive and time-consuming. To eliminate the reliance on labeled data, previous studies have investigated unsupervised entity alignment methods. Unsupervised entity alignment generally derives approximate candidate seed sets from the side information of the knowledge graph; however, these sets are noisy, so handling the noise effectively and screening out more reliable alignment seeds remains a difficulty for unsupervised methods. Semi-supervised learning completes the entity alignment task using both labeled and unlabeled data: existing semi-supervised methods select, in each iteration, several aligned entities with higher confidence and add them to the training set, which effectively reduces the dependence on labeled data.

In general, labeled data provides more reliable information than unsupervised information, so training an entity alignment model with only a small amount of labeled data usually requires the participation of unsupervised data. However, aligned entity pairs generated by unsupervised methods still have a high probability of error, and it is very challenging to screen out more reliable unsupervised candidate seeds and combine them well with labeled data. Existing entity alignment methods need sufficient and reliable labeled data and place strict requirements on its quantity and quality. If unreliable candidate seeds are added to the ground-truth set, early errors reinforce themselves during learning, and the candidate seeds may ultimately introduce and amplify noise, reducing the accuracy of entity alignment. Labeled data thus participates directly in training, influences the quality of the final result, and plays a critical role in entity alignment.

Disclosure of Invention

Aiming at the problem that supervised entity alignment methods depend heavily on labeled data, the invention provides a weak supervision entity alignment optimization method and system oriented to cross-language knowledge graphs, which supplement the limited training data of the supervised method. Throughout training, the importance of each entity pair is determined by weight information generated by a candidate seed optimization selection algorithm, and the training process is optimized based on a weighted loss function. The method allows the model to be trained with a small amount of labeled data; in that case candidate seeds and labeled data are placed under a unified learning framework for iterative training, and the method is semi-supervised. The model may also be trained using only an automatically generated candidate seed set, in which case the method is unsupervised.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the invention provides a weak supervision entity alignment optimization method oriented to a cross-language knowledge graph, which comprises the following steps:

step 1: extracting the neighborhood relations and semantic-character information of entities in the cross-language knowledge graphs, and obtaining a preliminary candidate seed set by similarity calculation;

step 2: evaluating the reliability of the candidate seed set by combining semantic and character information, and finding the candidate seeds with higher reliability;

step 3: jointly training on the optimally selected candidate seeds and part of the labeled data;

step 4: improving the loss function using the obtained credibility, and finally optimizing the whole training process.

Further, the step 1 includes:

using a sparse matrix to store the entity and relation information in the knowledge graph to obtain a neighbor information matrix, wherein the diagonal entries of the matrix take the ratio of the number of all triples to the number of entities, representing the weight between the entities connected to them;

weight information between entities is calculated according to the following formula:

wherein $l_t$ denotes the number of triples, and the accompanying count term denotes the number of times the relation $r_{ij}$ between entity $e_i$ and entity $e_j$ occurs in all triples; the higher the weight, the closer the association between the two entities;

for the source and target cross-language knowledge graphs, non-English attribute names are first translated into English using a machine translation system; GloVe word embeddings are then used to obtain a semantic-level average word vector for each attribute name, and the average word vector represents the semantic information of the entity name; each attribute name is also split into units of two characters to obtain a character-level embedding vector; finally, the cosine similarity function is applied to the semantic-level and character-level vectors to compute the semantic and character-level similarity between the two knowledge graphs, yielding a preliminary candidate seed subset.

Further, the step 2 includes:

first, based on the obtained candidate seed set, obtaining the entity pairs that are each other's nearest neighbors, the entity pairs being computed from entity neighborhood and semantic information; then extracting the semantic-level vectors and character-level vectors of the entity pairs respectively and performing a similarity calculation on each to obtain the semantic and character-level credibility between the entities; and finally evaluating the candidate seeds by combining the two credibility values.

Further, the step 3 includes:

storing the candidate seed set in a set U and the labeled data in a set L, selecting a proportion $w_1$ of the data from the set U and a proportion $w_2$ of the data from the set L, the selected data satisfying equation (9) and containing no repeated items:

$$U \cdot w_1 + L \cdot w_2 = L \cdot n \tag{9}$$

wherein the parameters must satisfy the following ranges:

$$w_1 \in [0, n],\quad w_2 \in [0, n],\quad n \in [0, 1],\quad w_1 + w_2 = n \tag{10}$$

next, $w_1$ and $w_2$ take different values to select candidate seeds and labeled data in different proportions, and finally joint training is carried out based on the two selected sets of data.
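As an illustration only, the selection in step 3 might be sketched as follows. The function name `build_training_set`, the rounding of proportions to whole pairs, and the fixed random seed are assumptions of this sketch and are not specified in the text. In the toy data the sets U and L have the same size, which is the case in which equations (9) and (10) hold simultaneously.

```python
import random


def build_training_set(candidates, labeled, w1, w2, n, seed=0):
    """Pick a share w1 of candidate seeds (U) and a share w2 of labeled pairs (L).

    Hypothetical helper: checks the ranges of equation (10) and never
    returns the same pair twice.
    """
    if not (0 <= n <= 1 and 0 <= w1 <= n and 0 <= w2 <= n):
        raise ValueError("w1, w2 must lie in [0, n] and n in [0, 1]")
    if abs(w1 + w2 - n) > 1e-9:
        raise ValueError("w1 + w2 must equal n")

    rng = random.Random(seed)
    labeled_part = rng.sample(labeled, round(len(labeled) * w2))

    # Exclude pairs already chosen from L so that no pair is repeated.
    taken = set(labeled_part)
    pool = [pair for pair in candidates if pair not in taken]
    k = min(len(pool), round(len(candidates) * w1))
    candidate_part = rng.sample(pool, k)
    return candidate_part, labeled_part


# Toy data: 100 candidate pairs and 100 labeled pairs, 10 of which overlap.
U = [(i, i) for i in range(100)]
L = [(i, i) for i in range(90, 190)]
from_U, from_L = build_training_set(U, L, w1=0.2, w2=0.1, n=0.3)
```

With these values, 20 pairs come from U and 10 from L, so the combined size equals `len(L) * n`, as equation (9) requires.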

Further, the step 4 includes:

during training, assigning weight information $D_{final}(u, v)$ to each entity pair $(u, v)$ in the optimally selected candidate seeds, and setting the weight score of the labeled data to 1; reducing the dependence on sample size and hyperparameters by fixing the mean and variance of the sample losses, thereby completing the improvement of the loss function and finally optimizing the whole training process.
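The text does not give the exact form of the improved loss, so the following is only a hedged sketch of one way the two ingredients named in step 4 could be combined: per-pair weights ($D_{final}$ for candidate seeds, 1 for labeled pairs) and sample losses rescaled to zero mean and unit variance. The weighted log-sum-exp aggregation, the function name `normalized_weighted_loss`, and the `scale` parameter are assumptions of this sketch.

```python
import math


def normalized_weighted_loss(losses, weights, scale=1.0):
    """Weighted log-sum-exp over z-scored per-pair losses (illustrative only).

    losses  : raw loss of each training pair
    weights : D_final(u, v) for candidate seeds, 1.0 for labeled pairs
    """
    if not losses or len(losses) != len(weights):
        raise ValueError("losses and weights must be non-empty and equally long")
    if sum(weights) <= 0:
        raise ValueError("at least one weight must be positive")

    # Fix the mean and variance of the sample losses (z-score normalization).
    mean = sum(losses) / len(losses)
    var = sum((x - mean) ** 2 for x in losses) / len(losses)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all losses are equal
    z = [scale * (x - mean) / std for x in losses]

    # Numerically stable weighted log-sum-exp.
    shift = max(z)
    total = sum(w * math.exp(zi - shift) for w, zi in zip(weights, z))
    return shift + math.log(total)


# A low-confidence candidate seed (weight 0.5) contributes less than a labeled pair.
full = normalized_weighted_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
down = normalized_weighted_loss([1.0, 2.0, 3.0], [1.0, 1.0, 0.5])
```

Because the losses are normalized before aggregation, the result does not depend on the absolute scale of the raw losses, which is the property the step attributes to fixing the mean and variance.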

Another aspect of the present invention provides a weak supervision entity alignment optimization system for a cross-language knowledge graph, including:

the preliminary candidate seed set obtaining module is used for extracting neighborhood relation and semantic character information of the entities in the cross-language knowledge graph and obtaining a preliminary candidate seed set by using similarity calculation;

the candidate seed optimization selection module is used for carrying out reliability evaluation on the candidate seed set by combining semantic and character information to find candidate seeds with higher reliability;

the joint training module is used for jointly training the optimally selected candidate seeds and part of the labeled data;

and the loss function improvement module is used for improving the loss function by utilizing the obtained credibility and finally optimizing the whole training process.

Further, the preliminary candidate seed set derivation module is specifically configured to:

wherein $l_t$ denotes the number of triples,

Further, the candidate seed optimization selection module is specifically configured to:

Further, the joint training module is specifically configured to:

$$U \cdot w_1 + L \cdot w_2 = L \cdot n \tag{9}$$

wherein the parameters must satisfy the following ranges:

$$w_1 \in [0, n],\quad w_2 \in [0, n],\quad n \in [0, 1],\quad w_1 + w_2 = n \tag{10}$$

next, $w_1$ and $w_2$ take different values to select candidate seeds and labeled data in different proportions, and finally joint training is carried out based on the two selected sets of data.

Further, the loss function improvement module is specifically configured to:

Compared with the prior art, the invention has the beneficial effects that:

starting from the shortcomings of existing entity alignment methods, the invention analyzes the possibility of interaction between labeled and unlabeled data and provides a weak supervision entity alignment optimization method and system oriented to cross-language knowledge graphs. The invention takes the candidate seed set generated by an unsupervised method as supplementary data, combines semantic and character-level distance metrics to select candidate seeds with higher reliability, feeds the candidate seeds and the labeled data into the model for joint training, and optimizes the whole training process using a loss function based on credibility weights. Experimental results show that, using only 10% labeled data, the method achieves results comparable to the Dual-AMN method, and that its Hit@1 value improves by at least 3% over all baseline methods.

Drawings

FIG. 1 is a schematic diagram of a Dual-AMN framework;

FIG. 2 is an example of labeled data;

FIG. 3 is a cross-language knowledge graph oriented weak supervision entity alignment optimization method (WSEO) model framework in accordance with an embodiment of the present invention;

FIG. 4 is a process for generating candidate seed nodes according to an embodiment of the present invention;

FIG. 5 is a diagram showing the overall concept of a distance measurement method according to an embodiment of the present invention;

FIG. 6 is a comparison of the performance of WSEO, according to an embodiment of the present invention, and Dual-AMN using the same amount of labeled data on three datasets;

fig. 7 is a schematic diagram of a weak supervision entity alignment optimization system architecture oriented to a cross-language knowledge graph according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:

(a) Problem definition

Aiming at the problem of excessively high time complexity in entity alignment, Mao et al. [Mao X, Wang W, Wu Y, et al. Boosting the speed of entity alignment 10×: Dual attention matching network with normalized hard sample mining[C]//Proceedings of the Web Conference 2021. 2021: 821-832] proposed a less time-consuming entity alignment method (Dual-AMN). It uses a simplified encoder and a normalized hard sample mining technique to reduce the time complexity of the algorithm, uses knowledge graph relations, neighbors, and other information to generate single-graph entity vectors and cross-graph vectors, and finally combines a distance metric to obtain aligned entity pairs.

The Dual-AMN method mainly comprises three steps: single-graph entity vector acquisition, cross-graph vector acquisition, and distance measurement. Fig. 1 depicts the main flow of the method. Single-graph entity vector acquisition uses a simplified relational attention layer, in which a relational projection operation replaces the standard linear transformation matrix to reduce the additional parameters introduced for each entity; the outputs of the embedding layers are then concatenated to capture multi-hop neighbor information. In Fig. 1(1), a GNN is used to obtain the entity vectors of a single knowledge graph, where $e_1$ and $e_2$ denote aligned entities. Cross-graph vector acquisition uses a finite set of proxy vectors to represent cross-graph alignment information; the proxy vectors are obtained by feeding entity, relation, and neighbor information into the GNN model, and if two entities are equivalent, their distances to these proxy vectors are also the same. Fig. 1(2) thus obtains the cross-graph proxy vectors on the basis of Fig. 1(1). After the cross-graph entity vectors are obtained, the distance between each cross-graph entity vector and the proxy vectors is calculated, and entities whose distances are consistent are regarded as aligned entity pairs; the L2 distance is used in training to measure the similarity between entity and proxy vectors, and the CSLS distance metric is used in testing to mitigate the curse of dimensionality. Fig. 1(3) compares the distance of each entity from the proxy vectors, and entities with consistent distances are regarded as an aligned entity pair. The final Dual-AMN model uses these three steps to obtain aligned entity pairs.

Compared with earlier supervised entity alignment algorithms, the Dual-AMN entity alignment method combines single-graph entity vectors with cross-graph vectors, simplifies the encoder using a simplified relational attention layer and a proxy matching attention layer, and selects informative negative samples using normalized hard sample mining. These steps shorten the running time of the whole algorithm, addressing the long running time of entity alignment algorithms while maintaining high accuracy.

However, this method also has two disadvantages:

1) The Dual-AMN entity alignment method relies on a large amount of labeled data for training.

During training, the Dual-AMN entity alignment method relies on 30% labeled data, and the accuracy of the model drops markedly when less than 30% of the data is labeled; Dual-AMN therefore cannot achieve high accuracy with only a small amount of labeled data. In practice, however, sufficient labeled data is difficult to obtain and labeling costs are high, so relying heavily on labeled data is impractical. Table 1 shows the performance of Dual-AMN under different amounts of labeled data: performance declines as the amount of labeled data decreases and improves as it increases.

Table 1. Hit rate comparison of Dual-AMN under different proportions of alignment seeds

2) The labeled data in entity alignment datasets may also contain errors.

Existing entity alignment methods use labeled data to connect two knowledge graphs: if two entities are labeled as aligned, they are embedded as one entity in a unified vector space. However, some labeled pairs may not correspond to anything in real life. For example, Fig. 2 shows part of an English knowledge graph and part of a Spanish knowledge graph. The entity alignment task aims to identify equivalent entity pairs between the two knowledge graphs, but the equivalence between Hirokazu Koreeda in English and Hirokazu Koreeda in Spanish cannot be known, since they are not present in real life. During model training, the iterative process may propagate and amplify such noise, degrading entity alignment performance.

The principle and the shortcomings of the Dual-AMN entity alignment method are described above, and the following is a definition of the main problems involved in the method of the present invention.

(b) Entity alignment definition

Cross-language knowledge graph: a cross-language knowledge graph can be described as a graph $G = (E, R, T)$ containing three sets. It stores real-world information in the form of triples <entity, relationship, entity>, where the triples describe the inherent relationships between entities. Here E represents the set of entities, R represents the set of relationships, and T represents the set of triples. An entity is a specific thing (such as Beijing University or Beijing, China), and an edge represents the relationship between things (such as "located in"); the triple <Beijing University, located in, Beijing, China> thus expresses that Beijing University is located in Beijing, China.

Entity alignment: entity alignment aims to identify equivalent entity pairs in two cross-language knowledge graphs. Given two knowledge graphs $G_1 = (E_1, R_1, T_1)$ and $G_2 = (E_2, R_2, T_2)$, where $E_1$, $E_2$ represent sets of entities, $R_1$, $R_2$ represent sets of relationships between entities, and $T_1$, $T_2$ represent sets of triples, together with labeled data $P = \{(u, v) \mid u \in E_1, v \in E_2, u \equiv v\}$, where $\equiv$ denotes equivalence, the model uses $G_1$, $G_2$, and $P$ to obtain more potential equivalent entity pairs.

Weak supervision entity alignment oriented to cross-language knowledge graphs: because the Dual-AMN method depends too heavily on labeled data, the present method provides an improvement strategy targeting this shortcoming. The method generates a candidate seed set $C$ using neighborhood relations and semantic-character information, and then uses $C$ together with the labeled data $P = \{(u, v) \mid u \in E_1, v \in E_2, u \equiv v\}$ to generate the training data; the final algorithm uses $G_1 = (E_1, R_1, T_1)$, $G_2 = (E_2, R_2, T_2)$, $C$, and $P$ to obtain aligned entity pairs.

1. The method of the invention

Because the semantic names of entities in cross-language knowledge graphs are highly similar, the similarity is more easily captured by a model once the cross-language entities have been translated into the same language by a machine translation system. To realize a weak supervision entity alignment optimization framework that uses as little labeled data as possible in the whole model, the invention provides a weak supervision entity alignment optimization method (WSEO) oriented to cross-language knowledge graphs. WSEO first uses the semantic similarity between entities to generate a candidate seed subset. Rather than using a complex model to generate the candidate seed subset, we use entity neighborhood relations as auxiliary information for the attribute names, obtaining a more comprehensive neighborhood-semantic similarity.

As shown in Fig. 3, our method model contains two parts: an unsupervised candidate seed generation part and a supervised training part. The unsupervised candidate seed generation part mainly comprises candidate node generation and the optimization selection of candidate seeds; the supervised training part comprises the joint training of candidate seeds and labeled data and the correction of the loss function.

Given two cross-language knowledge graphs, the unsupervised candidate seed generation part first extracts the neighborhood relations and semantic-character information of the entities in the knowledge graphs and finds a preliminary candidate seed set using cosine similarity; the optimization selection algorithm then evaluates the reliability of the candidate seed set by combining semantic and character information and finds the candidate seeds with higher reliability; in the supervised training part, the optimally selected candidate seeds and part of the labeled data are fed together into the entity alignment model for joint training; finally, the credibility obtained by the optimization selection algorithm is used to improve the loss function and optimize the whole training process.

The role of candidate seed generation is to produce a preliminary candidate seed set; the main problem of this part is to generate, for each entity, its N most similar nodes and the corresponding weight information. Because some of the preliminary candidate seeds contain errors, they are further optimized and selected, and the candidate seeds obtained after optimization selection are the training data to be used. Because the optimization selection weights of the nodes differ, a loss function improvement strategy based on credibility weights is provided to obtain sounder training results. In the whole method model, how the candidate seeds with higher reliability are selected directly influences the experimental performance, so the optimization selection of candidate seeds is an important link that determines the quality of the model.

1.1 candidate seed Generation

In this section we focus on the generation of the candidate seed subset. The candidate seeds are unsupervised data; they can serve as a supplement to the labeled data, and different proportions of them can be selected to construct the training data. Data selected by a better-optimized candidate seed selection algorithm can be considered more accurate, whereas a more complex model introduces redundancy. It is therefore considered worthwhile if a preliminary candidate seed set can be obtained from readily available source data with a convenient and efficient method, the candidate set is then effectively optimized and selected, and the final experimental results reach accuracy comparable to existing methods.

A knowledge graph has many kinds of side information, such as entity neighborhoods, attributes, and descriptions. Because cross-language knowledge graphs differ greatly in their textual expression, common knowledge graph alignment methods are not applicable to cross-language entity alignment. Cross-language entities nevertheless have similar structural information, so entity alignment on cross-language knowledge graphs either judges the similarity between entities from neighborhood information, or converts the cross-language entities into the same language by a specific method and then judges which entities are aligned. Neighborhood information is important for entity alignment: if two entities are an aligned pair, their corresponding neighborhood information is also similar. Attribute names exist widely in most knowledge graphs, and, considering data availability, this work uses the more typical attribute names as source data. The neighborhood information and the attribute information are combined to jointly generate a preliminary candidate seed subset.

1.1.1 neighborhood information construction based on relational evaluation

To obtain the neighbor information of a cross-language knowledge graph, a sparse matrix is used to store the entity and relation information of the knowledge graph, giving a neighbor information matrix. The diagonal entries of the matrix take the ratio of the number of all triples to the number of entities, representing the weight between the entities connected to them. Since the importance of a relation reflects, to some extent, the importance of the connection between entities, the weight information between entities is determined by the relation information. The more important a pair of entities is, the smaller the space of entities in the knowledge graph from which they can be chosen. Consider the triples $(e_i, r_1, e_j)$ and $(e_m, r_2, e_n)$, where the entities related to $r_1$ are $(e_h, e_i, \ldots, e_j)$, $j$ entities in total, the entities related to $r_2$ are $(e_p, e_q, \ldots, e_n)$, $n$ entities in total, and $j < n$. Then the range of possible tail entities for $e_i$ is smaller than that for $e_m$, the likelihood of locating a specific entity from $e_i$ is greater, and the relation between $e_i$ and $e_j$ is tighter than that between $e_m$ and $e_n$. The weight information between entities is obtained by:

wherein $l_t$ denotes the number of triples, and the accompanying count term denotes the number of times the relation $r_{ij}$ between entity $e_i$ and entity $e_j$ occurs in all triples. The higher the weight, the tighter the relationship between the two entities. After the neighborhood information between entities is obtained, the neighborhood information matrix is combined with the entity semantic and character-level vectors, and a preliminary candidate seed set is obtained through similarity calculation.
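The weight formula itself is not reproduced in this text, so the sketch below only illustrates the surrounding bookkeeping: counting how often each relation occurs, storing the matrix sparsely, and setting the diagonal to the ratio of the number of triples to the number of entities. The default weight `l_t / count(r)` is a placeholder assumption chosen so that rarer relations receive higher weights, as the passage describes; it should not be read as the patented formula. The dictionary-based sparse storage, the symmetric entries, and the name `neighbor_matrix` are likewise assumptions.

```python
from collections import Counter


def neighbor_matrix(triples, weight_fn=None):
    """Build a sparse neighbor information matrix as a {(row, col): weight} dict."""
    l_t = len(triples)                              # number of triples
    rel_count = Counter(r for _, r, _ in triples)   # occurrences of each relation
    entities = {e for h, _, t in triples for e in (h, t)}

    if weight_fn is None:
        # ASSUMPTION: placeholder weight, larger for rarer relations.
        weight_fn = lambda total, count: total / count

    matrix = {}
    for h, r, t in triples:
        w = weight_fn(l_t, rel_count[r])
        # Keep the strongest relation when a pair is linked more than once.
        w = max(w, matrix.get((h, t), 0.0))
        matrix[(h, t)] = w
        matrix[(t, h)] = w                          # treated as undirected here

    # Diagonal: number of triples divided by number of entities (from the text).
    for e in entities:
        matrix[(e, e)] = l_t / len(entities)
    return matrix


toy_triples = [
    ("a", "r1", "b"),   # r1 occurs once, so it is the more specific relation
    ("c", "r2", "d"),
    ("e", "r2", "f"),
    ("g", "r2", "h"),
]
M = neighbor_matrix(toy_triples)
```

In the toy graph, the pair linked by the rarer relation `r1` receives a larger weight than any pair linked by `r2`, matching the $j < n$ example above.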

1.1.2 comprehensive semantic-character similarity calculation

For the source and target cross-language knowledge graphs, non-English attribute names are first translated into English using a machine translation system; GloVe word embeddings are then used to obtain a semantic-level average word vector for each attribute name, and the average word vector represents the semantic information of the entity name. To obtain the candidate seed subset from a more comprehensive viewpoint, each attribute name is also split into units of two characters to obtain a character-level embedding vector. Finally, the cosine similarity function is applied to the semantic-level and character-level vectors to compute the semantic and character-level similarity between the two knowledge graphs, yielding a preliminary candidate seed subset. When the initial similarity is calculated, the Top-N nearest entities and the corresponding weights are obtained for the entities in the two knowledge graphs. The Top-N entities can be used for the optimization selection of candidate seeds, and the weight information is used to adjust the corresponding loss function, so that aligned entities can be guided more accurately during training. Fig. 4 depicts the generation of the initial candidate seed subset.
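A minimal sketch of this similarity computation is given below, assuming the names have already been translated into English. The three-dimensional `TOY_VECTORS` table stands in for real GloVe embeddings, overlapping two-character units are used for the character level, and the semantic and character scores are simply added to rank candidates; all three choices, and the function names, are assumptions of the sketch rather than details taken from the text.

```python
import math
from collections import Counter


def average_word_vector(name, word_vectors, dim):
    """Semantic level: mean of the vectors of the known words in `name`."""
    known = [word_vectors[t] for t in name.lower().split() if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def char_bigram_vector(name):
    """Character level: counts of overlapping two-character units."""
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity of two dense lists or of two sparse Counters."""
    if isinstance(a, dict):
        dot = sum(a[k] * b[k] for k in a if k in b)
        norm_a = math.sqrt(sum(x * x for x in a.values()))
        norm_b = math.sqrt(sum(x * x for x in b.values()))
    else:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(source_name, target_names, word_vectors, dim, n=2):
    """Return the n target names most similar to `source_name` with their scores."""
    src_sem = average_word_vector(source_name, word_vectors, dim)
    src_chr = char_bigram_vector(source_name)
    scored = []
    for t in target_names:
        score = cosine(src_sem, average_word_vector(t, word_vectors, dim))
        score += cosine(src_chr, char_bigram_vector(t))
        scored.append((score, t))
    scored.sort(reverse=True)
    return scored[:n]


# Hypothetical 3-dimensional vectors standing in for GloVe embeddings.
TOY_VECTORS = {
    "birth": [1.0, 0.0, 0.0],
    "place": [0.0, 1.0, 0.0],
    "death": [0.0, 0.0, 1.0],
    "date": [0.0, 1.0, 1.0],
}
ranked = top_n("birth place", ["place of birth", "death date"], TOY_VECTORS, dim=3)
```

Here `ranked[0]` is the pair for "place of birth": it shares both word vectors and most character bigrams with "birth place", whereas "death date" shares little of either.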

1.2 optimization selection of candidate seeds

The weak supervision algorithm for cross-language entity alignment needs initial candidate seeds, and more aligned entities are gradually generated during the iterative process of the model. The optimization selection of the initial candidate seeds is therefore of great importance. Good seeds need to meet two conditions: their neighbor information is similar, and their semantics are similar. A seed set selected by the optimization selection algorithm is more reliable during training. Section 1.1 generates an initial candidate seed subset from neighborhood and semantic-character information; on this basis, this section designs a candidate seed optimization selection algorithm from the perspective of a two-level (semantic and character) distance metric.

The preliminary candidate seeds generated from entity neighborhood information and attribute names still contain much noise. To obtain a better candidate seed set, we design an optimization selection strategy based on a semantic-character distance metric. Specifically, the algorithm first obtains the entity pairs that are each other's nearest neighbors, the entity pairs being computed from entity neighborhood and semantic information; it then extracts the semantic-level vectors and the character-level vectors of the entity pairs respectively and performs a similarity calculation on each to obtain the semantic and character-level credibility between the entities; finally, the two credibility values are combined to evaluate the candidate seeds. Fig. 5 illustrates the general idea of the distance metric.

1.2.1 entity semantic-character level distance metric

Fig. 5 illustrates the entity semantic-character level distance metric. Here u and v are each other's nearest neighbors, and their semantic similarity is calculated as $d_1$. Starting from entity u in KG1, the second-nearest entity v' of u in KG2 is found, and its semantic similarity to u is calculated as $d_2$. The difference $D_1$ between $d_1$ and $d_2$ is then calculated. Next, starting from entity v in KG2, the difference $D_2$ of semantic similarities is obtained in the same way. The average of the two differences is the overall semantic-level similarity ($D_s$). The character-level similarity is calculated in the same manner as the semantic-level similarity, giving the overall character-level similarity ($D_c$). Combining $D_s$ and $D_c$ gives the weight of the entity pair; the greater the weight, the more likely the two entities are aligned.

The algorithm first searches the target knowledge graph for the two entities v and v' nearest to entity u of the source knowledge graph, and calculates the difference between the similarity of u to v and the similarity of u to v', where cosine similarity is used:

D₁ = |Cos(u,v) - Cos(u,v')| (2)

Then it searches the source knowledge graph for the two entities u and u' nearest to entity v of the target knowledge graph, and calculates the corresponding difference for the target entity v with respect to u and u':

D₂ = |Cos(v,u) - Cos(v,u')| (3)

Finally, the average of the two differences is calculated, which combines the viewpoints of both knowledge graphs to give the entity semantic-level similarity:

D_s = (D₁ + D₂)/2 (4)

Next we obtain the distance metric at the character level, using Jaccard similarity to calculate character-level similarity. Likewise, the difference between the similarities of the source entity u to the target entities v and v' is calculated:

D₃ = |Jaccard(u,v) - Jaccard(u,v')| (5)

Then, for the two entities u and u' in the source knowledge graph nearest to entity v of the target knowledge graph, the difference between the Jaccard similarities of the target entity v to u and u' is calculated:

D₄ = |Jaccard(v,u) - Jaccard(v,u')| (6)

The average of the two differences combines the measurements from both sides:

D_c = (D₃ + D₄)/2 (7)

From this we obtain the semantic-level and character-level metrics Ds and Dc across the two knowledge graphs.
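The metrics of equations (2) to (7) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, and computing Jaccard similarity over character bigrams of entity names is an assumption, since the text only states that Jaccard similarity is used at the character level.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def char_ngrams(text, n=2):
    """Set of character n-grams; the bigram size is an assumption."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between the character n-gram sets of two names."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

def semantic_score(u, v, v2, u2):
    """D_s: u, v are mutually nearest; v2 / u2 are the second-nearest entities."""
    d1 = abs(cosine(u, v) - cosine(u, v2))   # Eq. (2)
    d2 = abs(cosine(v, u) - cosine(v, u2))   # Eq. (3)
    return (d1 + d2) / 2                     # Eq. (4)

def character_score(u, v, v2, u2):
    """D_c: same construction on entity names with Jaccard similarity."""
    d3 = abs(jaccard(u, v) - jaccard(u, v2))  # Eq. (5)
    d4 = abs(jaccard(v, u) - jaccard(v, u2))  # Eq. (6)
    return (d3 + d4) / 2                      # Eq. (7)

# Usage: a clear gap between the nearest and second-nearest match gives a high score.
ds = semantic_score([1, 0], [1, 0], [0, 1], [1, 1])
dc = character_score("Paris", "Paris", "Berlin", "Rome")
```

A pair whose best match is far ahead of its runner-up on both sides receives a large score, which is the intuition behind using the difference rather than the raw similarity.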

1.2.2 candidate seed node evaluation

The values of Ds and Dc take multi-angle information into account, and the whole process relies mostly on similarity calculations, so few additional hyperparameters are introduced. The optimization selection algorithm is thus designed with few parameters. Finally, the evaluation value of a candidate seed is obtained through formula (8).

D_final = D_s + D_c (8)

The optimization selection algorithm can rank candidate seeds better because the pairs it matches integrate neighborhood, semantic and character-level distance information from multiple angles, the matched entities are mutually nearest to each other, and the distance between a matched entity pair must be below a given preference factor. Through the similarity calculation the final result is limited to [0,1]; the larger the value, the higher the credibility that the pair is an aligned entity pair.
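The selection step can be sketched as below: keep only pairs that are mutually nearest and rank them by the evaluation value of formula (8). This is a hedged illustration; the function names, the use of a dense similarity matrix and the dictionary inputs are assumptions, and the preference-factor filter is omitted because its exact form is not given in the text.

```python
import numpy as np

def mutual_nearest_pairs(sim):
    """Return (i, j) where target j is the best match of source i and vice versa.

    sim[i, j] is the similarity between source entity i and target entity j.
    """
    best_for_src = sim.argmax(axis=1)
    best_for_tgt = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_for_src) if best_for_tgt[j] == i]

def rank_candidates(pairs, d_s, d_c):
    """Score each pair with D_final = D_s + D_c (Eq. 8), highest score first."""
    scored = [(pair, d_s[pair] + d_c[pair]) for pair in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Usage with toy values: source 2 prefers target 0, but target 0 prefers source 0.
sim = np.array([[0.90, 0.10, 0.20],
                [0.20, 0.80, 0.30],
                [0.85, 0.10, 0.40]])
pairs = mutual_nearest_pairs(sim)
ranked = rank_candidates(pairs,
                         d_s={(0, 0): 0.3, (1, 1): 0.5},
                         d_c={(0, 0): 0.1, (1, 1): 0.2})
```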

1.3 Joint training of candidate seeds and marker data

Relying entirely on labeled data is impractical when facing massive data. Fortunately, aligned entities generally have similar neighborhood information, so we can start from structural information to mine cross-language entity embeddings and perform alignment with the vectors produced by the neural network. The Dual-AMN entity alignment method selects part of the labeled data to feed into the model for training, and is therefore overly dependent on labeled data. The method of the invention adds a candidate seed set on top of the labeled data: the candidate seeds are stored in set U and the labeled data in set L. A proportion w₁ of data is selected from set U and a proportion w₂ from set L; the selected data must satisfy equation (9) and must not be repeated. By assigning different values to w₁ and w₂, candidate seeds and labeled data are selected in different proportions, and the two selections are finally fed together into the WSEO method for joint training.

U*w₁ + L*w₂ = L*n (9)

where the parameter ranges must satisfy the following formula:

w₁ ∈ [0,n], w₂ ∈ [0,n], n ∈ [0,1], w₁ + w₂ = n (10)
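A minimal sketch of the data selection follows. It assumes that both proportions are measured against the number of labeled pairs, which is the reading under which the JA-EN ablation figures (1500 labeled pairs plus 3000 candidate seeds, L:U = 1:2) add up to L*n. The function name and the assumption that candidates arrive sorted by D_final are hypothetical.

```python
import random

def select_training_pairs(candidates, labeled, w1, w2, seed=0):
    """Pick candidate seeds and labeled pairs so that the total is |L| * (w1 + w2).

    candidates: candidate seed pairs, assumed sorted by D_final, best first.
    labeled:    labeled (pre-aligned) pairs.
    """
    n_cand = round(len(labeled) * w1)   # share taken from set U (assumption: relative to |L|)
    n_lab = round(len(labeled) * w2)    # share taken from set L
    rng = random.Random(seed)
    chosen_lab = rng.sample(labeled, n_lab)
    used = set(chosen_lab)
    # Selected data must not be repeated across the two sources.
    pool = [pair for pair in candidates if pair not in used]
    if n_cand > len(pool):
        raise ValueError("not enough candidate seeds for the requested proportion")
    return pool[:n_cand], chosen_lab

# Usage mirroring L:U = 1:2 with n = 0.3 on a 15,000-pair dataset.
labeled = [("a%d" % i, "b%d" % i) for i in range(15000)]
candidates = [("c%d" % i, "d%d" % i) for i in range(4279)]
cand_sel, lab_sel = select_training_pairs(candidates, labeled, w1=0.2, w2=0.1)
```

Raising an error when too few candidates are available is a design choice of this sketch only.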

1.3.1 relational awareness attention layer

The input of the model is two matrices: H^e ∈ R^{|E|×d}, representing the initial entity vectors, and H^r ∈ R^{|R|×d}, representing the initial relation features. The model uses a relation-aware attention mechanism to aggregate the neighbor information of an entity. The output vector of entity e_i at the l-th layer can be obtained by:

In the above formula, tanh is the activation function, and the relation projection operation generates a relation-specific vector for each entity without introducing additional parameters. The attention weight α_ijk is calculated as follows:

where v^T is an attention vector, and the softmax operation selects the critical paths connecting to an entity from among the various relation types. In previous studies, GNNs extend to multi-hop neighbor information by stacking multiple layers; the model therefore concatenates the embeddings of different layers to obtain the multi-hop embedding of entity e_i:

In the above formula, || denotes the concatenation operation. During testing, the model uses CSLS [Conneau A, Lample G, Ranzato M A, et al. Word translation without parallel data [J]. arXiv preprint arXiv:1710.04087, 2017] to calculate the distance between entity pairs; the smaller the distance, the greater the likelihood that the pair is aligned.
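For reference, CSLS as defined by Conneau et al. can be sketched as below. CSLS is a similarity score (higher means closer), so a distance in the sense used above can be taken as its negative; that conversion, the function name and the default k = 10 are assumptions of this sketch.

```python
import numpy as np

def csls(src, tgt, k=10):
    """CSLS score matrix: 2*cos(x, y) - r_T(x) - r_S(y).

    r_T(x) is the mean cosine similarity of source x to its k nearest targets,
    r_S(y) the mean cosine similarity of target y to its k nearest sources.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T
    k_t = min(k, cos.shape[1])   # cannot use more neighbors than there are targets
    k_s = min(k, cos.shape[0])   # or sources
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Usage: identical embeddings on both sides align each entity with itself.
scores = csls(np.eye(3), np.eye(3), k=1)
predicted = scores.argmax(axis=1)
```

Subtracting the neighborhood averages penalizes "hub" entities that are close to many others, which plain cosine similarity does not do.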

1.3.2 agent vector attention layer

The simplified relation attention layer produces the embedded vectors of a single knowledge graph, whereas entity alignment must be carried out in a unified vector space; the model therefore uses an agent vector matching attention layer. We use a limited number of agent vectors to represent cross-graph alignment information: if two entities are equivalent, their similarity distributions over these agent vectors should be consistent. In this way, the layer captures cross-graph alignment information without computing node-to-node interactions; it only computes the similarities between all entities and the finite set of agent vectors, which operates similarly to clustering.

The inputs to the agent vector attention layer are two matrices: H^e ∈ R^{|E|×ld}, representing the entity embeddings captured by the simplified relation attention layer, and Q ∈ R^{n×ld}, representing the randomly initialized agent vectors, where n is the number of agent vectors. To obtain cross-graph entity embeddings, the first step calculates the similarity between each entity and all agent vectors:

S_p represents the set of agent vectors. Here we use cosine similarity to calculate the similarity between embedded vectors. The cross-graph embedding of entity e_i can be calculated by the following formula:

The resulting vector describes the differences between the embedding of e_i and all agent vectors. Finally, we use a gate mechanism to combine the single-graph embedding and the cross-graph embedding of e_i, which controls the information flow between the single graph and the cross graph:

M and b represent the gate weight matrix and the gate bias, respectively.
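One possible reading of this layer is sketched below in NumPy. Every formula in it is an assumption modeled on the published Dual-AMN design (where agent vectors are called proxy vectors): a softmax over cosine similarities, a weighted sum of differences to the agent vectors, and a sigmoid gate. It is not presented as the exact computation of the invention.

```python
import numpy as np

def agent_vector_attention(h, q, m, b):
    """h: (num_entities, dim) single-graph embeddings; q: (n_agents, dim) agent vectors.

    m: (dim, dim) gate weight matrix; b: (dim,) gate bias.
    """
    h_unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q, axis=1, keepdims=True)
    sim = h_unit @ q_unit.T                          # cosine similarity to every agent vector
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    beta = e / e.sum(axis=1, keepdims=True)          # softmax attention over agent vectors
    # sum_j beta_ij * (h_i - q_j) equals h_i - sum_j beta_ij * q_j because each row of beta sums to 1.
    h_cross = h - beta @ q
    gate = 1.0 / (1.0 + np.exp(-(h_cross @ m + b)))  # sigmoid gate
    return gate * h + (1.0 - gate) * h_cross         # mix single-graph and cross-graph information

# Usage with random toy inputs (5 entities, 4 agent vectors, dimension 8).
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
q = rng.normal(size=(4, 8))
out = agent_vector_attention(h, q, m=np.zeros((8, 8)), b=np.zeros(8))
```

Because each entity interacts only with n agent vectors, the cost grows with |E|·n rather than with the number of entity pairs.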

1.4 seed reliability sensitive loss function optimization

The candidate seeds selected by the optimization selection algorithm have different degrees of credibility, and this credibility, as weight information between entity pairs, influences model training to different degrees. The loss function is therefore modified during training: assigning the weight D_final(u, v) to each pair of entities (u, v) makes the training result more reasonable. Beyond the weights obtained by the optimization selection algorithm, we likewise consider that the negative samples related to these entities share the confidence score of the candidate seed being evaluated. In addition, we set the weight score of the labeled data to 1. We reduce the model's dependence on sample size and hyperparameters by normalizing the sample loss to a fixed mean and variance.

where P represents the set of positive samples and e' represents an entity in a negative sample.

l_n(e_i, e_j, e'_j) represents the normalized loss of the triple (e_i, e_j, e'_j), and τ and λ² represent the mean and variance of the normalized loss, respectively. l_n(e_i, e_j, e'_j) is defined as follows:

l_o(e_i, e_j, e'_j) = γ + sim(e_i, e_j) - sim(e_i, e'_j) (20)

In the above, l_o(e_i, e_j, e'_j) represents the initial loss of the triple (e_i, e_j, e'_j), and μ and σ² represent the mean and variance of the initial loss, calculated as:

The calculation of l_n(e_j, e_i, e'_i) is similar to that of l_n(e_i, e_j, e'_j).

During training, we select candidate seed sets in different proportions as supplementary data, retain a small part of the labeled data, and use the L₂ distance to calculate the similarity between entities.
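A hedged sketch of a credibility-weighted, normalized loss is given below. The initial loss follows equation (20) with L₂ distances; rescaling it to mean τ and standard deviation λ and aggregating with a log-sum-exp follow the Dual-AMN design. Where the weight D_final enters is not recoverable from the text, so applying it as a multiplicative factor inside the log-sum-exp is purely an assumption, as are all names.

```python
import numpy as np

def weighted_normalized_loss(pos_dist, neg_dist, weights, gamma=1.0, tau=10.0, lam=30.0):
    """pos_dist: (P,) L2 distance of each positive pair.
    neg_dist:  (P, K) L2 distances to K negative entities per pair.
    weights:   (P,) D_final for candidate seeds, 1.0 for labeled pairs.
    """
    l_o = gamma + pos_dist[:, None] - neg_dist            # Eq. (20), one value per triple
    mu, sigma = l_o.mean(), l_o.std()
    l_n = lam * (l_o - mu) / (sigma + 1e-8) + tau         # rescaled to mean tau, std lam
    # Assumption: the pair weight is shared by all negatives of that pair.
    z = l_n + np.log(np.maximum(weights, 1e-12))[:, None]
    z_all = np.concatenate([[0.0], z.ravel()])            # the leading 0 is the "1 +" term
    z_max = z_all.max()
    return float(z_max + np.log(np.exp(z_all - z_max).sum()))  # stable log(1 + sum exp(z))

# Usage: lowering the credibility of the pairs lowers their contribution.
rng = np.random.default_rng(0)
pos = rng.uniform(0.1, 1.0, size=4)
neg = rng.uniform(0.5, 2.0, size=(4, 3))
loss_full = weighted_normalized_loss(pos, neg, np.ones(4))
loss_half = weighted_normalized_loss(pos, neg, np.full(4, 0.5))
```

Normalizing before aggregation keeps the scale of the loss independent of batch size and of the raw distance range, which is the stated motivation for fixing the mean and variance.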

To verify the effect of the present invention, the following experiments were performed.

2. Experiment

The model is developed with the Keras framework. Experiments were run on a workstation with an NVIDIA TITAN RTX GPU and 128 GB of memory.

2.1 Experimental setup

2.1.1 data sets

To verify the validity of the model, we select three typical cross-language knowledge graph datasets in the entity alignment domain to verify the weakly supervised entity alignment work; the datasets are described in detail below.

DBP15K: contains knowledge graphs in four languages, namely English (En), Chinese (Zh), French (Fr) and Japanese (Ja), all extracted from DBpedia; each knowledge graph contains about 65k-106k entities. The dataset includes three cross-language sets constructed from DBpedia: English-French (DBP_EN-FR), English-Chinese (DBP_EN-ZH) and English-Japanese (DBP_EN-JA). Each set contains 15,000 pre-aligned entity pairs for training and testing.

Table 2 lists the statistics of this dataset. In this study we use 0-30% of the labeled data as training data and train on it in combination with candidate seeds; the remaining labeled data serves as test data.

Table 2 DBP15K dataset

2.1.2 reference methods and evaluation indicators

Metrics. To make a fair comparison with previous work, we use the same evaluation criteria: Hit@k and mean reciprocal rank (MRR). The Hit@k score is the proportion of test entities whose correct counterpart is ranked within the top k of the alignment results. In particular, Hit@1 represents accuracy. To be more convincing, each experiment was run five times and the average performance is reported.
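The two metrics can be computed from a score matrix as sketched below. The convention that the correct target of test entity i sits at column i, and the function name, are assumptions made for illustration.

```python
import numpy as np

def hits_and_mrr(scores, ks=(1, 10)):
    """scores[i, j]: similarity of test source entity i to target entity j.

    The true counterpart of source i is assumed to be target i.
    """
    true_scores = np.diag(scores)
    # Rank of the true target = 1 + number of targets scored strictly higher.
    ranks = (scores > true_scores[:, None]).sum(axis=1) + 1
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr

# Usage: the true targets are ranked 1st, 2nd and 3rd respectively.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.5, 0.1],
                   [0.2, 0.3, 0.1]])
hits, mrr = hits_and_mrr(scores, ks=(1, 2))
```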

Experimental setup. For all datasets we use the same settings: embedding dimension d=100; number of GNN layers l=2; number of agent vectors n=64; margin γ=1; the new mean and variance in the normalized loss are set to τ=10 and λ=30; training batch size 1024; learning rate 0.005; the semantic-character balancing factor is set to β=0.5, and the preference factor is set to δ=0.3.

Baseline method. To further verify the effectiveness of the weakly supervised algorithm of the present invention, we compare it with several typical supervised algorithms, which are described in detail below:

BootEA is a typical semi-supervised entity alignment method that selects entities with higher similarity as aligned entities in each iteration.

NAEA [Zhu Q, Zhou X, Wu J, et al. Neighborhood-Aware Attentional Representation for Multilingual Knowledge Graphs [C]// IJCAI. 2019: 1943-1949] combines the hierarchical information of neighborhood subgraphs and weights the representations of aggregated neighbors to learn entity embeddings.

TransEdge centers on edges and ties entities and relations together in the relation embeddings, in order to handle one-to-many and many-to-one relations.

MRAEA [Mao X, Wang W, Xu H, et al. MRAEA: an efficient and robust entity alignment approach for cross-lingual knowledge graph [C]// Proceedings of the 13th International Conference on Web Search and Data Mining. 2020: 420-428] is a supervised algorithm that considers neighbors and their relations, and proposes a bi-directional iterative strategy to add new alignment seeds during training.

GCN-Align combines entity structure information and attribute information to learn cross-language embeddings, using a GCN to embed each language into a unified vector space; it is a typical supervised approach based on graph convolutional networks.

KECG [Li C, Cao Y, Hou L, et al. Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 2723-2732] is a semi-supervised entity alignment method that jointly trains a knowledge embedding model and a cross-graph model.

AliNet introduces distant neighbors to expand the overlap between neighborhood structures and uses an attention mechanism to reduce the noise introduced by distant neighbors. It combines the information of direct and distant neighbors to complete the entity alignment task.

2.2 experimental results

In Table 3, we compare the experimental performance of the method of the present invention with the Dual-AMN method on three cross-language datasets. WSEO sets its training data according to equation (9), and L:U denotes the ratio of labeled data to candidate seeds. The bolded entries in the table represent the percentage performance improvement of WSEO over the Dual-AMN method when only 10% of the labeled data is used.

Table 3 experimental performance after addition of supplemental data

As can be seen from the above table, as candidate seeds are added, the performance of WSEO on the three metrics steadily improves. When L:U is 1:2, WSEO performs close to Dual-AMN on Hit@1, Hit@10 and MRR, differing from Dual-AMN by approximately 1% on Hit@1. The method achieves experimental performance comparable to Dual-AMN while using only 10% of the labeled data, so WSEO can reduce the labeled data by 20%, realizing the weakly supervised entity alignment task for cross-language knowledge graphs. The ZH-EN dataset performs poorly with the WSEO model because, under the same preference factor, there are not enough candidates to screen, so all candidate data are used as supplementary data. Notably, when L:U is 0:3, WSEO uses only candidate seeds, amounting to 30%, as training data, and the method becomes an unsupervised entity alignment task. FIG. 6 shows the experimental performance of WSEO and Dual-AMN using equal amounts of labeled data on the three datasets JA-EN, FR-EN and ZH-EN.

As can be seen from fig. 6, WSEO outperforms the Dual-AMN entity alignment method on all three datasets, which shows that the weakly supervised entity alignment algorithm for cross-language knowledge graphs proposed in the present invention effectively reduces the use of labeled data while performing stably.

To further verify the performance advantage of the method, we compare it with existing supervised methods. Table 4 reports the performance of several typical supervised entity alignment methods on the three metrics Hit@1, Hit@10 and MRR, with L:U set to 1:2; bold font marks the best result among all methods.

Table 4 Experimental performance of different entity alignment methods

Experimental results show that our method achieves superior performance on both the JA-EN and FR-EN datasets while relying on only 10% of the labeled data. WSEO numerically outperforms the baseline methods on the three evaluation metrics Hit@1, Hit@10 and MRR; its Hit@1 value is improved by 3%-45.3% and its MRR by 2%-34.4%. Among these metrics, Hit@1 directly reflects the accuracy of entity alignment, so the strong Hit@1 results further demonstrate the effectiveness of the model. WSEO performs poorly on the ZH-EN dataset because, under equal parameters, the ZH-EN dataset does not have enough candidate seeds to supplement the labeled data, so WSEO uses all candidate seeds as supplementary data. When L:U is 2:1, the ZH-EN dataset has enough candidate seeds for the optimization selection algorithm to evaluate, and WSEO then also outperforms all baselines on the ZH-EN dataset.

GCN-Align performs relatively weakly in the above table because it uses only structural information to generate alignment seeds. TransEdge ranks second among all methods because it exploits the diversity of relations, which verifies the importance of relation information in entity alignment. MRAEA performs best among the baseline methods because it uses meta-relation and neighbor information and adds new alignment seeds during training. Compared with the three methods, the Hit@1 performance of WSEO is improved by 3%, 4.6% and 2.2%, which demonstrates that the proposed method reduces the use of labeled data while remaining superior in performance.

2.3 ablation experiments

To verify the effectiveness of the candidate seed optimization selection algorithm, we perform ablation experiments on the DBP_JA-EN dataset. In the JA-EN dataset, 4279 candidate seeds participate in the optimization selection evaluation; from these we select 3000 candidate seed pairs as supplementary data and correspondingly 1500 pairs of labeled data, so that L:U is 1:2, and feed both into the model for joint training, which gives the experimental results with the optimization selection algorithm (NSS). We then experiment on the model with the optimization selection algorithm removed (ASN): the 4279 initial candidate seeds with higher weight in the TopN file are taken as the selection space for supplementary data, without being evaluated by the optimization selection algorithm; 3000 entity pairs are randomly selected from these 4279 as supplementary data and 1500 pairs of labeled data are correspondingly selected, so that L:U is again 1:2. Table 5 shows the experimental results on the JA-EN dataset with and without candidate seed optimization selection.

TABLE 5 performance of JA-EN dataset on model after removal of candidate seed optimization selection algorithm

Table 5 shows that performance on the JA-EN dataset drops significantly on the evaluation indicators after the optimization selection algorithm is removed from WSEO. With the candidate seed optimization selection algorithm in place, WSEO reaches an accuracy of 78.9% with 10% of the labeled data, and its Hit@10 and MRR values reach 93.8% and 84.4% respectively; after the optimization selection algorithm is removed, WSEO reaches only 67.6% accuracy on the JA-EN dataset, a Hit@1 drop of about 11%, while the Hit@10 and MRR values drop by 5.6% and 9.5% respectively. The experimental results show that the optimization selection algorithm effectively selects candidate seeds with higher credibility, which in turn greatly affects the accuracy of entity alignment.

On the basis of the above embodiment, as shown in fig. 7, the present invention further provides a weak supervision entity alignment optimization system oriented to a cross-language knowledge graph, including:

wherein l_t represents the number of triples,

Further, the joint training module is specifically configured to:

U*w₁ + L*w₂ = L*n (9)

wherein, each parameter range needs to satisfy the following formula:

w₁ ∈ [0,n], w₂ ∈ [0,n], n ∈ [0,1], w₁ + w₂ = n (10)

Further, the loss function improvement module is specifically configured to:

In summary, starting from the shortcomings of existing entity alignment methods, the invention analyzes the possibility of interaction between labeled and unlabeled data and provides a weakly supervised entity alignment optimization method and system for cross-language knowledge graphs. The invention takes the candidate seed set generated by an unsupervised method as supplementary data, combines semantic-level and character-level distance metrics to select candidate seeds with higher credibility, feeds the candidate seeds and the labeled data into the model for joint training, and optimizes the whole training process with a loss function based on credibility weights. Experimental results show that WSEO achieves results comparable to the Dual-AMN method with only 10% of the labeled data, and that its Hit@1 value is improved by at least 3% compared with all baseline methods.

The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims

1. A cross-language knowledge graph-oriented weak supervision entity alignment optimization method is characterized by comprising the following steps:

2. The cross-language knowledge graph oriented weak supervision entity alignment optimization method according to claim 1, wherein the step 1 comprises:

wherein l_t represents the number of triples,

3. The cross-language knowledge graph oriented weak supervision entity alignment optimization method according to claim 1, wherein the step 2 comprises:

4. The cross-language knowledge graph oriented weak supervision entity alignment optimization method according to claim 1, wherein the step 3 comprises:

U*w₁ + L*w₂ = L*n (9)

wherein, each parameter range needs to satisfy the following formula:

w₁ ∈ [0,n], w₂ ∈ [0,n], n ∈ [0,1], w₁ + w₂ = n (10)

5. The cross-language knowledge graph oriented weak supervision entity alignment optimization method according to claim 1, wherein the step 4 comprises:

6. A cross-language knowledge graph oriented weak supervisory entity alignment optimization system, comprising:

7. The cross-language knowledge graph oriented weak supervision entity alignment optimization system of claim 6, wherein the preliminary candidate seed set derivation module is specifically configured to:

wherein l_t represents the number of triples,

8. The cross-language knowledge graph oriented weak supervisory entity alignment optimization system of claim 6, wherein the candidate seed optimization selection module is specifically configured to:

9. The cross-language knowledge graph oriented weak supervisory entity alignment optimization system of claim 6, wherein the joint training module is specifically configured to:

U*w₁ + L*w₂ = L*n (9)

wherein, each parameter range needs to satisfy the following formula:

w₁ ∈ [0,n], w₂ ∈ [0,n], n ∈ [0,1], w₁ + w₂ = n (10)

10. The cross-language knowledge graph oriented weak supervisory entity alignment optimization system of claim 6, wherein the loss function improvement module is specifically configured to:

assigning weight information D_final(u, v) to each pair of entities (u, v) in the optimally selected candidate seeds during training, setting the weight score of the labeled data to 1, and reducing the dependence on sample size and hyperparameters through a sample loss with fixed mean and variance, thereby completing the improvement of the loss function and finally optimizing the whole training process.