CN113656596A - Multi-modal entity alignment method based on triple screening fusion - Google Patents

Multi-modal entity alignment method based on triple screening fusion

Info

Publication number
CN113656596A
CN113656596A
Authority
CN
China
Prior art keywords
entity
picture
sim
entities
similarity
Prior art date
Legal status
Granted
Application number
CN202110950895.5A
Other languages
Chinese (zh)
Other versions
CN113656596B (en)
Inventor
唐九阳
郭浩
赵翔
曾维新
刘丽
郭延明
肖卫东
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110950895.5A
Publication of CN113656596A
Application granted
Publication of CN113656596B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal entity alignment method based on triple screening fusion, which comprises the following steps: acquiring data of two multi-modal knowledge graphs; quantifying the importance of the triples with an unsupervised triple screening module and filtering out some invalid triples based on the importance scores; learning the structure vectors of the entities of the two multi-modal knowledge graphs with a graph convolutional neural network to generate the structural feature representation of each entity; generating the visual feature representation of each entity; and combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to perform entity alignment. To address the poor utilization of visual information, entity-picture similarity scores are calculated and a more accurate entity visual feature representation is obtained based on the similarity; triple scores are generated from the PageRank scores and the entity degrees, the triples are filtered, and the structural differences between different knowledge graphs are mitigated, so that the alignment effect is better.

Description

Multi-modal entity alignment method based on triple screening fusion
Technical Field
The invention relates to the technical field of knowledge graphs in natural language processing, and in particular to a multi-modal entity alignment method based on triple screening fusion.
Background
In recent years, knowledge graphs have become a widely used representation of structured data. A knowledge graph represents real-world knowledge or events in the form of triples and is widely used in various artificial intelligence downstream tasks. At present, multi-modal knowledge graphs are often constructed from limited data sources and suffer from information loss and low coverage, so the knowledge utilization rate is not high. Considering that manually completing a knowledge graph is costly and inefficient, one possible approach to improve coverage is to automatically integrate useful knowledge from other knowledge graphs. Entities serve as the hubs linking different knowledge graphs and are very important for integrating multiple multi-modal knowledge graphs. The process of identifying entities in different multi-modal knowledge graphs that express the same meaning is referred to as multi-modal entity alignment.
Multi-modal entity alignment requires utilizing and fusing information of multiple modalities. However, existing multi-modal entity alignment methods encounter two bottlenecks. First, structural differences between graphs are difficult to handle. Based on the assumption that equivalent entities in different knowledge graphs usually have equivalent neighbor entities, current mainstream entity alignment methods mainly depend on the structural information of the knowledge graph. However, in the real world, knowledge graphs built in different ways may have large structural differences. For such problems, triples can be generated by link prediction to enrich the structural information; although this alleviates the structural diversity problem to some extent, the reliability of the generated triples must be considered, and completion is difficult when the numbers of triples differ by several times. Second, visual information is poorly utilized. Current automated methods for constructing multi-modal knowledge graphs typically complement an existing knowledge graph with information of other modalities. To obtain visual information, these methods mainly use crawlers to retrieve pictures related to an entity from the internet. However, the retrieved pictures inevitably contain some pictures with low relevance, i.e., noise pictures. Current methods cannot distinguish the noise pictures among the pictures related to an entity, so some noise is mixed into the visual information of the entity, which further reduces the accuracy of the visual information for entity alignment.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention discloses a multi-modal entity alignment method based on triple screening fusion.
The technical scheme of the invention is that a multi-modal entity alignment method based on triple screening fusion comprises the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triplet screening module, and filtering part of invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relationship;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities;
in the triple screening module, a relation-entity graph, also called the relational dual graph of the knowledge graph, is constructed by taking relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set; the relational dual graph G_r takes relations as nodes, and if two different relations are connected by the same entity, an edge exists between the two relation nodes; V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r); based on the relational dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
[Equation: the triple scoring function Score(h, r, t), which combines the PageRank score PR(r) of the relation with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, namely the number of edges associated with each entity; based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
Specifically, the visual feature processing module in step 4 comprises: step 301, generating picture-entity similarities by using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter out noise pictures; and step 303, giving each picture a corresponding weight based on its similarity to the entity to generate the visual feature representation of the entity.
Further, in step 301, a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, is used to calculate the similarity score of each picture in the entity picture set; the inputs of the CVSE model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {a photo of [Entity Name]}; the picture embeddings and the text information are then fed into the CVSE model to obtain the similarity scores of the pictures in the entity image set:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter out noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, n' is the number of pictures after noise removal, and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
Specifically, the structural feature learning module in step 3 captures entity neighborhood structure information by using a graph convolutional neural network and generates the entity structural feature representation:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, they need to be mapped into the same space by using the known entity pairs S, and the specific training goal is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
Further, in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively, which are fixed weights or random weights.
Preferably, in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively;
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of the entity, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
Compared with the prior art, the method has the following advantages: to address the poor utilization of visual information, this work calculates entity-picture similarity scores based on a pre-trained image-text matching model, filters out noise pictures, and obtains a more accurate entity visual feature representation based on the similarity scores; the structural features and visual features of an entity are fused with variable attention, making full use of the complementarity of multi-modal information and improving the alignment effect; and an innovative triple screening mechanism is designed, which generates triple scores from the PageRank scores and the entity degrees, filters the triples, and mitigates the structural differences between different knowledge graphs.
Drawings
FIG. 1 shows a schematic flow diagram of an embodiment of the invention;
FIG. 2 illustrates a multi-modal entity alignment framework diagram of an embodiment of the present invention;
FIG. 3 shows a schematic flow chart of a visual feature processing module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a multi-modal entity alignment method based on triplet screening fusion, comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triplet screening module, and filtering part of invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relationship;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
and step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities.
A multi-modal knowledge graph typically contains information of multiple modalities. Without loss of generality, this work focuses only on the structural and visual information of the knowledge graph. Two multi-modal knowledge graphs are given: MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), where E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities. The seed entity pair set S = {(e_1, e_2) | e_1 ∈ E_1, e_2 ∈ E_2, e_1 = e_2} represents the set of aligned entity pairs used for training. The multi-modal entity alignment task aims to find new entity pairs using the known entity pair information and to predict the potential alignment results {(e_1, e_2) | e_1 ∈ E_1, e_2 ∈ E_2, e_1 = e_2}, wherein the equal sign indicates that the two entities refer to the same entity in the real world.
Given an entity, the process of finding its corresponding entity in another knowledge graph can be regarded as a ranking problem. That is, in a certain feature space, the similarity (distance) of the given entity to all entities in the other knowledge graph is calculated and the candidates are ranked; the entity with the highest similarity (smallest distance) is taken as the alignment result.
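For illustration only, the following minimal sketch (not part of the patented embodiment) shows this ranking view of alignment with NumPy; the embedding matrices and the choice of negative Manhattan distance as the similarity measure are assumptions made for the example.

```python
# Illustrative sketch: entity alignment as a ranking problem.
import numpy as np

def align_by_ranking(emb1: np.ndarray, emb2: np.ndarray, top_k: int = 10):
    """emb1: (n1, d) entity embeddings of KG1; emb2: (n2, d) entity embeddings of KG2,
    assumed to lie in a shared space."""
    # Negative Manhattan (L1) distance as the similarity measure (an assumption;
    # cosine similarity would work just as well for the sketch).
    sim = -np.abs(emb1[:, None, :] - emb2[None, :, :]).sum(axis=-1)   # (n1, n2)
    ranking = np.argsort(-sim, axis=1)     # KG2 candidates sorted by descending similarity
    best = ranking[:, 0]                   # top-1 candidate = predicted aligned entity
    return sim, ranking[:, :top_k], best
```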
As shown in FIG. 2, the present invention first designs a multi-modal entity alignment framework: the structure vectors of the entities are learned with a graph convolutional neural network to generate the entity structural features; a visual feature processing module is designed to generate the entity visual features; and the information of the two modalities is then combined for entity alignment based on an adaptive feature fusion mechanism. In addition, to mitigate the structural differences between knowledge graphs, this embodiment designs a triple screening mechanism that integrates the relation score and the entity degree and filters out some of the triples. In FIG. 2, MG_1 and MG_2 represent different multi-modal knowledge graphs; KG_1 and KG_2 represent knowledge graphs; and KG_1' represents the knowledge graph after processing by the triple screening module.
Visual feature processing module
To solve the problem of poor visual information utilization in multi-modal entity alignment methods, and inspired by image-text matching models, this work designs a visual feature processing module that generates more accurate visual features for entities to help entity alignment. FIG. 3 details the generation of the entity visual features. In the absence of supervision data, the picture-entity similarities are generated by the pre-trained image-text matching model CVSE; a similarity threshold is then set to filter out noise pictures; and each picture is given a corresponding weight based on its similarity score, finally generating the visual feature representation of the entity.
A picture-entity similarity score is calculated. This step uses a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, to calculate a similarity score for each picture in the entity picture set; the model parameters were obtained by training on the MSCOCO and Flickr30k datasets. The inputs of the model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN. The entity text information t_i input to the model is obtained by expanding the entity name [Entity Name] into a sentence: t_i = {a photo of [Entity Name]}.
The picture embeddings and the text information are then fed into the CVSE model, and the similarity scores of the pictures in the entity image set are obtained:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity.
The noise pictures are then filtered. The method considers that the entity picture set contains some pictures with low similarity, which affect the precision of the visual information. In view of this, a similarity threshold α is set to filter out the noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
where set(i) represents the initial picture set and set(i)' represents the picture set after the noise pictures are filtered out.
The entity visual feature representation is then generated. The picture filtering mechanism yields the entity picture set; weights are given based on the picture similarity scores, and finally a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, and n' is the number of pictures after noise removal. Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
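As a concrete illustration, the following is a minimal sketch of this module under stated assumptions: the array sim_v stands in for the CVSE picture-entity similarity scores and img_feats for the ResNet picture features, both of which would come from the pre-trained models.

```python
# Sketch of the visual feature processing module: filter noise pictures by a
# similarity threshold, then build the entity visual feature as an
# attention-weighted sum of the remaining picture features.
import numpy as np

def entity_visual_feature(sim_v: np.ndarray, img_feats: np.ndarray, alpha: float) -> np.ndarray:
    """sim_v: (n,) similarity of each picture to the entity; img_feats: (n, 2048)."""
    keep = sim_v > alpha                              # step 302: filter out noise pictures
    if not keep.any():                                # fallback (assumption) if all pictures are filtered
        keep = np.ones_like(sim_v, dtype=bool)
    sim_kept, feats_kept = sim_v[keep], img_feats[keep]
    att = np.exp(sim_kept) / np.exp(sim_kept).sum()   # step 303: softmax attention weights
    return att @ feats_kept                           # (2048,) visual feature V_i
```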
Structural feature learning module
This embodiment employs a graph convolutional neural network (GCN) to capture entity neighborhood structure information and generate the entity structure representation vectors. The GCN is a convolutional network that acts directly on graph-structured data and generates the corresponding node structure vectors by capturing the structural information around each node:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l.
Since the entity structure vectors of different knowledge graphs are not in the same space, it is necessary to map them into the same space using the known entity pairs S. The training objective is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; and γ represents the margin separating positive and negative samples. Model optimization is performed with stochastic gradient descent.
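The following PyTorch sketch illustrates this module under stated assumptions (the layer sizes, the two-layer network, and the one-negative-per-positive pairing are illustrative choices, not taken from the patent):

```python
# Sketch of the structural feature learning module: a GCN over the normalized
# adjacency matrix and a margin-based alignment loss with Manhattan (L1) distance.
import torch
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, dim: int = 300, layers: int = 2):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(dim, dim)) for _ in range(layers)])
        for w in self.weights:
            nn.init.xavier_uniform_(w)

    def forward(self, a_norm: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # a_norm: pre-computed normalized adjacency D^-1/2 (A + I) D^-1/2; h: node features
        for w in self.weights:
            h = torch.relu(a_norm @ h @ w)
        return h

def alignment_loss(h1, h2, pos_pairs, neg_pairs, gamma: float = 3.0):
    """Margin loss: seed pairs should be closer (L1) than corrupted pairs.
    pos_pairs, neg_pairs: (m, 2) long tensors of indices, assumed aligned one-to-one."""
    d_pos = (h1[pos_pairs[:, 0]] - h2[pos_pairs[:, 1]]).abs().sum(-1)
    d_neg = (h1[neg_pairs[:, 0]] - h2[neg_pairs[:, 1]]).abs().sum(-1)
    return torch.relu(d_pos + gamma - d_neg).mean()
```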
A multi-modal knowledge graph contains information of at least two modalities, and multi-modal entity alignment needs to fuse the information of the different modalities. Existing approaches combine the different embeddings into one unified representation space, which requires additional training to uniformly represent unrelated features. A preferable strategy is to first compute a similarity matrix within each feature-specific space and then combine the feature similarity scores.
Formally, a structural feature representation S and a visual feature representation V are given. For each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity scores are then used to predict potential alignment entities. To calculate the overall similarity, the feature-specific similarity scores between the entity pair, i.e., SIM_s(e_1, e_2) and SIM_v(e_1, e_2), are first computed. These similarity scores are then combined:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein Att_s and Att_v represent the contribution-rate weights of the structural information and the visual information, respectively; the weights can be fixed weights, random weights, or calculated weights.
The features of different modalities characterize an entity from different perspectives and have some correlation and complementarity. Current methods combine the structural information and the visual information with fixed contribution-rate weights and ignore the difference in the contribution rate of structural information across entities. For entities with poor structural information, the visual feature representation should be trusted more. Moreover, intuitively, the closeness of the association between an entity and the seed entities is positively correlated with the accuracy of its structural features.
To capture the dynamic change in the contribution rates of the different modalities, and inspired by a degree-aware joint attention mechanism, an adaptive feature fusion mechanism is further designed that combines the entity degree with the closeness of the association between the entity and the seed entities:
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
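A minimal sketch of this fusion step follows. The exact expression for Att_s is given only as an image in the original, so the logistic form used below in (degree + N_hop), with hyper-parameters K, a and b, is purely an illustrative assumption.

```python
# Sketch of adaptive feature fusion: an entity with richer structure (higher
# degree, closer to the seeds) gets a larger structural weight Att_s.
import numpy as np

def n_hop(n_1hop: int, n_2hop: int, w1: float = 0.8, w2: float = 0.1) -> float:
    # closeness to the seed entities: N_hop = n_1-hop * w1 + lg(n_2-hop * w2)
    return n_1hop * w1 + np.log10(max(n_2hop * w2, 1e-9))

def fuse_similarity(sim_s: float, sim_v: float, degree: int, nhop: float,
                    K: float = 0.6, a: float = 1.0, b: float = 1.5) -> float:
    att_s = K / (1.0 + b * np.exp(-a * (degree + nhop)))   # assumed form, not from the patent
    att_v = 1.0 - att_s
    return sim_s * att_s + sim_v * att_v
```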
Further, before performing step 3 to obtain the structural feature representation and step 4 to obtain the visual feature representation, the importance of the triples (h, r, t) is quantified using an unsupervised triple screening module, and some invalid triples are filtered out based on the importance scores.
The structural information of the knowledge graph is represented as triples (h, r, t), wherein h represents a head entity, t represents a tail entity, and r represents a relation. The numbers of triples in different knowledge graphs can differ greatly, which substantially degrades the effect of entity alignment based on structural information. To mitigate the structural differences between different knowledge graphs, this work designs an unsupervised triple screening module that quantifies the importance of the triples and filters out some invalid triples based on the importance scores. The triple importance score incorporates the PageRank score of the relation r and the degrees of the entities h and t.
The PageRank score is calculated first. A relation-entity graph, also called the relational dual graph of the knowledge graph, is constructed with relations as nodes and entities as edges. The knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set. The relational dual graph G_r takes relations as nodes; if two different relations are connected by the same head entity (or tail entity), an edge exists between the two relation nodes. V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r).
Based on the relational dual graph generated above, this embodiment calculates the relation scores using the PageRank algorithm. PageRank is a representative algorithm for link analysis on graph data and belongs to the unsupervised learning methods. Its basic idea is to define a random walk model on a directed graph, describing a random walker visiting nodes along the edges of the graph. Under certain conditions, the probability of visiting each node converges in the limit to a stationary distribution, and the stationary probability of each node is its PageRank value, which represents the importance of the node. Inspired by this algorithm, the PageRank value of each relation is calculated on the relational dual graph of the knowledge graph to represent the importance of the relation:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections (i.e., the degree) of relation v.
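For illustration, a minimal sketch of this step is given below; it builds the relational dual graph from the triples and computes the relation scores with the PageRank implementation in networkx (the use of networkx is an assumption made for the sketch).

```python
# Sketch of relation scoring: relations become nodes of the dual graph, and two
# relations are connected when they share an entity; PageRank then scores them.
import itertools
from collections import defaultdict
import networkx as nx

def relation_pagerank(triples):
    """triples: iterable of (head, relation, tail)."""
    relations_of_entity = defaultdict(set)
    g_r = nx.Graph()
    for h, r, t in triples:
        g_r.add_node(r)
        relations_of_entity[h].add(r)
        relations_of_entity[t].add(r)
    for rels in relations_of_entity.values():
        for r1, r2 in itertools.combinations(rels, 2):
            g_r.add_edge(r1, r2)        # the two relations are connected by the same entity
    return nx.pagerank(g_r)             # dict: relation -> PageRank score PR(r)
```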
The triple scoring mechanism. Triple screening aims, on the one hand, to filter out redundant or invalid relations and, on the other hand, to protect the structural characteristics of the knowledge graph. Since long-tail entities lacking structural information have only a few related triples, directly filtering out a relation based solely on the relation importance score would aggravate the lack of structural information of the long-tail entities. Therefore, this embodiment provides two triple scoring functions. The first directly adopts the PageRank score as the triple scoring function:
Score(h,r,t)=PR(r),
and the other adopts an improved PageRank score, combining the PageRank score of the relation with the degrees of the head and tail entities, to design the triple scoring function:
[Equation: the improved triple scoring function Score(h, r, t), which combines PR(r) with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, i.e., the number of edges associated with each entity. Based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
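The screening step itself can be sketched as follows; the default Score(h, r, t) = PR(r) is the first scoring function disclosed above, while the improved variant (shown only as an image in the original) can be passed in as a custom score_fn.

```python
# Sketch of triple screening: keep only triples whose score exceeds the threshold beta.
def filter_triples(triples, pr_scores, beta, score_fn=None):
    """triples: list of (h, r, t); pr_scores: dict relation -> PR(r)."""
    if score_fn is None:
        # default scoring rule Score(h, r, t) = PR(r)
        score_fn = lambda h, r, t: pr_scores.get(r, 0.0)
    return [(h, r, t) for (h, r, t) in triples if score_fn(h, r, t) > beta]

# Example usage (illustrative), with relation_pagerank from the sketch above:
# refined = filter_triples(triples, relation_pagerank(triples), beta=1e-3)
```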
In the experiments, this example uses the MMKG datasets, which were extracted from the knowledge bases FreeBase, DBpedia and YAGO. These datasets are based on FB15K; entities in FB15K are aligned with their equivalent entities in the other knowledge graphs using SameAs links between the knowledge graphs, thereby generating DB15K and YAGO15K. The experiments herein are performed on two pairs of multi-modal knowledge graphs, FB15K-DB15K and FB15K-YAGO15K.
Since the data set does not provide pictures, in order to obtain entity-related pictures, this embodiment uses URI data and designs a web crawler to parse query results from Image Search engines (i.e., Google Images, Bing Images, and Yahoo Image Search). Then, pictures obtained by different search engines are distributed to different MMKGs. In order to simulate the construction process of a real-world multi-modal knowledge graph, pictures with high similarity in an equivalent entity image set are removed, and a certain number of noise pictures are introduced. Table 1 describes the details of the data set. In experiments, pairs of known equivalent entities are used for model training and testing.
TABLE 1 Multi-modal knowledge graph statistics
[Table 1 is provided as an image in the original document.]
Evaluation metrics are as follows: the experiments use Hits@k (k = 1, 10) and Mean Reciprocal Rank (MRR) as evaluation metrics. For each entity in the test set, the entities in the other graph are ranked in descending order of their similarity score to that entity. Hits@k represents the percentage of test entities for which the correct entity appears among the top k candidates. MRR represents the mean of the reciprocal ranks of the correctly aligned entities. Hits@1 represents the alignment accuracy and is the most important metric, while Hits@10 and MRR provide supplementary information. Note that higher values of Hits@k and MRR indicate better performance, and the Hits@k results are expressed as percentages. The best results are marked in bold in the tables.
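A minimal sketch of how these metrics might be computed from a similarity matrix is given below (the array names are illustrative).

```python
# Sketch of Hits@k and MRR: rank candidates by descending similarity and record the
# 1-based rank of the true counterpart of each test entity.
import numpy as np

def hits_and_mrr(sim: np.ndarray, gold: np.ndarray, ks=(1, 10)):
    """sim: (n_test, n_candidates) similarity matrix; gold[i]: index of the correct entity."""
    order = np.argsort(-sim, axis=1)                              # descending similarity
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    hits = {k: float((ranks <= k).mean()) * 100.0 for k in ks}    # Hits@k as percentages
    mrr = float((1.0 / ranks).mean())
    return hits, mrr
```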
The experiments use the graph convolutional neural network to generate the entity structural features, with the number of negative samples set to 15, γ set to 3, 400 training rounds, and dimension d_s = 300; the visual features are generated by the visual feature processing module with dimension d_v = 2048. The seed entity proportion is set to 20% and 50%, and 10% of the entities are selected as a validation set for tuning the hyper-parameters in the formulas: b = 1.5 and a = 1; the value of the parameter K is related to the seed entity proportion, taking 0.6 when seed = 0.2 and 0.8 when seed = 0.5. The hyper-parameters w_1 and w_2 are set to 0.8 and 0.1, respectively.
TABLE 2 Multi-modal entity alignment results
[Table 2 is provided as an image in the original document.]
The method of this embodiment, and a variant of it with the triple screening module removed, are compared with two methods: (1) GCN-align, which uses a GCN to generate the entity structural and visual feature matrices and combines the two features with fixed weights for entity alignment; and (2) HMEA, which uses a hyperbolic graph convolutional neural network (HGCN) to generate the structural and visual feature matrices of the entities and combines the structural and visual features by weight in hyperbolic space for entity alignment. The method of this embodiment achieves the best multi-modal entity alignment performance to date.
In addition, to verify the effectiveness of the proposed triple screening module, three screening mechanisms, F_PageRank, F_Random and F_our, are compared, which respectively denote direct PageRank-score screening, random screening, and improved PageRank-score screening. To control the experimental variables, the same number of triples, about 290,000, was screened with each of the three mechanisms; the structural features are learned based on the graph convolutional neural network, and all parameters are kept consistent.
The experimental results show that, compared with the baseline that retains all triples, random screening F_Random increases Hits@1 by about 1.5% and 2.5% when seed = 0.2 and 0.5, respectively, indicating that the structural differences between graphs do affect entity alignment. Compared with random screening, the PageRank-score-based screening mechanism improves results by about 3% when the seed entity proportion is 50%. According to the results, the improved PageRank-score screening mechanism obtains the best alignment results: its Hits@1 improves over the baseline by more than 8% and 3%, respectively, on FB15K-DB15K, and by about 9% and 5%, respectively, on FB15K-YAGO15K.
Since the richness of structural information is related to the degree of an entity, the entities are divided into three groups according to their degree, and the accuracy of multi-modal entity alignment under the adaptive fusion mechanism and the fixed-weight mechanism of this embodiment is tested on each group. The seed entity ratio is set to 20%, the experiments are carried out on FB15K-DB15K and FB15K-YAGO15K, and the remaining parameters are consistent with the experiments above.
Table 3 shows the multi-modal entity alignment results of adaptive feature fusion and fixed-weight fusion, wherein Fixed and Adaptive represent the fixed-weight fusion mechanism and the adaptive feature fusion mechanism, respectively; Group1, Group2 and Group3 respectively represent the first 1/3, middle 1/3 and last 1/3 of the entities, divided in order of increasing entity degree. As can be seen from Table 3, the adaptive feature fusion mechanism achieves a better entity alignment effect than fixed-weight fusion on all groups of entities. The improvement on Group1 is clearly higher than that on Group2 and Group3, which shows that the adaptive feature fusion mechanism of this embodiment can significantly improve the alignment accuracy of entities with sparse structural information, i.e., long-tail entities.
TABLE 3 Multi-modal entity alignment results of adaptive feature fusion and fixed-weight fusion
[Table 3 is provided as an image in the original document.]
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (6)

1. A multi-modal entity alignment method based on triplet screening fusion is characterized by comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triple screening module, and filtering out some invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relation;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities;
in the triple screening module, a relation-entity graph, also called the relational dual graph of the knowledge graph, is first constructed with relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set; the relational dual graph G_r takes relations as nodes, and if two different relations are connected by the same entity, an edge exists between the two relation nodes; V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r); based on the relational dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
[Equation: the triple scoring function Score(h, r, t), which combines the PageRank score PR(r) of the relation with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, namely the number of edges associated with each entity; based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
2. The method of claim 1, wherein the visual feature processing module of step 4 comprises: step 301, generating picture-entity similarities using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter out noise pictures; and step 303, giving each picture a corresponding weight based on its similarity to the entity to generate the visual feature representation of the entity.
3. The method as claimed in claim 2, wherein in step 301, a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, is used to calculate the similarity score of each picture in the entity picture set; the inputs of the CVSE model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {a photo of [Entity Name]}; the picture embeddings and the text information are then fed into the CVSE model to obtain the similarity scores of the pictures in the entity image set:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter out noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, n' is the number of pictures after noise removal, and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
4. The method according to claim 2 or 3, wherein the structural feature learning module in step 3 captures entity neighborhood structure information by using a graph convolutional neural network and generates the entity structural feature representation:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, they need to be mapped into the same space by using the known entity pairs S, and the specific training goal is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
5. The method according to claim 4, wherein in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively, which are fixed weights or random weights.
6. The method according to claim 4, wherein in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively;
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of the entity, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
CN202110950895.5A 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion Active CN113656596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950895.5A CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110950895.5A CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Publications (2)

Publication Number Publication Date
CN113656596A true CN113656596A (en) 2021-11-16
CN113656596B CN113656596B (en) 2022-09-20

Family

ID=78481112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950895.5A Active CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Country Status (1)

Country Link
CN (1) CN113656596B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941722A (en) * 2019-10-12 2020-03-31 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment
CN110955780A (en) * 2019-10-12 2020-04-03 中国人民解放军国防科技大学 Entity alignment method for knowledge graph
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZENG Weixin et al., "Iterative entity alignment via re-ranking", Journal of Computer Research and Development *
DU Wenqian et al., "Knowledge graph representation learning method fusing entity description and type", Journal of Chinese Information Processing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064926A (en) * 2021-11-24 2022-02-18 国家电网有限公司大数据中心 Multi-modal power knowledge graph construction method, device, equipment and storage medium
CN114579762A (en) * 2022-03-04 2022-06-03 腾讯科技(深圳)有限公司 Knowledge graph alignment method, device, equipment, storage medium and program product
CN114579762B (en) * 2022-03-04 2024-03-22 腾讯科技(深圳)有限公司 Knowledge graph alignment method, device, equipment, storage medium and program product
CN115168599A (en) * 2022-06-20 2022-10-11 北京百度网讯科技有限公司 Multi-triple extraction method, device, equipment, medium and product
CN115168620A (en) * 2022-09-09 2022-10-11 之江实验室 Self-supervision joint learning method oriented to knowledge graph entity alignment
CN115982386A (en) * 2023-02-13 2023-04-18 创意信息技术股份有限公司 Automatic generation method for enterprise metadata explanation
CN116090360A (en) * 2023-04-12 2023-05-09 安徽思高智能科技有限公司 RPA flow recommendation method based on multi-modal entity alignment
CN116128056A (en) * 2023-04-18 2023-05-16 安徽思高智能科技有限公司 RPA-oriented multi-modal interaction entity alignment method
CN117407689A (en) * 2023-12-14 2024-01-16 之江实验室 Entity alignment-oriented active learning method and device and electronic device
CN117407689B (en) * 2023-12-14 2024-04-19 之江实验室 Entity alignment-oriented active learning method and device and electronic device

Also Published As

Publication number Publication date
CN113656596B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN113407759B (en) Multi-modal entity alignment method based on adaptive feature fusion
CN113656596B (en) Multi-modal entity alignment method based on triple screening fusion
Qi et al. Attentive relational networks for mapping images to scene graphs
CN112434169B (en) Knowledge graph construction method and system and computer equipment thereof
CN111737551B (en) Dark network cable detection method based on special-pattern attention neural network
Kumar Knowledge discovery in data using formal concept analysis and random projections
CN110674850A (en) Image description generation method based on attention mechanism
KR102223382B1 (en) Method and apparatus for complementing knowledge based on multi-type entity
CN108647800B (en) Online social network user missing attribute prediction method based on node embedding
Feng et al. Computational social indicators: a case study of chinese university ranking
US20140047091A1 (en) System and method for supervised network clustering
CN114020928A (en) False news identification method based on heterogeneous graph comparison learning
Chu et al. Variational cross-network embedding for anonymized user identity linkage
CN115827968A (en) Individualized knowledge tracking method based on knowledge graph recommendation
Huang et al. Global-local fusion based on adversarial sample generation for image-text matching
CN108509949B (en) Target detection method based on attention map
CN106844338B (en) method for detecting entity column of network table based on dependency relationship between attributes
CN111191059B (en) Image processing method, device, computer storage medium and electronic equipment
Autio et al. On the neural network classification of medical data and an endeavour to balance non-uniform data sets with artificial data extension
Zhao et al. Multi-label node classification on graph-structured data
Han et al. GA-GWNN: Detecting anomalies of online learners by granular computing and graph wavelet convolutional neural network
Vijaya et al. LionRank: lion algorithm-based metasearch engines for re-ranking of webpages
Gao et al. Constrained Local Latent Variable Discovery.
CN116306834A (en) Link prediction method based on global path perception graph neural network model
Dua et al. Generative context pair selection for multi-hop question answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant