CN113656596A - Multi-modal entity alignment method based on triple screening fusion - Google Patents

Multi-modal entity alignment method based on triple screening fusion

Info

Publication number
CN113656596A
CN113656596A
Authority
CN
China
Prior art keywords
entity
picture
sim
entities
similarity
Prior art date
Legal status
Granted
Application number
CN202110950895.5A
Other languages
Chinese (zh)
Other versions
CN113656596B (en)
Inventor
唐九阳
郭浩
赵翔
曾维新
刘丽
郭延明
肖卫东
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110950895.5A
Publication of CN113656596A
Application granted
Publication of CN113656596B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal entity alignment method based on triple screening fusion, which comprises the following steps: acquiring data of two multi-modal knowledge graphs; quantifying the importance of the triples with an unsupervised triple screening module and filtering out some invalid triples based on the importance scores; learning the structure vectors of the entities of the two multi-modal knowledge graphs with a graph convolutional neural network to generate the structural feature representation of each entity; generating the visual feature representation of each entity; and combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to perform entity alignment. To address the poor utilization of visual information, entity-picture similarity scores are calculated and a more accurate entity visual feature representation is obtained based on the similarity; triple scores are generated from the PageRank scores and the entity degrees, the triples are filtered, and the structural differences between different knowledge graphs are mitigated, so that the alignment effect is better.

Description

Multi-modal entity alignment method based on triple screening fusion
Technical Field
The invention relates to the technical field of knowledge graphs in natural language processing, and in particular to a multi-modal entity alignment method based on triple screening fusion.
Background
In recent years, knowledge graphs have become a widely used representation of structured data. A knowledge graph represents real-world knowledge or events in the form of triples and is widely used in various artificial intelligence downstream tasks. At present, multi-modal knowledge graphs are often constructed from limited data sources and suffer from information loss and low coverage, so the knowledge utilization rate is not high. Considering that manually completing a knowledge graph is costly and inefficient, one possible approach to improve coverage is to automatically integrate useful knowledge from other knowledge graphs. Entities serve as the hubs linking different knowledge graphs and are very important for integrating multiple multi-modal knowledge graphs. The process of identifying entities in different multi-modal knowledge graphs that express the same meaning is referred to as multi-modal entity alignment.
Multi-modal entity alignment requires utilizing and fusing information of multiple modalities. However, existing multi-modal entity alignment methods encounter two bottlenecks. First, structural differences between graphs are difficult to handle. Based on the assumption that equivalent entities in different knowledge graphs usually have equivalent neighbor entities, current mainstream entity alignment methods mainly depend on the structural information of the knowledge graph. However, in the real world, knowledge graphs built in different ways may have large structural differences. For such problems, triples can be generated by link prediction to enrich the structural information; although this alleviates the structural diversity problem to some extent, the reliability of the generated triples must be considered, and completion is difficult when the numbers of triples differ by several times. Second, visual information is poorly utilized. Current automated methods for constructing multi-modal knowledge graphs typically complement an existing knowledge graph with information of other modalities. To obtain visual information, these methods mainly use crawlers to retrieve pictures related to an entity from the internet. However, the retrieved pictures inevitably contain some pictures with low relevance, i.e., noise pictures. Current methods cannot distinguish the noise pictures among the pictures related to an entity, so some noise is mixed into the visual information of the entity, which further reduces the accuracy of the visual information for entity alignment.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention discloses a multi-modal entity alignment method based on triple screening fusion.
The technical scheme of the invention is that a multi-modal entity alignment method based on triple screening fusion comprises the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triplet screening module, and filtering part of invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relationship;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities;
in the triple screening module, a relation-entity graph, also called the relational dual graph of the knowledge graph, is constructed by taking relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set; the relational dual graph G_r takes relations as nodes, and if two different relations are connected by the same entity, an edge exists between the two relation nodes; V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r); based on the relational dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
[Equation: the triple scoring function Score(h, r, t), which combines the PageRank score PR(r) of the relation with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, namely the number of edges associated with each entity; based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
Specifically, the visual feature processing module in step 4 comprises: step 301, generating picture-entity similarities by using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter out noise pictures; and step 303, giving each picture a corresponding weight based on its similarity to the entity to generate the visual feature representation of the entity.
Further, in step 301, a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, is used to calculate the similarity score of each picture in the entity picture set; the inputs of the CVSE model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {a photo of [Entity Name]}; the picture embeddings and the text information are then fed into the CVSE model to obtain the similarity scores of the pictures in the entity image set:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter out noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, n' is the number of pictures after noise removal, and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
Specifically, the structural feature learning module in step 3 captures entity neighborhood structure information by using a graph convolutional neural network and generates the entity structural feature representation:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, they need to be mapped into the same space by using the known entity pairs S, and the specific training goal is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
Further, in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively, which are fixed weights or random weights.
Preferably, in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively;
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of the entity, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
Compared with the prior art, the method has the following advantages: to address the poor utilization of visual information, this work calculates entity-picture similarity scores based on a pre-trained image-text matching model, filters out noise pictures, and obtains a more accurate entity visual feature representation based on the similarity scores; the structural features and visual features of an entity are fused with variable attention, making full use of the complementarity of multi-modal information and improving the alignment effect; and an innovative triple screening mechanism is designed, which generates triple scores from the PageRank scores and the entity degrees, filters the triples, and mitigates the structural differences between different knowledge graphs.
Drawings
FIG. 1 shows a schematic flow diagram of an embodiment of the invention;
FIG. 2 illustrates a multi-modal entity alignment framework diagram of an embodiment of the present invention;
FIG. 3 shows a schematic flow chart of a visual feature processing module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 shows a multi-modal entity alignment method based on triplet screening fusion, comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triplet screening module, and filtering part of invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relationship;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
and step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities.
A multi-modal knowledge graph typically contains information of multiple modalities. Without loss of generality, this work focuses only on the structural and visual information of the knowledge graph. Two multi-modal knowledge graphs are given: MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), where E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities. The seed entity pair set S = {(e_1, e_2) | e_1 ∈ E_1, e_2 ∈ E_2, e_1 = e_2} represents the set of aligned entity pairs used for training. The multi-modal entity alignment task aims to find new entity pairs using the known entity pair information and to predict the potential alignment results {(e_1, e_2) | e_1 ∈ E_1, e_2 ∈ E_2, e_1 = e_2}, wherein the equal sign indicates that the two entities refer to the same entity in the real world.
Given an entity, the process of finding its corresponding entity in another knowledge graph can be regarded as a ranking problem. That is, in a certain feature space, the similarity (distance) of the given entity to all entities in the other knowledge graph is calculated and the candidates are ranked; the entity with the highest similarity (smallest distance) is taken as the alignment result.
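For illustration only, the following minimal sketch (not part of the patented embodiment) shows this ranking view of alignment with NumPy; the embedding matrices and the choice of negative Manhattan distance as the similarity measure are assumptions made for the example.

```python
# Illustrative sketch: entity alignment as a ranking problem.
import numpy as np

def align_by_ranking(emb1: np.ndarray, emb2: np.ndarray, top_k: int = 10):
    """emb1: (n1, d) entity embeddings of KG1; emb2: (n2, d) entity embeddings of KG2,
    assumed to lie in a shared space."""
    # Negative Manhattan (L1) distance as the similarity measure (an assumption;
    # cosine similarity would work just as well for the sketch).
    sim = -np.abs(emb1[:, None, :] - emb2[None, :, :]).sum(axis=-1)   # (n1, n2)
    ranking = np.argsort(-sim, axis=1)     # KG2 candidates sorted by descending similarity
    best = ranking[:, 0]                   # top-1 candidate = predicted aligned entity
    return sim, ranking[:, :top_k], best
```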
As shown in FIG. 2, the present invention first designs a multi-modal entity alignment framework: the structure vectors of the entities are learned with a graph convolutional neural network to generate the entity structural features; a visual feature processing module is designed to generate the entity visual features; and the information of the two modalities is then combined for entity alignment based on an adaptive feature fusion mechanism. In addition, to mitigate the structural differences between knowledge graphs, this embodiment designs a triple screening mechanism that integrates the relation score and the entity degree and filters out some of the triples. In FIG. 2, MG_1 and MG_2 represent different multi-modal knowledge graphs; KG_1 and KG_2 represent knowledge graphs; and KG_1' represents the knowledge graph after processing by the triple screening module.
Visual feature processing module
To solve the problem of poor visual information utilization in multi-modal entity alignment methods, and inspired by image-text matching models, this work designs a visual feature processing module that generates more accurate visual features for entities to help entity alignment. FIG. 3 details the generation of the entity visual features. In the absence of supervision data, the picture-entity similarities are generated by the pre-trained image-text matching model CVSE; a similarity threshold is then set to filter out noise pictures; and each picture is given a corresponding weight based on its similarity score, finally generating the visual feature representation of the entity.
A picture-entity similarity score is calculated. This step uses a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, to calculate a similarity score for each picture in the entity picture set; the model parameters were obtained by training on the MSCOCO and Flickr30k datasets. The inputs of the model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN. The entity text information t_i input to the model is obtained by expanding the entity name [Entity Name] into a sentence: t_i = {a photo of [Entity Name]}.
The picture embeddings and the text information are then fed into the CVSE model, and the similarity scores of the pictures in the entity image set are obtained:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity.
The noise pictures are then filtered. The method considers that the entity picture set contains some pictures with low similarity, which affect the precision of the visual information. In view of this, a similarity threshold α is set to filter out the noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
where set(i) represents the initial picture set and set(i)' represents the picture set after the noise pictures are filtered out.
The entity visual feature representation is then generated. The picture filtering mechanism yields the entity picture set; weights are given based on the picture similarity scores, and finally a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, and n' is the number of pictures after noise removal. Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
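As a concrete illustration, the following is a minimal sketch of this module under stated assumptions: the array sim_v stands in for the CVSE picture-entity similarity scores and img_feats for the ResNet picture features, both of which would come from the pre-trained models.

```python
# Sketch of the visual feature processing module: filter noise pictures by a
# similarity threshold, then build the entity visual feature as an
# attention-weighted sum of the remaining picture features.
import numpy as np

def entity_visual_feature(sim_v: np.ndarray, img_feats: np.ndarray, alpha: float) -> np.ndarray:
    """sim_v: (n,) similarity of each picture to the entity; img_feats: (n, 2048)."""
    keep = sim_v > alpha                              # step 302: filter out noise pictures
    if not keep.any():                                # fallback (assumption) if all pictures are filtered
        keep = np.ones_like(sim_v, dtype=bool)
    sim_kept, feats_kept = sim_v[keep], img_feats[keep]
    att = np.exp(sim_kept) / np.exp(sim_kept).sum()   # step 303: softmax attention weights
    return att @ feats_kept                           # (2048,) visual feature V_i
```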
Structural feature learning module
This embodiment employs a graph convolutional neural network (GCN) to capture entity neighborhood structure information and generate the entity structure representation vectors. The GCN is a convolutional network that acts directly on graph-structured data and generates the corresponding node structure vectors by capturing the structural information around each node:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l.
Since the entity structure vectors of different knowledge graphs are not in the same space, it is necessary to map them into the same space using the known entity pairs S. The training objective is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; and γ represents the margin separating positive and negative samples. Model optimization is performed with stochastic gradient descent.
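The following PyTorch sketch illustrates this module under stated assumptions (the layer sizes, the two-layer network, and the one-negative-per-positive pairing are illustrative choices, not taken from the patent):

```python
# Sketch of the structural feature learning module: a GCN over the normalized
# adjacency matrix and a margin-based alignment loss with Manhattan (L1) distance.
import torch
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, dim: int = 300, layers: int = 2):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(dim, dim)) for _ in range(layers)])
        for w in self.weights:
            nn.init.xavier_uniform_(w)

    def forward(self, a_norm: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # a_norm: pre-computed normalized adjacency D^-1/2 (A + I) D^-1/2; h: node features
        for w in self.weights:
            h = torch.relu(a_norm @ h @ w)
        return h

def alignment_loss(h1, h2, pos_pairs, neg_pairs, gamma: float = 3.0):
    """Margin loss: seed pairs should be closer (L1) than corrupted pairs.
    pos_pairs, neg_pairs: (m, 2) long tensors of indices, assumed aligned one-to-one."""
    d_pos = (h1[pos_pairs[:, 0]] - h2[pos_pairs[:, 1]]).abs().sum(-1)
    d_neg = (h1[neg_pairs[:, 0]] - h2[neg_pairs[:, 1]]).abs().sum(-1)
    return torch.relu(d_pos + gamma - d_neg).mean()
```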
A multi-modal knowledge graph contains information of at least two modalities, and multi-modal entity alignment needs to fuse the information of the different modalities. Existing approaches combine the different embeddings into one unified representation space, which requires additional training to uniformly represent unrelated features. A preferable strategy is to first compute a similarity matrix within each feature-specific space and then combine the feature similarity scores.
Formally, a structural feature representation S and a visual feature representation V are given. For each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity scores are then used to predict potential alignment entities. To calculate the overall similarity, the feature-specific similarity scores between the entity pair, i.e., SIM_s(e_1, e_2) and SIM_v(e_1, e_2), are first computed. These similarity scores are then combined:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein Att_s and Att_v represent the contribution-rate weights of the structural information and the visual information, respectively; the weights can be fixed weights, random weights, or calculated weights.
The features of different modalities characterize an entity from different perspectives and have some correlation and complementarity. Current methods combine the structural information and the visual information with fixed contribution-rate weights and ignore the difference in the contribution rate of structural information across entities. For entities with poor structural information, the visual feature representation should be trusted more. Moreover, intuitively, the closeness of the association between an entity and the seed entities is positively correlated with the accuracy of its structural features.
To capture the dynamic change in the contribution rates of the different modalities, and inspired by a degree-aware joint attention mechanism, an adaptive feature fusion mechanism is further designed that combines the entity degree with the closeness of the association between the entity and the seed entities:
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
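A minimal sketch of this fusion step follows. The exact expression for Att_s is given only as an image in the original, so the logistic form used below in (degree + N_hop), with hyper-parameters K, a and b, is purely an illustrative assumption.

```python
# Sketch of adaptive feature fusion: an entity with richer structure (higher
# degree, closer to the seeds) gets a larger structural weight Att_s.
import numpy as np

def n_hop(n_1hop: int, n_2hop: int, w1: float = 0.8, w2: float = 0.1) -> float:
    # closeness to the seed entities: N_hop = n_1-hop * w1 + lg(n_2-hop * w2)
    return n_1hop * w1 + np.log10(max(n_2hop * w2, 1e-9))

def fuse_similarity(sim_s: float, sim_v: float, degree: int, nhop: float,
                    K: float = 0.6, a: float = 1.0, b: float = 1.5) -> float:
    att_s = K / (1.0 + b * np.exp(-a * (degree + nhop)))   # assumed form, not from the patent
    att_v = 1.0 - att_s
    return sim_s * att_s + sim_v * att_v
```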
Further, before performing step 3 to obtain the structural feature representation and step 4 to obtain the visual feature representation, the importance of the triples (h, r, t) is quantified using an unsupervised triple screening module, and some invalid triples are filtered out based on the importance scores.
The structural information of the knowledge graph is represented as triples (h, r, t), wherein h represents a head entity, t represents a tail entity, and r represents a relation. The numbers of triples in different knowledge graphs can differ greatly, which substantially degrades the effect of entity alignment based on structural information. To mitigate the structural differences between different knowledge graphs, this work designs an unsupervised triple screening module that quantifies the importance of the triples and filters out some invalid triples based on the importance scores. The triple importance score incorporates the PageRank score of the relation r and the degrees of the entities h and t.
The PageRank score is calculated first. A relation-entity graph, also called the relational dual graph of the knowledge graph, is constructed with relations as nodes and entities as edges. The knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set. The relational dual graph G_r takes relations as nodes; if two different relations are connected by the same head entity (or tail entity), an edge exists between the two relation nodes. V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r).
Based on the relational dual graph generated above, this embodiment calculates the relation scores using the PageRank algorithm. PageRank is a representative algorithm for link analysis on graph data and belongs to the unsupervised learning methods. Its basic idea is to define a random walk model on a directed graph, describing a random walker visiting nodes along the edges of the graph. Under certain conditions, the probability of visiting each node converges in the limit to a stationary distribution, and the stationary probability of each node is its PageRank value, which represents the importance of the node. Inspired by this algorithm, the PageRank value of each relation is calculated on the relational dual graph of the knowledge graph to represent the importance of the relation:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections (i.e., the degree) of relation v.
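For illustration, a minimal sketch of this step is given below; it builds the relational dual graph from the triples and computes the relation scores with the PageRank implementation in networkx (the use of networkx is an assumption made for the sketch).

```python
# Sketch of relation scoring: relations become nodes of the dual graph, and two
# relations are connected when they share an entity; PageRank then scores them.
import itertools
from collections import defaultdict
import networkx as nx

def relation_pagerank(triples):
    """triples: iterable of (head, relation, tail)."""
    relations_of_entity = defaultdict(set)
    g_r = nx.Graph()
    for h, r, t in triples:
        g_r.add_node(r)
        relations_of_entity[h].add(r)
        relations_of_entity[t].add(r)
    for rels in relations_of_entity.values():
        for r1, r2 in itertools.combinations(rels, 2):
            g_r.add_edge(r1, r2)        # the two relations are connected by the same entity
    return nx.pagerank(g_r)             # dict: relation -> PageRank score PR(r)
```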
The triple scoring mechanism. Triple screening aims, on the one hand, to filter out redundant or invalid relations and, on the other hand, to protect the structural characteristics of the knowledge graph. Since long-tail entities lacking structural information have only a few related triples, directly filtering out a relation based solely on the relation importance score would aggravate the lack of structural information of the long-tail entities. Therefore, this embodiment provides two triple scoring functions. The first directly adopts the PageRank score as the triple scoring function:
Score(h,r,t)=PR(r),
and the other adopts an improved PageRank score, combining the PageRank score of the relation with the degrees of the head and tail entities, to design the triple scoring function:
[Equation: the improved triple scoring function Score(h, r, t), which combines PR(r) with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, i.e., the number of edges associated with each entity. Based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
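The screening step itself can be sketched as follows; the default Score(h, r, t) = PR(r) is the first scoring function disclosed above, while the improved variant (shown only as an image in the original) can be passed in as a custom score_fn.

```python
# Sketch of triple screening: keep only triples whose score exceeds the threshold beta.
def filter_triples(triples, pr_scores, beta, score_fn=None):
    """triples: list of (h, r, t); pr_scores: dict relation -> PR(r)."""
    if score_fn is None:
        # default scoring rule Score(h, r, t) = PR(r)
        score_fn = lambda h, r, t: pr_scores.get(r, 0.0)
    return [(h, r, t) for (h, r, t) in triples if score_fn(h, r, t) > beta]

# Example usage (illustrative), with relation_pagerank from the sketch above:
# refined = filter_triples(triples, relation_pagerank(triples), beta=1e-3)
```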
In the experiments, this example uses the MMKG datasets, which were extracted from the knowledge bases FreeBase, DBpedia and YAGO. These datasets are based on FB15K; entities in FB15K are aligned with their equivalent entities in the other knowledge graphs using SameAs links between the knowledge graphs, thereby generating DB15K and YAGO15K. The experiments herein are performed on two pairs of multi-modal knowledge graphs, FB15K-DB15K and FB15K-YAGO15K.
Since the data set does not provide pictures, in order to obtain entity-related pictures, this embodiment uses URI data and designs a web crawler to parse query results from Image Search engines (i.e., Google Images, Bing Images, and Yahoo Image Search). Then, pictures obtained by different search engines are distributed to different MMKGs. In order to simulate the construction process of a real-world multi-modal knowledge graph, pictures with high similarity in an equivalent entity image set are removed, and a certain number of noise pictures are introduced. Table 1 describes the details of the data set. In experiments, pairs of known equivalent entities are used for model training and testing.
TABLE 1 Multi-modal knowledge graph statistics
[Table 1 is provided as an image in the original document.]
Evaluation metrics are as follows: the experiments use Hits@k (k = 1, 10) and Mean Reciprocal Rank (MRR) as evaluation metrics. For each entity in the test set, the entities in the other graph are ranked in descending order of their similarity score to that entity. Hits@k represents the percentage of test entities for which the correct entity appears among the top k candidates. MRR represents the mean of the reciprocal ranks of the correctly aligned entities. Hits@1 represents the alignment accuracy and is the most important metric, while Hits@10 and MRR provide supplementary information. Note that higher values of Hits@k and MRR indicate better performance, and the Hits@k results are expressed as percentages. The best results are marked in bold in the tables.
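A minimal sketch of how these metrics might be computed from a similarity matrix is given below (the array names are illustrative).

```python
# Sketch of Hits@k and MRR: rank candidates by descending similarity and record the
# 1-based rank of the true counterpart of each test entity.
import numpy as np

def hits_and_mrr(sim: np.ndarray, gold: np.ndarray, ks=(1, 10)):
    """sim: (n_test, n_candidates) similarity matrix; gold[i]: index of the correct entity."""
    order = np.argsort(-sim, axis=1)                              # descending similarity
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    hits = {k: float((ranks <= k).mean()) * 100.0 for k in ks}    # Hits@k as percentages
    mrr = float((1.0 / ranks).mean())
    return hits, mrr
```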
The experiments use the graph convolutional neural network to generate the entity structural features, with the number of negative samples set to 15, γ set to 3, 400 training rounds, and dimension d_s = 300; the visual features are generated by the visual feature processing module with dimension d_v = 2048. The seed entity proportion is set to 20% and 50%, and 10% of the entities are selected as a validation set for tuning the hyper-parameters in the formulas: b = 1.5 and a = 1; the value of the parameter K is related to the seed entity proportion, taking 0.6 when seed = 0.2 and 0.8 when seed = 0.5. The hyper-parameters w_1 and w_2 are set to 0.8 and 0.1, respectively.
TABLE 2 Multi-modal entity alignment results
[Table 2 is provided as an image in the original document.]
The method of this embodiment, and a variant of it with the triple screening module removed, are compared with two methods: (1) GCN-align, which uses a GCN to generate the entity structural and visual feature matrices and combines the two features with fixed weights for entity alignment; and (2) HMEA, which uses a hyperbolic graph convolutional neural network (HGCN) to generate the structural and visual feature matrices of the entities and combines the structural and visual features by weight in hyperbolic space for entity alignment. The method of this embodiment achieves the best multi-modal entity alignment performance to date.
In addition, to verify the effectiveness of the proposed triple screening module, three screening mechanisms, F_PageRank, F_Random and F_our, are compared, which respectively denote direct PageRank-score screening, random screening, and improved PageRank-score screening. To control the experimental variables, the same number of triples, about 290,000, was screened with each of the three mechanisms; the structural features are learned based on the graph convolutional neural network, and all parameters are kept consistent.
The experimental results show that, compared with the baseline that retains all triples, random screening F_Random increases Hits@1 by about 1.5% and 2.5% when seed = 0.2 and 0.5, respectively, indicating that the structural differences between graphs do affect entity alignment. Compared with random screening, the PageRank-score-based screening mechanism improves results by about 3% when the seed entity proportion is 50%. According to the results, the improved PageRank-score screening mechanism obtains the best alignment results: its Hits@1 improves over the baseline by more than 8% and 3%, respectively, on FB15K-DB15K, and by about 9% and 5%, respectively, on FB15K-YAGO15K.
Since the richness of structural information is related to the degree of an entity, the entities are divided into three groups according to their degree, and the accuracy of multi-modal entity alignment under the adaptive fusion mechanism and the fixed-weight mechanism of this embodiment is tested on each group. The seed entity ratio is set to 20%, the experiments are carried out on FB15K-DB15K and FB15K-YAGO15K, and the remaining parameters are consistent with the experiments above.
Table 3 shows the multi-modal entity alignment results of adaptive feature fusion and fixed-weight fusion, wherein Fixed and Adaptive represent the fixed-weight fusion mechanism and the adaptive feature fusion mechanism, respectively; Group1, Group2 and Group3 respectively represent the first 1/3, middle 1/3 and last 1/3 of the entities, divided in order of increasing entity degree. As can be seen from Table 3, the adaptive feature fusion mechanism achieves a better entity alignment effect than fixed-weight fusion on all groups of entities. The improvement on Group1 is clearly higher than that on Group2 and Group3, which shows that the adaptive feature fusion mechanism of this embodiment can significantly improve the alignment accuracy of entities with sparse structural information, i.e., long-tail entities.
TABLE 3 Multi-modal entity alignment results of adaptive feature fusion and fixed-weight fusion
[Table 3 is provided as an image in the original document.]
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (6)

1. A multi-modal entity alignment method based on triplet screening fusion is characterized by comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs, MG_1 = (E_1, R_1, T_1, I_1) and MG_2 = (E_2, R_2, T_2, I_2), wherein E represents an entity set; R represents a relation set; T represents a triple set, which is a subset of E × R × E; and I represents the set of pictures associated with the entities;
step 2, quantifying the importance of the triples (h, r, t) by using an unsupervised triple screening module, and filtering out some invalid triples based on the importance scores, wherein h represents a head entity, t represents a tail entity, and r represents a relation;
step 3, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 4, respectively generating visual feature representations of the respective entities in a visual feature processing module;
step 5, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to align the entities;
in the triple screening module, a relation-entity graph, also called the relational dual graph of the knowledge graph, is first constructed with relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), where V_e is the entity set and E_e is the relation set; the relational dual graph G_r takes relations as nodes, and if two different relations are connected by the same entity, an edge exists between the two relation nodes; V_r is the set of relation nodes, E_r is the set of edges, and the relational dual graph is G_r = (V_r, E_r); based on the relational dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbor relations of relation r; and for a relation v ∈ B_r, L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
[Equation: the triple scoring function Score(h, r, t), which combines the PageRank score PR(r) of the relation with the degrees d_h and d_t of the head and tail entities; given as an image in the original document.]
wherein d_h and d_t respectively represent the degrees of the head and tail entities, namely the number of edges associated with each entity; based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
2. The method of claim 1, wherein the visual feature processing module of step 4 comprises: step 301, generating picture-entity similarities using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter out noise pictures; and step 303, giving each picture a corresponding weight based on its similarity to the entity to generate the visual feature representation of the entity.
3. The method as claimed in claim 2, wherein in step 301, a pre-trained image-text matching model, the pre-trained Consensus-aware Visual Semantic Embedding model CVSE, is used to calculate the similarity score of each picture in the entity picture set; the inputs of the CVSE model are the picture embeddings p_i and the text information t_i of entity e_i, wherein the picture embeddings p_i ∈ n × 36 × 2048, n is the number of pictures in the picture set corresponding to the entity, and 36 × 2048 is the feature vector dimension generated for each picture by the pre-trained object detection algorithm Faster R-CNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {a photo of [Entity Name]}; the picture embeddings and the text information are then fed into the CVSE model to obtain the similarity scores of the pictures in the entity image set:
Sim_v = CVSE(p_i; t_i),
wherein the Softmax layer of CVSE is removed; with the picture embeddings p_i and the text information t_i as inputs, the model generates the similarity scores Sim_v ∈ n × 1 of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter out noise pictures:
set(i)' = { j' | j' ∈ set(i), Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = I'_i × Att_i,
wherein V_i ∈ 1 × 2048 represents the visual features of entity i; I'_i ∈ n' × 2048 is the image features generated by the ResNet model, n' is the number of pictures after noise removal, and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
4. The method according to claim 2 or 3, wherein the structural feature learning module in step 3 captures entity neighborhood structure information by using a graph convolutional neural network and generates the entity structural feature representation:
H^(l+1) = σ( D̂^(-1/2) Â D̂^(-1/2) H^l W^l ),
wherein H^l and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D̂^(-1/2) Â D̂^(-1/2) represents the normalized adjacency matrix, D̂ is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, and if a relationship exists between entity i and entity j, then A_ij = 1; I denotes the identity matrix, the activation function σ is set to ReLU, and W^l is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, they need to be mapped into the same space by using the known entity pairs S, and the specific training goal is to minimize the following loss:
L = Σ_{(e_1,e_2)∈S} Σ_{(e_1',e_2')∈S'} [ ‖h_{e_1} - h_{e_2}‖_1 + γ - ‖h_{e_1'} - h_{e_2'}‖_1 ]_+ ,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated from a known seed entity pair (e_1, e_2) by replacing e_1 or e_2 with a random entity; h_e represents the structure vector of entity e; ‖h_{e_1} - h_{e_2}‖_1 represents the Manhattan distance between entities e_1 and e_2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
5. The method according to claim 4, wherein in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively, which are fixed weights or random weights.
6. The method according to claim 4, wherein in step 5, for each entity pair (e_1, e_2), e_1 ∈ MG_1, e_2 ∈ MG_2, the similarity between e_1 and e_2 is calculated, and the similarity score is used to predict potential alignment entities, wherein the similarity score is as follows:
SIM(e_1, e_2) = SIM_s(e_1, e_2) × Att_s + SIM_v(e_1, e_2) × Att_v,
wherein SIM_s(e_1, e_2) and SIM_v(e_1, e_2) represent the similarities of the structural and visual feature representations of the entities, respectively, and Att_s and Att_v represent the contribution-rate weights of the structural feature representation and the visual feature representation, respectively;
[Equation: Att_s, computed from the hyper-parameters K, b and a, the entity degree, and N_hop; given as an image in the original document.]
Att_v = 1 - Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of the entity, and N_hop represents the closeness of the entity to the seed entities:
N_hop = n_1-hop × w_1 + lg(n_2-hop × w_2),
wherein n_1-hop and n_2-hop respectively represent the numbers of seed entities 1 hop and 2 hops away from the entity; w_1 and w_2 are hyper-parameters.
CN202110950895.5A 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion Active CN113656596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950895.5A CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110950895.5A CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Publications (2)

Publication Number Publication Date
CN113656596A true CN113656596A (en) 2021-11-16
CN113656596B CN113656596B (en) 2022-09-20

Family

ID=78481112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950895.5A Active CN113656596B (en) 2021-08-18 2021-08-18 Multi-modal entity alignment method based on triple screening fusion

Country Status (1)

Country Link
CN (1) CN113656596B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941722A (en) * 2019-10-12 2020-03-31 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment
CN110955780A (en) * 2019-10-12 2020-04-03 中国人民解放军国防科技大学 Entity alignment method for knowledge graph
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZENG Weixin et al., "Iterative entity alignment via re-ranking", Journal of Computer Research and Development *
DU Wenqian et al., "Knowledge graph representation learning method fusing entity description and type", Journal of Chinese Information Processing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064926A (en) * 2021-11-24 2022-02-18 国家电网有限公司大数据中心 Multi-modal power knowledge graph construction method, device, equipment and storage medium
CN114579762A (en) * 2022-03-04 2022-06-03 腾讯科技(深圳)有限公司 Knowledge graph alignment method, device, equipment, storage medium and program product
CN114579762B (en) * 2022-03-04 2024-03-22 腾讯科技(深圳)有限公司 Knowledge graph alignment method, device, equipment, storage medium and program product
CN115168599A (en) * 2022-06-20 2022-10-11 北京百度网讯科技有限公司 Multi-triple extraction method, device, equipment, medium and product
CN115168620A (en) * 2022-09-09 2022-10-11 之江实验室 Self-supervision joint learning method oriented to knowledge graph entity alignment
CN115982386A (en) * 2023-02-13 2023-04-18 创意信息技术股份有限公司 Automatic generation method for enterprise metadata explanation
CN116090360A (en) * 2023-04-12 2023-05-09 安徽思高智能科技有限公司 RPA flow recommendation method based on multi-modal entity alignment
CN116128056A (en) * 2023-04-18 2023-05-16 安徽思高智能科技有限公司 RPA-oriented multi-modal interaction entity alignment method
CN117407689A (en) * 2023-12-14 2024-01-16 之江实验室 Entity alignment-oriented active learning method and device and electronic device
CN117407689B (en) * 2023-12-14 2024-04-19 之江实验室 Entity alignment-oriented active learning method and device and electronic device

Also Published As

Publication number Publication date
CN113656596B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN113407759B (en) Multi-modal entity alignment method based on adaptive feature fusion
CN113656596B (en) Multi-modal entity alignment method based on triple screening fusion
Qi et al. Attentive relational networks for mapping images to scene graphs
CN112434169B (en) Knowledge graph construction method and system and computer equipment thereof
CN111737551B (en) Dark network cable detection method based on special-pattern attention neural network
Kumar Knowledge discovery in data using formal concept analysis and random projections
CN110674850A (en) Image description generation method based on attention mechanism
KR102223382B1 (en) Method and apparatus for complementing knowledge based on multi-type entity
CN108647800B (en) Online social network user missing attribute prediction method based on node embedding
Feng et al. Computational social indicators: a case study of chinese university ranking
US20140047091A1 (en) System and method for supervised network clustering
CN114020928A (en) False news identification method based on heterogeneous graph comparison learning
Chu et al. Variational cross-network embedding for anonymized user identity linkage
CN115827968A (en) Individualized knowledge tracking method based on knowledge graph recommendation
Huang et al. Global-local fusion based on adversarial sample generation for image-text matching
CN108509949B (en) Target detection method based on attention map
CN106844338B (en) method for detecting entity column of network table based on dependency relationship between attributes
CN111191059B (en) Image processing method, device, computer storage medium and electronic equipment
Autio et al. On the neural network classification of medical data and an endeavour to balance non-uniform data sets with artificial data extension
Zhao et al. Multi-label node classification on graph-structured data
Han et al. GA-GWNN: Detecting anomalies of online learners by granular computing and graph wavelet convolutional neural network
Vijaya et al. LionRank: lion algorithm-based metasearch engines for re-ranking of webpages
Gao et al. Constrained Local Latent Variable Discovery.
CN116306834A (en) Link prediction method based on global path perception graph neural network model
Dua et al. Generative context pair selection for multi-hop question answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant