CN111680488A - Cross-language entity alignment method based on knowledge graph multi-view information - Google Patents
- Publication number: CN111680488A (application CN202010512003.9A)
- Authority: CN (China)
- Prior art keywords: entity, language, vector, text, description
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F40/189: Handling natural language data; text processing; automatic justification
- G06F16/367: Information retrieval; creation of semantic tools; ontology
- G06F40/279: Natural language analysis; recognition of textual entities
- G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045: Neural networks; combinations of networks
- G06N3/08: Neural networks; learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a cross-language entity alignment method based on knowledge-graph multi-view information. First, a structure graph and a text graph are constructed from the relation triples and entity description texts of the two languages' knowledge graphs, and a two-layer graph convolutional network encodes each entity's structure-view and text-view vector representations. Then, from the entity description texts and cross-language corpora, a bidirectional long short-term memory network (BiLSTM) encodes the description-view vector representation. Finally, the vector distances of candidate entity pairs under the three views are combined by a weighted sum to determine the final cross-language aligned entity pairs. The method realizes cross-language entity alignment for knowledge graphs, optimizes entity vector representations using multi-view structural and textual information, and improves the accuracy of cross-language entity alignment.
Description
Technical Field
The invention relates to a cross-language entity alignment method based on knowledge-graph multi-view information, and in particular to a technique that uses convolutional neural networks over knowledge-graph structure and text information to realize cross-language entity alignment.
Background
With the rapid development of the internet and the explosion of online information, information must be structured so that it can be further analyzed, exploited, and put to use in various tasks and scenarios; knowledge graphs emerged to meet this need. A knowledge graph is essentially a large-scale semantic network: a structured knowledge base that formally describes real-world things and the relations among them. Entity alignment determines whether entities with different names, or entities from different sources, refer to the same unique object in the real world. A multilingual knowledge graph generally contains a partial set of cross-language entity links indicating known alignments; starting from these known aligned pairs, cross-language entity alignment techniques can discover further alignment relations, enriching the knowledge graph and enabling subsequent cross-language tasks.
For the cross-language entity alignment task, traditional academic approaches include methods based on rules and similarity calculation and methods based on machine learning. With the introduction of deep learning and its deepening adoption in natural language processing, alignment methods based on entity embeddings and deep neural networks have become mainstream. However, most such methods rely only on the structured data of the knowledge graph, typically comparing and computing over attribute triples and relation triples, and cannot effectively exploit text information to improve entity alignment.
Disclosure of Invention
The invention aims to encode knowledge-graph entity representations from multiple views using both the structural and the textual information of cross-language knowledge graphs, thereby improving the effect of cross-language entity alignment.
The purpose of the invention is realized by the following technical scheme: a cross-language entity alignment method based on knowledge-graph multi-view information encodes entity structure vectors, entity text vectors, and entity description vectors, computes the distances between entities, and finds cross-language aligned entity pairs. The method comprises the following steps:
1) Entity structure vector encoding based on relation triples: a structure graph is constructed for each of the two languages' knowledge graphs from their relation triples. The structure graph takes entities as nodes and places an edge between entities that share a relation; the edge weights are computed from the relations between the entities, forming the adjacency matrix of the structure graph. A two-layer graph convolutional network is trained on the constructed structure graph, continually updating each entity's vector representation from the entity itself and the encodings of its neighbors. The graph convolutional networks of the two knowledge graphs share weight matrices. The entity structure vector representation is optimized with a margin-based (triplet) loss over the pre-aligned cross-language entity pair set S and positive/negative example entity pairs.
2) Entity text vector encoding based on entity description information: the two languages' knowledge graphs are merged, and a unified text graph is constructed from the entities and their description texts. The text graph has two types of nodes, entity nodes and description-word nodes, and three types of edges: "entity-descriptor" edges, within-language "descriptor-descriptor" edges, and cross-language "descriptor-descriptor" edges. Weights are computed for each edge type, forming the adjacency matrix. A two-layer graph convolutional network is trained on the constructed text graph, and the entity text vector representation is optimized with a margin-based (triplet) loss over the pre-aligned cross-language entity pair set S and positive/negative example entity pairs.
3) Entity description vector encoding based on entity description information and cross-language corpora: cross-language-aligned word vectors are pre-trained with BilBOWA on the monolingual corpora of the two languages and a cross-language parallel corpus; the sequence of word vectors of each entity description is then fed as input to a bidirectional long short-term memory network (BiLSTM), which encodes the description into an entity description vector. The network is optimized by minimizing the distances between the description vectors of the pre-aligned cross-language entity pairs S, yielding final description vectors for all entities.
4) Computing cross-language aligned entity pairs from the multi-view entity vectors: for each entity in one language's knowledge graph, every entity of the other language's knowledge graph is taken as a candidate; the distance between the entity and each candidate is computed from the entity structure vectors, entity text vectors, and entity description vectors obtained in steps 1), 2), and 3); the distances are sorted in ascending order, and the candidate pair with the smallest distance is selected as the aligned entity pair.
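Step 4) amounts to a nearest-neighbor search over candidate entities. A minimal sketch (the function name and the pluggable `distance` callback are illustrative, not from the patent):

```python
def align(entities1, entities2, distance):
    """For each entity of one knowledge graph, rank every entity of the
    other graph by distance (ascending) and keep the closest candidate
    as its cross-language aligned counterpart."""
    pairs = []
    for e1 in entities1:
        best = min(entities2, key=lambda e2: distance(e1, e2))
        pairs.append((e1, best))
    return pairs
```

In practice `distance` would be the weighted multi-view distance of step 4); here any pairwise scoring function works.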
Further, in step 1), the weight calculation of the adjacency matrix A, the entity vector calculation in the graph convolutional network, and the loss function are as follows:
1.1) Weight calculation of the adjacency matrix A: for entities $e_i$ and $e_j$, the weight $a_{ij} \in A$ between them is calculated as:

$$fun(r) = \frac{\#Head\_Entities\_of\_r}{\#Triples\_of\_r}, \qquad ifun(r) = \frac{\#Tail\_Entities\_of\_r}{\#Triples\_of\_r}$$

$$a_{ij} = \sum_{\langle e_i, r, e_j\rangle \in G} ifun(r) + \sum_{\langle e_j, r, e_i\rangle \in G} fun(r)$$

where $fun(r)$ and $ifun(r)$ are the influence scores of relation $r$ in the forward and reverse directions respectively, $G$ is the knowledge graph, $\#Triples\_of\_r$ is the number of relation triples involving relation $r$, and $\#Head\_Entities\_of\_r$ and $\#Tail\_Entities\_of\_r$ are the numbers of head entities and tail entities involved in the triples of relation $r$, respectively.
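The influence scores and adjacency weights above can be computed with a single pass over the triples. A rough sketch (the function name is illustrative; the `fun`/`ifun` ratios follow the functionality-style definitions given above):

```python
from collections import defaultdict

def relation_weights(triples):
    """Compute fun(r) and ifun(r) per relation, then accumulate the
    entity-pair weights a_ij from (head, relation, tail) triples:
    each triple <e_i, r, e_j> contributes ifun(r) to a_ij and
    fun(r) to a_ji."""
    count = defaultdict(int)   # #Triples_of_r
    heads = defaultdict(set)   # distinct head entities per relation
    tails = defaultdict(set)   # distinct tail entities per relation
    for h, r, t in triples:
        count[r] += 1
        heads[r].add(h)
        tails[r].add(t)
    fun = {r: len(heads[r]) / count[r] for r in count}
    ifun = {r: len(tails[r]) / count[r] for r in count}

    a = defaultdict(float)
    for h, r, t in triples:
        a[(h, t)] += ifun[r]
        a[(t, h)] += fun[r]
    return fun, ifun, dict(a)
```

A relation with many tail entities per head (e.g. one country, many people "born_in" it) thus gets a lower reverse influence, which damps noisy hub edges.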
1.2) Entity vector calculation in the graph convolutional network: the input of the network is an entity structure feature matrix $X_s \in \mathbb{R}^{n \times d_s}$, obtained by random initialization, where $n$ is the total number of entities and $d_s$ is the dimension of the entity structure feature vector. The overall calculation of the structure graph's convolutional network is:

$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$ adds an identity matrix of equal dimension to the adjacency matrix $A$, injecting each entity's own information; $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$; the weight matrices $W^{(1)}$ and $W^{(2)}$ are diagonal matrices; and the activation function is $\sigma(\cdot) = \mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
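The layer propagation rule above can be sketched with dense NumPy matrices (a toy illustration; real implementations use sparse matrices and learn the weights by backpropagation):

```python
import numpy as np

def gcn_layer(A, H, W, activation=True):
    """One graph-convolution layer:
    sigma(D_hat^{-1/2} (A + I) D_hat^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    d = A_hat.sum(axis=1)                 # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    Z = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(Z, 0.0) if activation else Z   # ReLU
```

A two-layer network as in the method is then `gcn_layer(A, gcn_layer(A, X, W1), W2)`, with both knowledge graphs sharing `W1` and `W2`.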
1.3) Loss function: for an entity pair $p = (e_1, e_2) \in S$ taken as a positive example, a negative example pair $p' = (e'_1, e'_2) \in S'_p$ is constructed by randomly replacing entity $e_1$ or $e_2$, where $S'_p$ is the set of negative example entity pairs; the following objective function is then minimized:

$$L_s = \sum_{p \in S} \sum_{p' \in S'_p} \max\left(0,\; f_s(p) + \gamma_s - f_s(p')\right)$$

where $f_s(p) = \lVert h_s(e_1) - h_s(e_2)\rVert_1$ is the entity distance scoring function computing the Manhattan distance between entity structure vectors, $h_s(e_1)$ and $h_s(e_2)$ are the structure vectors of entities $e_1$ and $e_2$ respectively, and $\gamma_s$ is the margin between structure vectors.
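The margin-based objective can be sketched as follows (a simplification: one negative pair per positive pair instead of the full set $S'_p$; names are illustrative):

```python
import numpy as np

def manhattan(u, v):
    """L1 (Manhattan) distance between two embedding vectors."""
    return float(np.abs(np.asarray(u) - np.asarray(v)).sum())

def margin_loss(pos_pairs, neg_pairs, emb, gamma):
    """Hinge loss max(0, f(p) + gamma - f(p')) summed over positive
    pairs p and their sampled negative pairs p'."""
    loss = 0.0
    for (e1, e2), (n1, n2) in zip(pos_pairs, neg_pairs):
        f_pos = manhattan(emb[e1], emb[e2])
        f_neg = manhattan(emb[n1], emb[n2])
        loss += max(0.0, f_pos + gamma - f_neg)
    return loss
```

The loss is zero once every negative pair is at least `gamma` farther apart than its positive pair, which is exactly the margin constraint described above.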
Further, in step 2), before the knowledge graphs are merged, the entity description information is preprocessed: illegal characters are filtered out, the text is segmented into words, stop words are removed, and words whose frequency in the corpus is too low are filtered out.
Further, in step 2), the weight calculation of the adjacency matrix A, the entity vector calculation in the graph convolutional network, and the loss function are as follows:
2.1) Weight calculation of the adjacency matrix A: the weights of the three edge types and the weight calculation of the text-graph adjacency matrix are as follows:
2.1.1) "entity-descriptor" edge:
for the edge formed by the entity and the descriptor, the weight is calculated by using the word frequency-inverse document frequency (TF-IDF), and the calculation formula is as follows:
TFIDF(t,d)=TF(t,d)×IDF(t)
where TF (t, d) calculates the frequency with which the word t appears in the entity description d, nt,dIs the number of occurrences of the word t in the entity description d, ∑t′∈dnt′,dIDF (t) is the inverse document frequency of the word t in the entity description set D, | D | is the total number of entity descriptions in the entity description set, | { D ∈ D: t ∈ D } | is the number of entity descriptions in the entity description set that contain the word t.
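A direct transcription of the TF-IDF formulas (treating each entity description as a list of tokens; function names are illustrative):

```python
import math
from collections import Counter

def tf(word, doc):
    """TF(t, d) = n_{t,d} / sum_{t' in d} n_{t',d}."""
    counts = Counter(doc)
    return counts[word] / sum(counts.values())

def idf(word, docs):
    """IDF(t) = log(|D| / |{d in D : t in d}|)."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```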
2.1.2) Within-language "descriptor-descriptor" edge:
for the edges formed between the descriptors of the single language, firstly, the global word co-occurrence condition is calculated through a sliding window, then, the Point Mutual Information (PMI) of two words is calculated to obtain the weight, and for any two words i and j, the weight calculation formula is as follows:
where # W represents the number of sliding windows in all entity description corpuses, # W (i) represents the number of sliding windows containing word i, and # W (i, j) represents the number of sliding windows containing both word i and word j.
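The sliding-window PMI computation can be sketched as follows (a sketch assuming, as in common text-graph practice, that only positive PMI values are kept as edge weights; names are illustrative):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_weights(docs, window=3):
    """Slide a window over each tokenized description, count #W, #W(i),
    #W(i, j), and return positive-PMI edge weights per word pair."""
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(set(doc))
        else:
            windows += [set(doc[i:i + window])
                        for i in range(len(doc) - window + 1)]
    total = len(windows)          # #W
    single = Counter()            # #W(i)
    pair = Counter()              # #W(i, j)
    for w in windows:
        for t in w:
            single[t] += 1
        for i, j in combinations(sorted(w), 2):
            pair[(i, j)] += 1
    weights = {}
    for (i, j), n_ij in pair.items():
        pmi = math.log((n_ij / total) /
                       ((single[i] / total) * (single[j] / total)))
        if pmi > 0:               # keep only positively associated pairs
            weights[(i, j)] = pmi
    return weights
```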
2.1.3) Cross-language "descriptor-descriptor" edge:
for the edges formed between the cross-language descriptors, utilizing a pre-aligned cross-language aligning entity pair S, connecting the word in each entity description text with all the words in the description of the aligning entity pairwise, and calculating the frequency of each formed descriptor pair in the descriptor pairs formed by all the aligning entity pairs to enhance the cross-language information. This method is referred to herein using X-DF (Cross Document frequency).
For words $i$ and $j$ drawn from the two knowledge graphs' entity descriptions respectively, the weight is:

$$\mathrm{XDF}(i, j) = \frac{count(i, j)}{count(D)}$$

where $count(i, j)$ is the number of word pairs consisting of words $i$ and $j$ over the text descriptions of all aligned entity pairs, and $count(D)$ is the total number of word pairs formed by the text descriptions of all aligned entity pairs.
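The X-DF counting described above can be sketched directly (names are illustrative; descriptions are assumed tokenized):

```python
from collections import Counter

def xdf_weights(aligned_pairs, desc1, desc2):
    """Cross document frequency: for each pre-aligned entity pair,
    pair every word of one description with every word of the other;
    the weight of word pair (i, j) is its count divided by the total
    number of such cross-language word pairs."""
    pair_count = Counter()
    total = 0
    for e1, e2 in aligned_pairs:
        for i in desc1[e1]:
            for j in desc2[e2]:
                pair_count[(i, j)] += 1
                total += 1
    return {p: c / total for p, c in pair_count.items()}
```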
2.1.4) Weight calculation of the text-graph adjacency matrix:

$$a_{ij} = \begin{cases} \mathrm{TFIDF}(i, j), & i \text{ is an entity and } j \text{ a word of its description} \\ \mathrm{PMI}(i, j), & i, j \text{ are words of the same language} \\ \mathrm{XDF}(i, j), & i, j \text{ are words of different languages} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
2.2) Entity vector calculation in the graph convolutional network: the input of the network is an entity text feature matrix $X_t \in \mathbb{R}^{(n+m) \times d_t}$, obtained by random initialization, where $n$ is the total number of entities, $m$ is the total number of words, and $d_t$ is the entity text feature vector dimension. The overall calculation of the text graph's convolutional network is analogous to step 1.2):

$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$ adds an identity matrix of equal dimension to the adjacency matrix $A$, injecting each node's own information; $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$; the weight matrices $W^{(1)}$ and $W^{(2)}$ are diagonal matrices; and the activation function is $\sigma(\cdot) = \mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
2.3) Loss function: for an entity pair $p = (e_1, e_2) \in S$ taken as a positive example, a negative example pair $p' = (e'_1, e'_2) \in S'_p$ is constructed by randomly replacing entity $e_1$ or $e_2$, where $S'_p$ is the set of negative example entity pairs; the following objective function is then minimized:

$$L_t = \sum_{p \in S} \sum_{p' \in S'_p} \max\left(0,\; f_t(p) + \gamma_t - f_t(p')\right)$$

where $f_t(p) = \lVert h_t(e_1) - h_t(e_2)\rVert_1$ is the entity distance scoring function computing the Manhattan distance between entity text vectors, $h_t(e_1)$ and $h_t(e_2)$ are the text vectors of entities $e_1$ and $e_2$ respectively, and $\gamma_t$ is the margin between text vectors.
Further, the step 3) specifically includes the following sub-steps:
3.1) Corpus processing: available cross-language parallel corpora can be used directly; alternatively, a portion of the monolingual corpus can be extracted and turned into a cross-language parallel corpus with a translation tool. The parallel corpus is processed into sentence-aligned form, and operations such as punctuation filtering and stop-word removal are applied.
3.2) Pre-training cross-language word vectors: cross-language word vector representations are trained with the cross-language word embedding model BilBOWA on the monolingual corpora of the two languages and the sentence-aligned parallel corpus.
3.3) Entity description vector encoding: an entity description is represented by its sequence of pre-trained word vectors $X_d \in \mathbb{R}^{|s| \times d_d}$, where $|s|$ is the total number of words in the description and $d_d$ is the entity description vector dimension. A BiLSTM is trained to minimize the distance between aligned entities' vectors, yielding the vector representation of the description:

$$h_t = \left[\overrightarrow{h_t};\, \overleftarrow{h_t}\right], \qquad h_d = \frac{1}{|s|}\sum_{t=1}^{|s|} h_t$$

where $h_t$ is the hidden vector for the $t$-th word of the text description (the concatenation of the forward and backward LSTM states); averaging the vector representations of all words yields the entity description vector $h_d$.
For an entity pair $p = (e_1, e_2) \in S$ taken as a positive example, a negative example pair $p' = (e'_1, e'_2) \in S'_p$ is constructed by randomly replacing entity $e_1$ or $e_2$, where $S'_p$ is the set of negative example entity pairs; the following objective function is then minimized:

$$L_d = \sum_{p \in S} \sum_{p' \in S'_p} \max\left(0,\; f_d(p) + \gamma_d - f_d(p')\right)$$

where $f_d(p) = \lVert h_d(e_1) - h_d(e_2)\rVert_1$ is the entity distance scoring function computing the Manhattan distance between entity description vectors, $h_d(e_1)$ and $h_d(e_2)$ are the description vectors of entities $e_1$ and $e_2$ respectively, and $\gamma_d$ is the margin between description vectors.
Further, in step 4), the distance between entity pairs is calculated as follows:
The distance between an entity pair $p = (e_1, e_2)$ from the two different knowledge graphs is calculated as:

$$D(p) = \frac{f_s(p)}{d_s} + \alpha\,\frac{f_t(p)}{d_t} + \beta\,\frac{f_d(p)}{d_d}$$

where $d_s$, $d_t$, and $d_d$ are the dimensions of the entity structure vector, the entity text vector, and the entity description vector respectively, and $\alpha$ and $\beta$ are hyper-parameters used to weigh the distances of the three parts.
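The combined distance can be sketched as follows (a sketch under the assumption that each view's Manhattan distance is normalized by its vector dimension before the weighted sum; names are illustrative):

```python
import numpy as np

def l1(u, v):
    """Manhattan distance between two vectors."""
    return float(np.abs(u - v).sum())

def pair_distance(e1, e2, hs, ht, hd, alpha, beta):
    """Weighted, dimension-normalized sum of the structure-view,
    text-view, and description-view distances for a candidate pair."""
    return (l1(hs[e1], hs[e2]) / hs[e1].size
            + alpha * l1(ht[e1], ht[e2]) / ht[e1].size
            + beta * l1(hd[e1], hd[e2]) / hd[e1].size)
```

Normalizing by dimension keeps the three views comparable even when the embedding sizes $d_s$, $d_t$, $d_d$ differ.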
If only entity structure vectors and entity text vectors are used, the distance between the entity pair $p = (e_1, e_2)$ of the two different knowledge graphs is:

$$D(p) = \frac{f_s(p)}{d_s} + \alpha\,\frac{f_t(p)}{d_t}$$

where $\alpha$ is a hyper-parameter used to weigh the two component distances.
Compared with the prior art, the method has the following beneficial effects:
1. The method provides a graph-convolution-based model that encodes entity structure and text to capture cross-language information: by designing suitable node and edge weights it constructs a structure graph and a text graph, optimizes the encoding of entity vectors with graph convolutional networks, and improves the accuracy of cross-language entity alignment.
2. The method encodes semantic vectors of entity descriptions through cross-language word vector pre-training and a bidirectional long short-term memory network, further enriching the encoding of entity text information and improving the effect of cross-language entity alignment.
3. The method achieves good results with little training data, and its improvement over other methods grows as more training data is provided.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a diagram of an overall model of the present invention;
FIG. 3 is a diagram of a knowledge-graph structure and textual information in accordance with one embodiment of the present invention;
FIG. 4 is a diagram of an entity structure vector coding model according to an embodiment of the present invention;
FIG. 5 is a diagram of an entity text vector coding model according to an embodiment of the present invention;
FIG. 6 is a graph of experimental results of an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
As shown in fig. 1, the method for aligning cross-language entities based on knowledge-graph multi-view information provided by the invention comprises the following steps:
1) Entity structure vector encoding based on relation triples: a structure graph is constructed for each of the two languages' knowledge graphs from their relation triples. The structure graph takes entities as nodes and places an edge between entities that share a relation; the edge weights are computed from the relations between the entities, forming the adjacency matrix of the structure graph. A two-layer graph convolutional network is trained on the constructed structure graph, and the graph convolutional networks of the two knowledge graphs share weight matrices. The entity structure vector representation is optimized with a margin-based (triplet) loss over the pre-aligned cross-language entity pair set S and positive/negative example entity pairs.
2) Entity text vector encoding based on entity description information: the two languages' knowledge graphs are merged, and a unified text graph is constructed from the entities and their description texts. The text graph has two types of nodes, entity nodes and description-word nodes, and three types of edges: "entity-descriptor" edges, within-language "descriptor-descriptor" edges, and cross-language "descriptor-descriptor" edges. Weights are computed for each edge type, forming the adjacency matrix. A two-layer graph convolutional network is trained on the constructed text graph, and the entity text vector representation is optimized with a margin-based (triplet) loss over the pre-aligned cross-language entity pair set S and positive/negative example entity pairs.
3) Entity description vector encoding based on entity description information and cross-language corpora: cross-language-aligned word vectors are pre-trained with BilBOWA on the monolingual corpora of the two languages and a cross-language parallel corpus; the sequence of word vectors of each entity description is then fed as input to a bidirectional long short-term memory network, which encodes the description into an entity description vector. The network is optimized by minimizing the distances between the description vectors of the pre-aligned cross-language entity pairs S, yielding final description vectors for all entities.
4) Computing cross-language aligned entity pairs from the multi-view entity vectors: for each entity in one language's knowledge graph, every entity of the other language's knowledge graph is taken as a candidate; the distance between the entity and each candidate is computed from the entity structure vectors, entity text vectors, and entity description vectors obtained in steps 1), 2), and 3); the distances are sorted in ascending order, and the candidate pair with the smallest distance is selected as the aligned entity pair.
Further, in step 1), the weight calculation of the adjacency matrix A, the entity vector calculation in the graph convolutional network, and the loss function are as follows:
1.1) Weight calculation of the adjacency matrix A: for entities $e_i$ and $e_j$, the weight $a_{ij} \in A$ between them is calculated as:

$$fun(r) = \frac{\#Head\_Entities\_of\_r}{\#Triples\_of\_r}, \qquad ifun(r) = \frac{\#Tail\_Entities\_of\_r}{\#Triples\_of\_r}$$

$$a_{ij} = \sum_{\langle e_i, r, e_j\rangle \in G} ifun(r) + \sum_{\langle e_j, r, e_i\rangle \in G} fun(r)$$

where $fun(r)$ and $ifun(r)$ are the influence scores of relation $r$ in the forward and reverse directions respectively, $G$ is the knowledge graph, $\#Triples\_of\_r$ is the number of relation triples involving relation $r$, and $\#Head\_Entities\_of\_r$ and $\#Tail\_Entities\_of\_r$ are the numbers of head entities and tail entities involved in the triples of relation $r$, respectively.
1.2) Entity vector calculation in the graph convolutional network: the input of the network is an entity structure feature matrix $X_s \in \mathbb{R}^{n \times d_s}$, obtained by random initialization, where $n$ is the total number of entities and $d_s$ is the dimension of the entity structure feature vector. The overall calculation of the structure graph's convolutional network is:

$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$ adds an identity matrix of equal dimension to the adjacency matrix $A$, injecting each entity's own information; $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$; the weight matrices $W^{(1)}$ and $W^{(2)}$ are diagonal matrices; and the activation function is $\sigma(\cdot) = \mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
1.3) Loss function: for an entity pair $p = (e_1, e_2) \in S$ taken as a positive example, a negative example pair $p' = (e'_1, e'_2) \in S'_p$ is constructed by randomly replacing entity $e_1$ or $e_2$, where $S'_p$ is the set of negative example entity pairs; the following objective function is then minimized:

$$L_s = \sum_{p \in S} \sum_{p' \in S'_p} \max\left(0,\; f_s(p) + \gamma_s - f_s(p')\right)$$

where $f_s(p) = \lVert h_s(e_1) - h_s(e_2)\rVert_1$ is the entity distance scoring function computing the Manhattan distance between entity structure vectors, $h_s(e_1)$ and $h_s(e_2)$ are the structure vectors of entities $e_1$ and $e_2$ respectively, and $\gamma_s$ is the margin between structure vectors.
Further, in step 2), before the knowledge graphs are merged, the entity description information is preprocessed: illegal characters are filtered out, the text is segmented into words, stop words are removed, and words whose frequency in the corpus is too low are filtered out. The weight calculation of the adjacency matrix A, the entity vector calculation in the graph convolutional network, and the loss function are as follows:
2.1) Weight calculation of the adjacency matrix A: the weights of the three edge types and the weight calculation of the text-graph adjacency matrix are as follows:
2.1.1) "entity-descriptor" edge:
for the edge formed by the entity and the descriptor, the weight is calculated by using the word frequency-inverse document frequency (TF-IDF), and the calculation formula is as follows:
TFIDF(t,d)=TF(t,d)×IDF(t)
where TF (t, d) calculates the frequency with which the word t appears in the entity description d, nt,dIs the number of occurrences of the word t in the entity description d, ∑t′∈dnt′,dIDF (t) is the inverse document frequency of the word t in the entity description set D, | D | is the total number of entity descriptions in the entity description set, | { D ∈ D: t ∈ D } | is the number of entity descriptions in the entity description set that contain the word t.
2.1.2) Within-language "descriptor-descriptor" edge:
for the edges formed between the descriptors of the single language, firstly, the global word co-occurrence condition is calculated through a sliding window, then, the Point Mutual Information (PMI) of two words is calculated to obtain the weight, and for any two words i and j, the weight calculation formula is as follows:
where # W represents the number of sliding windows in all entity description corpuses, # W (i) represents the number of sliding windows containing word i, and # W (i, j) represents the number of sliding windows containing both word i and word j.
2.1.3) Cross-language "descriptor-descriptor" edge:
for the edges formed between the cross-language descriptors, utilizing a pre-aligned cross-language aligning entity pair S, connecting the word in each entity description text with all the words in the description of the aligning entity pairwise, and calculating the frequency of each formed descriptor pair in the descriptor pairs formed by all the aligning entity pairs to enhance the cross-language information. This method is referred to herein using X-DF (Cross Document frequency).
For words $i$ and $j$ drawn from the two knowledge graphs' entity descriptions respectively, the weight is:

$$\mathrm{XDF}(i, j) = \frac{count(i, j)}{count(D)}$$

where $count(i, j)$ is the number of word pairs consisting of words $i$ and $j$ over the text descriptions of all aligned entity pairs, and $count(D)$ is the total number of word pairs formed by the text descriptions of all aligned entity pairs.
2.1.4) Weight calculation of the text-graph adjacency matrix:

$$a_{ij} = \begin{cases} \mathrm{TFIDF}(i, j), & i \text{ is an entity and } j \text{ a word of its description} \\ \mathrm{PMI}(i, j), & i, j \text{ are words of the same language} \\ \mathrm{XDF}(i, j), & i, j \text{ are words of different languages} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
2.2) Entity vector calculation in the graph convolutional network: the input of the network is an entity text feature matrix $X_t \in \mathbb{R}^{(n+m) \times d_t}$, obtained by random initialization, where $n$ is the total number of entities, $m$ is the total number of words, and $d_t$ is the entity text feature vector dimension. The overall calculation of the text graph's convolutional network is analogous to step 1.2):

$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\hat{A} = A + I$ adds an identity matrix of equal dimension to the adjacency matrix $A$, injecting each node's own information; $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$; the weight matrices $W^{(1)}$ and $W^{(2)}$ are diagonal matrices; and the activation function is $\sigma(\cdot) = \mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
2.3) Loss function: for an entity pair $p = (e_1, e_2) \in S$ taken as a positive example, a negative example pair $p' = (e'_1, e'_2) \in S'_p$ is constructed by randomly replacing entity $e_1$ or $e_2$, where $S'_p$ is the set of negative example entity pairs; the following objective function is then minimized:

$$L_t = \sum_{p \in S} \sum_{p' \in S'_p} \max\left(0,\; f_t(p) + \gamma_t - f_t(p')\right)$$

where $f_t(p) = \lVert h_t(e_1) - h_t(e_2)\rVert_1$ is the entity distance scoring function computing the Manhattan distance between entity text vectors, $h_t(e_1)$ and $h_t(e_2)$ are the text vectors of entities $e_1$ and $e_2$ respectively, and $\gamma_t$ is the margin between text vectors.
Further, the step 3) specifically includes the following sub-steps:
3.1) Corpus processing: available cross-language parallel corpora can be used directly; alternatively, a portion of the monolingual corpus can be extracted and turned into a cross-language parallel corpus with a translation tool. The parallel corpus is processed into sentence-aligned form, and operations such as punctuation filtering and stop-word removal are applied.
3.2) Pre-training cross-language word vectors: cross-language word vector representations are trained with the cross-language word embedding model BilBOWA on the monolingual corpora of the two languages and the sentence-aligned parallel corpus.
3.3) Entity description vector encoding: an entity description is represented by its sequence of pre-trained word vectors $X_d \in \mathbb{R}^{|s| \times d_d}$, where $|s|$ is the total number of words in the description and $d_d$ is the entity description vector dimension. A BiLSTM is trained to minimize the distance between aligned entities' vectors, yielding the vector representation of the description:

$$h_t = \left[\overrightarrow{h_t};\, \overleftarrow{h_t}\right], \qquad h_d = \frac{1}{|s|}\sum_{t=1}^{|s|} h_t$$

where $h_t$ is the hidden vector for the $t$-th word of the text description (the concatenation of the forward and backward LSTM states); averaging the vector representations of all words yields the entity description vector $h_d$.
For an entity pair p = (e1, e2) ∈ S taken as a positive example, a negative example entity pair p' = (e'1, e'2) ∈ S_p' is constructed by randomly replacing entity e1 or e2, where S_p' is the set of negative example entity pairs; the following objective function is then minimized:
where f_d(p) = ||h_d(e1) − h_d(e2)||_1 is the entity distance scoring function, computing the Manhattan distance between entity description vectors; h_d(e1) and h_d(e2) are the description vectors of entities e1 and e2, respectively; γ_d is the margin constraint between description vectors.
Further, in the step 4), the distance between the entity pairs is calculated in the following specific manner:
The distance between an entity pair p = (e1, e2) from the two different knowledge graphs is calculated by the formula:
where d_s, d_t, and d_d denote the dimensions of the entity structure vector, the entity text vector, and the entity description vector, respectively; α and β are hyper-parameters used to weigh the distances of the three parts.
If only entity structure vectors and entity text vectors are used, the distance between an entity pair p = (e1, e2) of the two different knowledge graphs is calculated by the formula:
where α is a hyper-parameter used to trade off the two-part distance.
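The combined three-view distance can be sketched as below. Two details are assumptions of this illustration, since the formula images are not reproduced in the text: each view's Manhattan distance is normalized by its vector dimension (consistent with the d_s, d_t, d_d terms), and the third view is weighted by (1 − α − β).

```python
import numpy as np

def pair_distance(e1, e2, alpha, beta):
    """Combined distance of an entity pair from the structure ('s'), text ('t'),
    and description ('d') views. Per-dimension normalization and the
    (1 - alpha - beta) weight on the third view are assumptions of this sketch."""
    l1 = lambda a, b: np.abs(a - b).sum()   # Manhattan distance
    return (alpha * l1(e1['s'], e2['s']) / len(e1['s'])
            + beta * l1(e1['t'], e2['t']) / len(e1['t'])
            + (1 - alpha - beta) * l1(e1['d'], e2['d']) / len(e1['d']))
```

Alignment then takes, for each entity of one graph, the candidate of the other graph with the smallest `pair_distance`.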
Examples
As shown in fig. 3, an example of the method is given. The specific steps of the example are described in detail below in conjunction with the method of the present invention (the flow is shown in fig. 1 and the model in fig. 2):
(1) Entity structure vector coding based on relation triples: a structure graph is constructed for the knowledge graph of each of the two languages from its relation triples. The structure graph takes entities as nodes (such as the entity "Batman" in each language), forms edges between related entities (such as "Batman" - "Superman"), and calculates the specific weight of each edge from the relations between entities to form the adjacency matrix of the graph. As shown in fig. 4, a two-layer graph convolutional network is trained on the constructed structure graph, with the networks of the two knowledge graphs sharing a weight matrix. The entity structure vector representation is optimized according to the pre-aligned cross-language aligned entity pairs and the triplet loss function over positive and negative example entity pairs.
(2) Entity text vector encoding based on entity description information: the entity description information is preprocessed by filtering illegal characters, performing word segmentation, removing stop words, and filtering out words whose frequency in the corpus is too low. The knowledge graphs of the two languages are combined, and a unified text graph is constructed from the entities and their description text. The text graph has two types of nodes, entity nodes (such as "Batman" in each language) and word nodes from the entity descriptions (such as "DC Comics" in each language), and three types of edges: "entity-descriptor" edges (e.g., "Batman" - "hero"), single-language "descriptor-descriptor" edges (e.g., "DC Comics" - "Batman"), and cross-language "descriptor-descriptor" edges (e.g., the Chinese and English forms of "DC Comics"). A weight is calculated for each type of edge, forming an adjacency matrix. As shown in fig. 5, a two-layer graph convolutional network is trained on the constructed text graph, and the entity text vector representation is optimized according to the pre-aligned cross-language aligned entity pair set S and the triplet loss function over positive and negative example entity pairs.
(3) Entity description vector coding based on entity description information and cross-language corpora: cross-language aligned word vectors are pre-trained with Bilbowa on the monolingual corpora and cross-language parallel corpora of the two languages; then the sequence of word vectors of each entity description is taken as input, and the entity description is encoded with a bidirectional long short-term memory network (BiLSTM) to obtain the entity description vector. The network is optimized by minimizing the distance between the description vectors of the pre-aligned cross-language aligned entity pair set S, yielding the final description vectors of all entities.
(4) Computing cross-language aligned entity pairs from the multi-view entity vectors: for each entity in one language's knowledge graph, each entity of the other language's knowledge graph is taken as a candidate; the distance between the entity and each candidate is calculated from the entity structure vector, entity text vector, and entity description vector (all 100-dimensional), and the pair with the minimum distance is selected as the aligned entity pair, finally obtaining the aligned entity pair "Batman" - "Batman".
The cross-language entity alignment results of this example are shown in Table 1; the model of this method is denoted STGCN. SE, TE, and DE denote entity structure encoding, entity text encoding, and entity description encoding, respectively. The evaluation index Hits@k is the probability that, when finding aligned entities for all entities of the current language, the correct aligned entity is hit among the first k candidates. The final experimental results surpass the other compared methods on the Chinese dataset of the public dataset DBP15K, reaching an alignment accuracy of 56.1%.
TABLE 1 Cross-language entity alignment experimental results
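The Hits@k evaluation index described above can be sketched as follows; the dictionary-based input format is an assumption of this illustration.

```python
def hits_at_k(distances, gold, k):
    """Hits@k: the fraction of source entities whose gold-aligned target entity
    appears among the k nearest candidates. `distances` maps each source entity
    to a {candidate: distance} dictionary; `gold` maps it to its true alignment."""
    hits = 0
    for src, cand_dist in distances.items():
        ranked = sorted(cand_dist, key=cand_dist.get)   # candidates, nearest first
        if gold[src] in ranked[:k]:
            hits += 1
    return hits / len(distances)
```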
When entity structure coding and entity text coding are adopted, the effect of different proportions of pre-aligned entity pairs on the English dataset of DBP15K is shown in FIG. 6; compared with other methods, this method achieves the best results across small to large amounts of data, with a greater advantage when the amount of data is large.
The above-described embodiments are intended to illustrate rather than limit the invention; any modifications and variations of the present invention fall within the spirit of the invention and the scope of the appended claims.
Claims (7)
1. A cross-language entity alignment method based on knowledge graph multi-view information is characterized by comprising the following steps:
1) Entity structure vector coding based on relation triples: constructing a structure graph for the knowledge graph of each of the two languages from its relation triples; the structure graph takes entities as nodes, forms edges between related entities, and calculates the specific weight of each edge from the relations between entities to form the adjacency matrix of the graph; on the constructed structure graph, a two-layer graph convolutional network is trained, continuously updating the vector representation of the current entity using the entity and the codes of its surrounding entities; the graph convolutional networks of the two knowledge graphs share a weight matrix; and the entity structure vector representation is optimized according to the pre-aligned cross-language aligned entity pair set S and the triplet loss function over positive and negative example entity pairs.
2) Entity text vector encoding based on entity description information: combining the knowledge graphs of the two languages and constructing a unified text graph from the entities and description text; the text graph has two types of nodes, entity nodes and word nodes from the entity descriptions, and three types of edges: "entity-descriptor" edges, single-language "descriptor-descriptor" edges, and cross-language "descriptor-descriptor" edges; a weight is calculated for each type of edge, forming an adjacency matrix; on the constructed text graph, a two-layer graph convolutional network is trained, and the entity text vector representation is optimized according to the pre-aligned cross-language aligned entity pair set S and the triplet loss function over positive and negative example entity pairs.
3) Entity description vector coding based on entity description information and cross-language corpora: pre-training cross-language aligned word vectors with Bilbowa on the monolingual corpora and cross-language parallel corpora of the two languages, then taking the sequence of word vectors of each entity description as input and encoding the entity description with a bidirectional long short-term memory network (BiLSTM) to obtain entity description vectors; the network is optimized by minimizing the distance between the description vectors of the pre-aligned cross-language aligned entity pair set S, yielding the final description vectors of all entities.
4) Computing cross-language aligned entity pairs from the multi-view entity vectors: for each entity in one language's knowledge graph, taking each entity of the other language's knowledge graph as a candidate entity, calculating the distance between the entity and each candidate from the entity structure vector, entity text vector, and entity description vector obtained in steps 1), 2), and 3) respectively, sorting the distances from small to large, and selecting the entity pair with the minimum distance as an aligned entity pair.
2. The method according to claim 1, wherein in step 1), the weight calculation of the adjacency matrix A and the entity vector calculation and loss function in the graph convolutional network are specifically as follows:
1.1) weight calculation of adjacency matrix A: for entities e_i and e_j, the weight a_ij ∈ A between them is calculated as:
where fun(r) and ifun(r) are the influence scores of relation r in the forward and reverse directions, respectively; G is the knowledge graph; #Triples_of_r is the number of relation triples involving relation r; and #Head_Entities_of_r and #Tail_Entities_of_r are the numbers of head entities and tail entities involved in the triples of relation r, respectively.
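The counts named above can be computed as sketched below. The ratio form fun(r) = #Head_Entities_of_r / #Triples_of_r and ifun(r) = #Tail_Entities_of_r / #Triples_of_r is an assumption of this illustration, since the formula image is not reproduced in the text.

```python
from collections import defaultdict

def relation_scores(triples):
    """Forward and reverse influence scores of each relation r, built from
    #Triples_of_r, #Head_Entities_of_r, and #Tail_Entities_of_r.
    The ratio form of fun and ifun is an assumption of this sketch."""
    n_triples = defaultdict(int)
    heads = defaultdict(set)
    tails = defaultdict(set)
    for h, r, t in triples:
        n_triples[r] += 1      # #Triples_of_r
        heads[r].add(h)        # distinct head entities of r
        tails[r].add(t)        # distinct tail entities of r
    fun = {r: len(heads[r]) / n_triples[r] for r in n_triples}
    ifun = {r: len(tails[r]) / n_triples[r] for r in n_triples}
    return fun, ifun
```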
1.2) entity vector calculation in the graph convolutional network: the input of the graph convolutional network is an entity structure feature matrix H_s^(0) ∈ R^(n×d_s), obtained by random initialization, where n is the total number of entities and d_s is the dimension of the entity structure feature vector; the overall calculation formula of the graph convolutional network on the structure graph is:
where Ã = A + I is the adjacency matrix A augmented with an identity matrix of equal dimension, adding each entity's self-information, and D̃ is the diagonal node degree matrix of Ã; the weight matrices W_s^(0) and W_s^(1) are diagonal matrices, and the activation function σ is ReLU(·) = max(0, ·).
1.3) loss function: for an entity pair p = (e1, e2) ∈ S taken as a positive example, a negative example entity pair p' = (e'1, e'2) ∈ S_p' is constructed by randomly replacing entity e1 or e2, where S_p' is the set of negative example entity pairs; the following objective function is then minimized:
where f_s(p) = ||h_s(e1) − h_s(e2)||_1 is the entity distance scoring function, computing the Manhattan distance between entity structure vectors; h_s(e1) and h_s(e2) are the structure vectors of entities e1 and e2, respectively; γ_s is the margin constraint between structure vectors.
3. The method as claimed in claim 1, wherein in step 2), before the knowledge graphs are combined, the entity description information is preprocessed: illegal characters are filtered, word segmentation is performed, stop words are removed, and words whose frequency in the corpus is too low are filtered out.
4. The method according to claim 1, wherein in step 2), the weight calculation of the adjacency matrix A and the entity vector calculation and loss function in the graph convolutional network are specifically as follows:
2.1) weight calculation of adjacency matrix A: the weight of the three types of edges and the weight calculation mode of the text map adjacency matrix are specifically as follows:
2.1.1) "entity-descriptor" edge:
For the edge formed by an entity and a descriptor, the weight is calculated using term frequency-inverse document frequency (TF-IDF); the calculation formula is:
TFIDF(t,d)=TF(t,d)×IDF(t)
where TF(t, d) is the frequency with which word t appears in entity description d; n_{t,d} is the number of occurrences of word t in entity description d, and Σ_{t'∈d} n_{t',d} is the total number of words in entity description d; IDF(t) is the inverse document frequency of word t in the entity description set D, |D| is the total number of entity descriptions in the set, and |{d ∈ D : t ∈ d}| is the number of entity descriptions in the set containing word t.
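The TF-IDF weight defined above can be sketched as follows, with each entity description given as a list of tokens; a natural-log IDF is an assumption of this illustration.

```python
import math

def tfidf(term, doc, docs):
    """TF-IDF weight of an entity-descriptor edge:
    TF(t, d) = n_{t,d} / (total words in d),
    IDF(t)   = log(|D| / |{d in D : t in d}|).
    `doc` is one entity description (token list); `docs` is the description set D."""
    tf = doc.count(term) / len(doc)                    # n_{t,d} / sum of n_{t',d}
    n_docs_with_t = sum(1 for d in docs if term in d)  # |{d in D : t in d}|
    idf = math.log(len(docs) / n_docs_with_t)          # inverse document frequency
    return tf * idf
```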
2.1.2) single language "descriptor-descriptor" side:
For the edges formed between descriptors of a single language, the global word co-occurrence statistics are first collected with a sliding window, and then the pointwise mutual information (PMI) of two words is calculated to obtain the weight; for any two words i and j, the weight calculation formula is:
where #W is the number of sliding windows over all entity description corpora, #W(i) is the number of sliding windows containing word i, and #W(i, j) is the number of sliding windows containing both word i and word j.
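The PMI weight from sliding-window counts can be sketched as follows, with each window given as a set of words:

```python
import math

def pmi(i, j, windows):
    """PMI weight of a single-language descriptor-descriptor edge:
    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ), with probabilities estimated
    from the sliding-window counts #W, #W(i), #W(j), and #W(i, j)."""
    n = len(windows)                                     # #W
    n_i = sum(1 for w in windows if i in w)              # #W(i)
    n_j = sum(1 for w in windows if j in w)              # #W(j)
    n_ij = sum(1 for w in windows if i in w and j in w)  # #W(i, j)
    return math.log((n_ij / n) / ((n_i / n) * (n_j / n)))
```

In practice only word pairs with positive PMI are usually kept as edges.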
2.1.3) Cross-language "descriptor-descriptor" edge:
For the edges formed between cross-language descriptors, the pre-aligned cross-language aligned entity pair set S is used to connect, pairwise, the words in each entity's description text with all the words in the description of its aligned counterpart; the frequency of each resulting descriptor pair among the descriptor pairs formed by all aligned entity pairs is calculated to enhance cross-language information. For words i and j from the entity descriptions of the two knowledge graphs respectively, the weight calculation formula is:
where count(i, j) is the number of word pairs consisting of words i and j formed from the text descriptions of all aligned entity pairs, and count(D) is the total number of word pairs formed from the text descriptions of all aligned entity pairs.
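The cross-language edge weights count(i, j) / count(D) can be sketched as follows, taking each aligned pair as a tuple of two token lists:

```python
from collections import Counter

def cross_lingual_weights(aligned_pairs):
    """Cross-language descriptor-descriptor edge weights: every word in one
    entity's description is paired with every word in its aligned counterpart's
    description; each pair's weight is its frequency among all such pairs,
    i.e. count(i, j) / count(D)."""
    counts = Counter()
    for desc1, desc2 in aligned_pairs:   # descriptions of one aligned entity pair
        for i in desc1:
            for j in desc2:
                counts[(i, j)] += 1      # count(i, j)
    total = sum(counts.values())         # count(D): all cross-language word pairs
    return {pair: c / total for pair, c in counts.items()}
```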
2.1.4) weight calculation of the text graph adjacency matrix:
2.2) entity vector calculation in the graph convolutional network: the input of the graph convolutional network is an entity text feature matrix H_t^(0) ∈ R^((n+m)×d_t), obtained by random initialization, where n is the total number of entities, m is the total number of words, and d_t is the dimension of the entity text feature vector; the overall calculation formula of the graph convolutional network on the text graph is:
where Ã = A + I is the adjacency matrix A augmented with an identity matrix of equal dimension, adding each entity's self-information, and D̃ is the diagonal node degree matrix of Ã; the weight matrices W_t^(0) and W_t^(1) are diagonal matrices, and the activation function σ is ReLU(·) = max(0, ·).
2.3) loss function: for an entity pair p = (e1, e2) ∈ S taken as a positive example, a negative example entity pair p' = (e'1, e'2) ∈ S_p' is constructed by randomly replacing entity e1 or e2, where S_p' is the set of negative example entity pairs; the following objective function is then minimized:
where f_t(p) = ||h_t(e1) − h_t(e2)||_1 is the entity distance scoring function, computing the Manhattan distance between entity text vectors; h_t(e1) and h_t(e2) are the text vectors of entities e1 and e2, respectively; γ_t is the margin constraint between text vectors.
5. The method for aligning cross-language entities based on knowledge-graph multi-view information as claimed in claim 1, wherein the step 3) comprises the following steps:
3.1) corpus processing: the cross-language parallel corpus is processed to be sentence-aligned.
3.2) pre-training across language word vectors: cross-language word vector representations are trained using a cross-language word vector training model Bilbowa based on single-language corpora of two languages and sentence-aligned parallel corpora.
3.3) entity description vector coding: pre-training word vector sequence corresponding entity description with wordsRepresenting, | s | is the total number of words in the entity description, ddFor the entity description vector dimension, the distance between aligned entity vectors is optimized by using BilSTM training to obtain the vector representation of the entity description, and the specific formula is as follows:
where h_t is the vector corresponding to the t-th word of the text description; the vector representations of all words are averaged to obtain the entity description vector h_d;
For an entity pair p = (e1, e2) ∈ S taken as a positive example, a negative example entity pair p' = (e'1, e'2) ∈ S_p' is constructed by randomly replacing entity e1 or e2, where S_p' is the set of negative example entity pairs; the following objective function is then minimized:
where f_d(p) = ||h_d(e1) − h_d(e2)||_1 is the entity distance scoring function, computing the Manhattan distance between entity description vectors; h_d(e1) and h_d(e2) are the description vectors of entities e1 and e2, respectively; γ_d is the margin constraint between description vectors.
6. The method according to claim 5, wherein in step 3.1), existing cross-language parallel corpora can be used directly, or partial corpora can be extracted from monolingual corpora and turned into cross-language parallel corpora through a translation tool; the cross-language parallel corpora are processed into sentence-aligned form, operations such as punctuation filtering and stop-word removal are completed on the corpora, and cross-language word vector pre-training is then performed.
7. The method according to claim 1, wherein in step 4), the distance between the entity pairs is calculated as follows:
The distance between an entity pair p = (e1, e2) from the two different knowledge graphs is calculated by the formula:
where d_s, d_t, and d_d denote the dimensions of the entity structure vector, the entity text vector, and the entity description vector, respectively; α and β are hyper-parameters used to weigh the distances of the three parts;
if only entity structure vectors and entity text vectors are used, the distance between an entity pair p = (e1, e2) of the two different knowledge graphs is calculated by the formula:
where α is a hyper-parameter used to trade off the two-part distance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010512003.9A CN111680488B (en) | 2020-06-08 | 2020-06-08 | Cross-language entity alignment method based on knowledge graph multi-view information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680488A true CN111680488A (en) | 2020-09-18 |
CN111680488B CN111680488B (en) | 2023-07-21 |
Family
ID=72453997
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287123A (en) * | 2020-11-19 | 2021-01-29 | 国网湖南省电力有限公司 | Entity alignment method and device based on edge type attention mechanism |
CN112287126A (en) * | 2020-12-24 | 2021-01-29 | 中国人民解放军国防科技大学 | Entity alignment method and device suitable for multi-mode knowledge graph |
CN112380864A (en) * | 2020-11-03 | 2021-02-19 | 广西大学 | Text triple labeling sample enhancement method based on translation |
CN113487088A (en) * | 2021-07-06 | 2021-10-08 | 哈尔滨工业大学(深圳) | Traffic prediction method and device based on dynamic space-time diagram convolution attention model |
CN113987121A (en) * | 2021-10-21 | 2022-01-28 | 泰康保险集团股份有限公司 | Question-answer processing method, device, equipment and readable medium of multi-language reasoning model |
CN114357114A (en) * | 2022-01-04 | 2022-04-15 | 新华智云科技有限公司 | Entity cleaning method and system based on unsupervised learning |
CN114896394A (en) * | 2022-04-18 | 2022-08-12 | 桂林电子科技大学 | Event trigger detection and classification method based on multi-language pre-training model |
CN115795060A (en) * | 2023-02-06 | 2023-03-14 | 吉奥时空信息技术股份有限公司 | Entity alignment method based on knowledge enhancement |
CN117435714A (en) * | 2023-12-20 | 2024-01-23 | 湖南紫薇垣信息系统有限公司 | Knowledge graph-based database and middleware problem intelligent diagnosis system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110232186A (en) * | 2019-05-20 | 2019-09-13 | 浙江大学 | The knowledge mapping for merging entity description, stratification type and text relation information indicates learning method |
CN110704576A (en) * | 2019-09-30 | 2020-01-17 | 北京邮电大学 | Text-based entity relationship extraction method and device |
CN110955780A (en) * | 2019-10-12 | 2020-04-03 | 中国人民解放军国防科技大学 | Entity alignment method for knowledge graph |
Non-Patent Citations (5)
Title |
---|
HONG YANG et al.: "Guiding Cross-lingual Entity Alignment via Adversarial Knowledge Embedding" |
ZHANG Hong, WU Fei et al.: "A Cross-Media Retrieval Method Based on Content Correlation" |
YANG Qian: "Research on Multi-Granularity Relation Linking in Knowledge Graphs" |
WANG Weiwei; WANG Zhigang; PAN Liangming; LIU Yang; ZHANG Jiangtao: "Research on the Construction of a Bilingual Film and Television Knowledge Graph" |
SU Jialin; WANG Yuanzhuo; JIN Xiaolong; CHENG Xueqi: "Entity Alignment Method with Adaptive Attribute Selection" |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||