CN109299284B - Knowledge graph representation learning method based on structural information and text description - Google Patents


Info

Publication number: CN109299284B (application CN201811011812.0A; first published as CN109299284A)
Authority: CN (China)
Legal status: Active (granted)
Prior art keywords: entity, vector, representation, description, text
Other languages: Chinese (zh)
Inventors: Yao Hong (姚宏), Li Shengwen (李圣文), Li Qingtao (李清涛), Liu Chao (刘超), Dong Lijun (董理君), Kang Xiaojun (康晓军)
Current and original assignee: China University of Geosciences
Application filed by China University of Geosciences; priority to CN201811011812.0A; published as CN109299284A, granted and published as CN109299284B

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph representation learning method based on structural information and text description, which maps the entities and relations in triples into a low-dimensional continuous real-valued space, with the aim of improving the vector representation of entities in knowledge representation. The text description corresponding to each entity is obtained from the existing knowledge base Freebase; word2vec is used to produce a word vector representation for each description (the mean of the word vectors, or a doc2vec-style sentence vector, can also serve as the description's vector representation), and the word vectors are then used as the input of a CNN text encoder to obtain a description-text-based representation vector for each entity. Weights in a joint representation then evaluate the influence of the symbol-based representation vector, the network-structure-based representation vector and the description-text-based representation vector on the final representation vector of each entity in the knowledge base, completing the fusion of structural and textual information and improving the accuracy of the knowledge graph representation.

Description

Knowledge graph representation learning method based on structural information and text description
Technical Field
The invention particularly relates to a knowledge graph representation learning method based on structural information and text description.
Background
The knowledge graph is an important component of NLP technology in tasks such as intelligent question answering, web search and semantic analysis. Knowledge graphs tend to be large, containing hundreds of millions of entities and billions of facts, yet they are often not complete enough; knowledge graph completion is therefore used to alleviate data sparseness in the knowledge graph. Knowledge graphs are usually represented as a network in which nodes represent entities, edges represent relationships between two entities, and each piece of knowledge is expressed as a triple (head entity, relation, tail entity). With symbolic representations such as triples, designers must devise different graph algorithms for different knowledge graph completion applications, and as the scale of the knowledge graph grows, such computation becomes increasingly infeasible because of poor scalability. In addition, a graph-based KG faces data sparseness and related problems in application and is inconvenient for machine learning, while in the current big-data era machine learning is an indispensable tool for big-data automation and intelligence. Facing these problems, knowledge graph representation learning (also called knowledge graph embedding) has been proposed. Representation learning for knowledge graphs aims to represent the entities and relations of a knowledge graph as dense low-dimensional real-valued vectors, so that entities, relations and the complex associations among them can be computed efficiently in the low-dimensional space; it plays an important role in the construction, reasoning, fusion, mining and application of knowledge graphs.
Existing representation learning mainly comprises distributed-representation-based TransX models and neural-network-based models. Translation-based models perform well in knowledge representation learning; however, most of them consider only the symbolic representation of fact triples in the knowledge graph when projecting vectors, ignoring implicit semantic information in the knowledge graph, so the learned vectors may not accurately represent the semantic relationships the knowledge graph contains. Existing knowledge bases hold a large amount of entity text description information; the text descriptions corresponding to triple entities contain considerable extra semantic information, and combining this text information can provide more accurate semantic representations for entities and helps to discover semantic correlations among different entities. Of course, in existing representation methods that use entity descriptions, simply concatenating the symbol-based triple structure vector and the text description vector cannot accurately determine whether the information from the two sources is reasonable for the entity's final representation in the multi-dimensional vector space. When representation learning is performed, the relations and entities in the knowledge graph are projected into a multi-dimensional vector space; the specific physical meaning of each vector is currently hard to explain, and only relative positions are meaningful.
Consider, for example, triples with text descriptions in the knowledge base Freebase, where the text description corresponding to each entity provides certain semantic information for the entity's representation in its triples. In many knowledge graph representation learning methods, however, symbol-based triple learning considers only the structural information of the triples themselves when processing them, while text-based representation learning methods simply concatenate the structural information vector and the text information vector; the semantic information in the text is thus not used efficiently to improve the entity's reasonable representation in the vector space. Moreover, the relative structure information of the entity in the graph is not added to the entity's representation vector, so some of the entity's information is lost.
Disclosure of Invention
To address the defect that prior knowledge graph representation learning methods consider only the symbolic representation of fact triples and ignore implicit semantic information in the knowledge graph, the technical problem to be solved by the invention is to provide a knowledge graph representation learning method based on structural information and text description.
A knowledge graph representation learning method based on structural information and text description comprises the following steps:
Step 1: acquiring triple information from a preset knowledge base, wherein each piece of triple information comprises a head entity, a relation and a tail entity; processing the acquired triple information with a TransE learning method to obtain the representation vectors of the head entity, relation and tail entity in each triple, which together form a symbol-based representation vector;
Step 2: storing each symbol-based representation vector obtained in step 1 in a database and establishing a corresponding index;
Step 3: querying each entity in the preset knowledge base in turn to obtain the entities corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
Step 4: obtaining the entity set corresponding to each queried entity from step 3, and for each queried entity: querying the ids of all entities in the corresponding entity set in the preset knowledge base, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
Step 5: learning a network-structure-based representation vector with a skip-gram model from each entity id sequence obtained in step 4;
Step 6: preprocessing the description texts of all entities in the preset knowledge base;
Step 7: generating word vectors for the preprocessed description texts of the entities with the CBOW method of word2vec, obtaining representation word vectors;
Step 8: for each representation word vector obtained in step 7: using the representation word vectors as the input of a CNN encoder with two convolution layers and two pooling layers, and learning a description-text-based representation vector for each entity;
Step 9: concatenating the symbol-based, network-structure-based and description-text-based representation vectors of the same entity to obtain the concatenated vector of each entity;
Step 10: learning the concatenated vector of each entity with a TransE learning method to obtain the final representation vector of each entity.
Further, a subset FB15k of Freebase is used as the preset knowledge base.
Further, the preprocessing in step 6 specifically includes removing stop words, and connecting entity names composed of a plurality of characters as a word.
Further, in the process of generating the word vector in step 7, the dimension size, the min-count, and the sliding window value of each word vector need to be set.
The invention proposes adding the network structure to the representation learning of the knowledge graph, aiming to effectively fuse the vectors of three information sources (triple symbolic representation, text description representation and relative network structure) so as to improve the representation quality of each entity vector and provide vector sources containing more semantic and structural information for upper-layer applications, while introducing the network structure information of the entities' relative positions in the knowledge graph into knowledge representation for the first time.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a structural diagram of a knowledge graph representation learning method based on structural information and text description according to the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
A knowledge graph representation learning method based on structural information and text description comprises the following steps:
Step 1: acquiring triple information from the preset knowledge base Freebase, wherein the triple information comprises a head entity, a relation and a tail entity; the representation vectors of the head entity, relation and tail entity are obtained with a TransE learning method, forming a symbol-based representation vector (h, r, t), where h denotes the representation vector of the head entity, r that of the relation and t that of the tail entity, and the dimension of each vector is set to 100; the loss function is as follows:
L = Σ_{(h,r,t)∈S} Σ_{(h′,r′,t′)∈S′} [ε + ‖h + r − t‖ − ‖h′ + r′ − t′‖]_+
where S denotes the set of positive-example triples in the knowledge base and S′ the set of negative-example triples.
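Step 1's objective can be sketched in NumPy. The following is a minimal illustration of the standard TransE margin-based ranking loss with L1 distance, not the patent's actual implementation; the toy embeddings are placeholders:

```python
import numpy as np

def transe_score(h, r, t):
    """Dissimilarity d(h + r, t) under the L1 norm."""
    return np.abs(h + r - t).sum()

def margin_loss(pos, neg, margin=1.0):
    """[margin + d(pos) - d(neg)]_+ for one positive/negative triple pair."""
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))

rng = np.random.default_rng(0)
h, r = rng.normal(size=100), rng.normal(size=100)  # 100-dim vectors, as in step 1
t = h + r                        # an ideal positive triple: d(h + r, t) = 0
t_neg = rng.normal(size=100)     # corrupted tail entity
loss = margin_loss((h, r, t), (h, r, t_neg))
```

In full training the loss is summed over all positive triples and their sampled negatives and minimized by stochastic gradient descent.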
Step 2: storing the symbol-based representation vectors obtained in step 1 in a database and establishing a corresponding index;
Step 3: querying each entity in the preset knowledge base in turn to obtain the entities and relations corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
Step 4: obtaining the entity set corresponding to the queried entity from step 3, querying the ids of all entities in the corresponding entity set in the database, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
Step 5: learning a network-structure-based representation vector with a skip-gram model from the entity id sequence obtained in step 4.
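Steps 4 and 5 correspond to a DeepWalk-style pipeline: uniform random walks over the entity graph produce id sequences, which a skip-gram model then embeds. A minimal sketch of the walk generation (the adjacency structure and entity ids here are hypothetical):

```python
import random

def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=42):
    """Generate entity-id sequences by uniform random walk over the graph;
    the resulting sequences are then fed to a skip-gram model."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1])
                if not neighbors:          # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# hypothetical toy graph keyed by entity id
graph = {"e1": ["e2", "e3"], "e2": ["e1"], "e3": ["e1", "e2"]}
walks = random_walks(graph)
```

Each walk is a sentence-like id sequence, so a standard skip-gram implementation can be trained on it unchanged.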
Step 6: selecting entity description texts from the preset knowledge base and preprocessing them: removing stop words, and joining entity names composed of several words so that each name is treated as a single word;
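The preprocessing in step 6 can be sketched as follows; the stop-word list, the example sentence and the entity names are illustrative only:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and"}  # illustrative subset

def preprocess(description, entity_names):
    """Lowercase the text, join multi-word entity names with '_' so each
    name becomes a single token, and drop stop words (step 6)."""
    text = description.lower()
    for name in entity_names:                              # e.g. "new york"
        text = text.replace(name.lower(), name.lower().replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("New York is a city in the United States",
                    ["New York", "United States"])
```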
Step 7: generating word vectors for the entity description texts preprocessed in step 6 with the CBOW method of word2vec; during model training, the dimension of each word vector is set to 100, and suitable min-count and sliding-window values are set to obtain the representation word vectors.
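To make the CBOW mechanics of step 7 concrete, the following sketch performs a single full-softmax CBOW update in NumPy: the context word vectors are averaged and both weight matrices are nudged toward predicting the center word. This is illustrative only; real word2vec training uses hierarchical softmax or negative sampling, and the vocabulary and word ids below are placeholders:

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.1):
    """One CBOW training step; returns the cross-entropy loss."""
    h = W_in[context_ids].mean(axis=0)        # averaged context vector
    scores = W_out @ h                        # one score per vocabulary word
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax probabilities
    err = p.copy()
    err[center_id] -= 1.0                     # gradient of loss w.r.t. scores
    grad_h = W_out.T @ err                    # gradient w.r.t. hidden vector
    W_out -= lr * np.outer(err, h)
    W_in[context_ids] -= lr * grad_h / len(context_ids)
    return -np.log(p[center_id])

rng = np.random.default_rng(0)
vocab, dim = 20, 100                          # 100-dim vectors, as in step 7
W_in = rng.normal(scale=0.1, size=(vocab, dim))
W_out = rng.normal(scale=0.1, size=(vocab, dim))
losses = [cbow_step(W_in, W_out, [1, 2, 4, 5], 3) for _ in range(50)]
```

After training, the rows of W_in serve as the representation word vectors.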
Step 8: using the representation word vectors obtained in step 7 as the input of a CNN encoder with two convolution layers and two pooling layers, and learning the description-text-based representation vector. Because the CBOW model ignores word order in the text, this patent adopts a CNN as a text encoder on top of the CBOW model to encode the description text of each entity;
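The shape flow of a two-convolution, two-pooling text encoder as in step 8 can be sketched with NumPy. The filter counts, widths and random weights below are assumptions for illustration (in practice the kernels are learned):

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution over a word-vector sequence, with ReLU.
    x: (seq_len, dim); kernels: (n_filters, width, dim) -> (out_len, n_filters)."""
    n_f, width, _ = kernels.shape
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, n_f))
    for i in range(out_len):
        window = x[i:i + width]               # (width, dim)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)               # ReLU

def max_pool(x, size=2):
    """Non-overlapping max pooling along the sequence axis."""
    trimmed = x[: (x.shape[0] // size) * size]
    return trimmed.reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
words = rng.normal(size=(12, 100))            # 12 word vectors, 100-dim
k1 = rng.normal(size=(64, 3, 100))            # layer 1: 64 filters of width 3
h = max_pool(conv1d(words, k1))               # conv + pool, layer 1
k2 = rng.normal(size=(100, 2, 64))            # layer 2: 100 filters of width 2
h = max_pool(conv1d(h, k2))                   # conv + pool, layer 2
entity_desc_vec = h.max(axis=0)               # final 100-dim description vector
```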
Step 9: concatenating the symbol-based representation vector, the network-structure-based representation vector and the description-text-based representation vector to obtain the entity's concatenated vector:
e = [e_s : e_g : e_d]
where e_s denotes the structure vector, e_g the network-structure-based representation vector, e_d the description-text-based representation vector, and e the entity's concatenated vector;
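The concatenation of step 9 is direct; a toy sketch with placeholder 100-dimensional vectors for one entity:

```python
import numpy as np

# placeholder 100-dim vectors for one entity from the three sources
e_s = np.full(100, 1.0)   # symbol-based vector (TransE, step 1)
e_g = np.full(100, 2.0)   # network-structure-based vector (steps 4-5)
e_d = np.full(100, 3.0)   # description-text-based vector (steps 6-8)

e = np.concatenate([e_s, e_g, e_d])   # e = [e_s : e_g : e_d], 300-dim
```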
step 10: by adopting a TransE learning method, a joint learning entity represents a vector based on symbols, a vector based on a graph structure and a vector based on a descriptive text, and a scoring function is as follows:
f=‖h+r-t‖
h and t respectively represent vector representations of a head entity and a tail entity, and the values of the vector representations are equal to a symbolic-based representation vector, a network structure-based representation vector and a splicing vector of a description text-based representation vector corresponding to the entities; and substituting the scoring function into a loss function of TransE to participate in model training to obtain a final expression vector of the entity.
L = Σ_{(h,r,t)∈S} Σ_{(h′,r′,t′)∈S′} [ε + ‖h + r − t‖ − ‖h′ + r′ − t′‖]_+
where [x]_+ denotes max(0, x) (the larger of 0 and x), ε is a hyperparameter (the margin), h denotes the head-entity vector of a triple, t the tail-entity vector, r the relation representation vector, and h′, r′ and t′ the head-entity, relation and tail-entity representation vectors of a negative-example triple. S′ denotes the set of negative-example triples and is obtained through the following formula:
S′={(h′,r,t)|h′∈E}∪{(h,r,t′)|t′∈E}∪{(h,r′,t)|r′∈R}
where E and R denote the sets of entities and relations in the knowledge base, respectively. The negative sampling of entities and relations for a triple randomly selects an entity or relation from the knowledge base other than the one in the current triple to replace it; for entity negative sampling, only one of the head entity or the tail entity is replaced each time to form a negative-example triple.
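The negative sampling rule above (corrupt exactly one of head, tail or relation) can be sketched as follows; the entity and relation ids are hypothetical:

```python
import random

def corrupt(triple, entities, relations, rng=None):
    """Build one negative-example triple by replacing the head, the tail, or
    the relation with a random other element; never both entities at once."""
    rng = rng or random.Random(0)
    h, r, t = triple
    slot = rng.choice(["head", "tail", "relation"])
    if slot == "head":
        h = rng.choice([e for e in entities if e != h])
    elif slot == "tail":
        t = rng.choice([e for e in entities if e != t])
    else:
        r = rng.choice([x for x in relations if x != r])
    return (h, r, t)

E = ["e1", "e2", "e3", "e4"]
R = ["r1", "r2"]
neg = corrupt(("e1", "r1", "e2"), E, R)
```

Each sampled negative differs from the positive triple in exactly one position, matching the definition of S′ above.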
Finally, the representation vectors of the entities and relations in the knowledge graph obtained in step 10 are applied to a knowledge graph completion task to verify the effect of the model.
FIG. 1 shows the overall framework for jointly combining an entity's symbol-based representation vector, graph-structure-based representation vector and description-text-based representation vector. The text description vector is obtained by feeding the word vectors of the entity's text description into a text encoder (Encoder); x_1, x_2, …, x_N denote the word vectors of the words in the entity's text description. The entity structure vector is obtained by training the symbol-based representation learning model TransE (which also yields the relation vector), and the relative-position structure information vector (the Network vector) is obtained through the DeepWalk network embedding used in step 2 of the implementation. The model takes h + r ≈ t as its learning target, the vector representation of the entity in the vector space is obtained by weighted summation of the three vectors, and the joint representation is trained with the objective function given in step 10.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A knowledge graph representation learning method based on structural information and text description is characterized by comprising the following steps:
step 1: acquiring triple information from a preset knowledge base, wherein each piece of triple information comprises a head entity, a relation and a tail entity; processing the acquired triple information with a TransE learning method to obtain the representation vectors of the head entity, relation and tail entity in each triple, which together form a symbol-based representation vector;
step 2: storing each symbol-based representation vector obtained in step 1 in a database and establishing a corresponding index;
step 3: querying each entity in the preset knowledge base in turn to obtain the entities corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
step 4: obtaining the entity set corresponding to each queried entity from step 3, and for each queried entity: querying the ids of all entities in the corresponding entity set in the preset knowledge base, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
step 5: learning a network-structure-based representation vector with a skip-gram model from each entity id sequence obtained in step 4;
step 6: preprocessing the description texts of all entities in the preset knowledge base;
step 7: generating word vectors for the preprocessed description texts of the entities with the CBOW method of word2vec, obtaining representation word vectors;
step 8: for each representation word vector obtained in step 7: using the representation word vectors as the input of a CNN encoder with two convolution layers and two pooling layers, and learning a description-text-based representation vector for each entity;
step 9: concatenating the symbol-based, network-structure-based and description-text-based representation vectors of the same entity to obtain the concatenated vector of each entity;
step 10: learning the concatenated vector of each entity with a TransE learning method to obtain the final representation vector of each entity.
2. The method for learning representation of knowledge graph based on structural information and text description as claimed in claim 1, wherein a subset FB15k of Freebase is used as the predetermined knowledge base.
3. The method as claimed in claim 1, wherein the preprocessing in step 6 includes removing stop words, and connecting entity names composed of a plurality of characters as a word.
4. The method as claimed in claim 1, wherein in the step 7 of generating the word vectors, the dimension size, min-count and sliding window value of each word vector need to be set.
CN201811011812.0A 2018-08-31 2018-08-31 Knowledge graph representation learning method based on structural information and text description Active CN109299284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811011812.0A CN109299284B (en) 2018-08-31 2018-08-31 Knowledge graph representation learning method based on structural information and text description


Publications (2)

Publication Number Publication Date
CN109299284A CN109299284A (en) 2019-02-01
CN109299284B true CN109299284B (en) 2021-07-20

Family

ID=65165826


Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918162B (en) * 2019-02-28 2021-11-02 集智学园(北京)科技有限公司 High-dimensional graph interactive display method for learnable mass information
CN109871542B (en) * 2019-03-08 2024-03-08 广东工业大学 Text knowledge extraction method, device, equipment and storage medium
CN109992673A (en) * 2019-04-10 2019-07-09 广东工业大学 A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing
CN110119355B (en) * 2019-04-25 2022-10-28 天津大学 Knowledge graph vectorization reasoning general software defect modeling method
CN112559734B (en) * 2019-09-26 2023-10-17 中国科学技术信息研究所 Brief report generating method, brief report generating device, electronic equipment and computer readable storage medium
CN110851620B (en) * 2019-10-29 2023-07-04 天津大学 Knowledge representation method based on text embedding and structure embedding combination
CN111046187B (en) * 2019-11-13 2023-04-18 山东财经大学 Sample knowledge graph relation learning method and system based on confrontation type attention mechanism
CN110955764B (en) * 2019-11-19 2021-04-06 百度在线网络技术(北京)有限公司 Scene knowledge graph generation method, man-machine conversation method and related equipment
CN111198950B (en) * 2019-12-24 2021-10-15 浙江工业大学 Knowledge graph representation learning method based on semantic vector
CN111680163A (en) * 2020-04-21 2020-09-18 国网内蒙古东部电力有限公司 Knowledge graph visualization method for electric power scientific and technological achievements
CN111581392B (en) * 2020-04-28 2022-07-05 电子科技大学 Automatic composition scoring calculation method based on statement communication degree
CN112288091B (en) * 2020-10-30 2023-03-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Knowledge inference method based on multi-mode knowledge graph
CN112541589B (en) * 2020-12-21 2022-10-14 福州大学 Text knowledge embedding method based on AHE alignment hyperplane
CN113032415B (en) * 2021-03-03 2024-04-19 西北工业大学 Personalized product description generation method based on user preference and knowledge graph
CN112784066B (en) * 2021-03-15 2023-11-03 中国平安人寿保险股份有限公司 Knowledge graph-based information feedback method, device, terminal and storage medium
CN113792544B (en) * 2021-07-06 2023-08-29 中国地质大学(武汉) Text emotion classification method and device considering geospatial distribution
CN113488165B (en) * 2021-07-26 2023-08-22 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium based on knowledge graph
CN114329234A (en) * 2022-03-04 2022-04-12 深圳佑驾创新科技有限公司 Collaborative filtering recommendation method and system based on knowledge graph
CN114817424A (en) * 2022-05-27 2022-07-29 中译语通信息科技(上海)有限公司 Graph characterization method and system based on context information
CN115099504A (en) * 2022-06-29 2022-09-23 中南民族大学 Cultural relic security risk element identification method based on knowledge graph complement model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779085B2 (en) * 2015-05-29 2017-10-03 Oracle International Corporation Multilingual embeddings for natural language processing
CN105824802B (en) * 2016-03-31 2018-10-30 清华大学 It is a kind of to obtain the method and device that knowledge mapping vectorization indicates
JP6862914B2 (en) * 2017-02-28 2021-04-21 富士通株式会社 Analysis program, analysis method and analysis equipment
CN107818164A (en) * 2017-11-02 2018-03-20 东北师范大学 A kind of intelligent answer method and its system


Similar Documents

Publication Publication Date Title
CN109299284B (en) Knowledge graph representation learning method based on structural information and text description
CN109472033B (en) Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN106649715B (en) A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network
CN108763376B (en) Knowledge representation learning method for integrating relationship path, type and entity description information
US8612367B2 (en) Learning similarity function for rare queries
CN111581395A (en) Model fusion triple representation learning system and method based on deep learning
CN103559504A (en) Image target category identification method and device
CN111753101A (en) Knowledge graph representation learning method integrating entity description and type
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN112948546B (en) Intelligent question and answer method and device for multi-source heterogeneous data source
CN111782826A (en) Knowledge graph information processing method, device, equipment and storage medium
CN105631037A (en) Image retrieval method
CN114201684A (en) Knowledge graph-based adaptive learning resource recommendation method and system
CN113821527A (en) Hash code generation method and device, computer equipment and storage medium
CN114780723B (en) Portrayal generation method, system and medium based on guide network text classification
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
Lonij et al. Open-world visual recognition using knowledge graphs
CN114519107A (en) Knowledge graph fusion method combining entity relationship representation
CN109857892A (en) Semi-supervised cross-module state Hash search method based on category transmitting
CN111309930B (en) Medical knowledge graph entity alignment method based on representation learning
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
US20230350913A1 (en) Mapping of unlabeled data onto a target schema via semantic type detection
CN114912458A (en) Emotion analysis method and device and computer readable medium
CN114372454A (en) Text information extraction method, model training method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant