CN109299284B - Knowledge graph representation learning method based on structural information and text description - Google Patents


Info

Publication number: CN109299284B (application CN201811011812.0A; first published as CN109299284A)
Authority: CN (China)
Legal status: Active (granted)
Prior art keywords: entity, vector, representation, description, text
Other languages: Chinese (zh)
Inventors: Yao Hong (姚宏), Li Shengwen (李圣文), Li Qingtao (李清涛), Liu Chao (刘超), Dong Lijun (董理君), Kang Xiaojun (康晓军)
Current and original assignee: China University of Geosciences
Application filed by China University of Geosciences; priority to CN201811011812.0A; published as CN109299284A, granted and published as CN109299284B

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph representation learning method based on structural information and text description, which maps the entities and relations in triples into a low-dimensional continuous real-valued space, with the aim of improving the vector representation of entities in knowledge representation. The text description corresponding to each entity is obtained from the existing knowledge base Freebase; word2vec is used to produce a word vector representation for each description (the mean of the word vectors, or a doc2vec-style sentence vector, can also serve as the description's vector representation), and the word vectors are then used as the input of a CNN text encoder to obtain a description-text-based representation vector for each entity. Weights in a joint representation then evaluate the influence of the symbol-based representation vector, the network-structure-based representation vector and the description-text-based representation vector on the final representation vector of each entity in the knowledge base, completing the fusion of structural and textual information and improving the accuracy of the knowledge graph representation.

Description

Knowledge graph representation learning method based on structural information and text description
Technical Field
The invention particularly relates to a knowledge graph representation learning method based on structural information and text description.
Background
The knowledge graph is an important component of NLP technology in tasks such as intelligent question answering, web search and semantic analysis. Knowledge graphs tend to be large, containing hundreds of millions of entities and billions of facts, yet they are often not complete enough; knowledge graph completion is therefore used to alleviate data sparseness in the knowledge graph. Knowledge graphs are usually represented as a network in which nodes represent entities, edges represent relationships between two entities, and each piece of knowledge is expressed as a triple (head entity, relation, tail entity). With symbolic representations such as triples, designers must devise different graph algorithms for different knowledge graph completion applications, and as the scale of the knowledge graph grows, such computation becomes increasingly infeasible because of poor scalability. In addition, a graph-based KG faces data sparseness and related problems in application and is inconvenient for machine learning, while in the current big-data era machine learning is an indispensable tool for big-data automation and intelligence. Facing these problems, knowledge graph representation learning (also called knowledge graph embedding) has been proposed. Representation learning for knowledge graphs aims to represent the entities and relations of a knowledge graph as dense low-dimensional real-valued vectors, so that entities, relations and the complex associations among them can be computed efficiently in the low-dimensional space; it plays an important role in the construction, reasoning, fusion, mining and application of knowledge graphs.
Existing representation learning mainly comprises distributed-representation-based TransX models and neural-network-based models. Translation-based models perform well in knowledge representation learning; however, most of them consider only the symbolic representation of fact triples in the knowledge graph when projecting vectors, ignoring implicit semantic information in the knowledge graph, so the learned vectors may not accurately represent the semantic relationships the knowledge graph contains. Existing knowledge bases hold a large amount of entity text description information; the text descriptions corresponding to triple entities contain considerable extra semantic information, and combining this text information can provide more accurate semantic representations for entities and helps to discover semantic correlations among different entities. Of course, in existing representation methods that use entity descriptions, simply concatenating the symbol-based triple structure vector and the text description vector cannot accurately determine whether the information from the two sources is reasonable for the entity's final representation in the multi-dimensional vector space. When representation learning is performed, the relations and entities in the knowledge graph are projected into a multi-dimensional vector space; the specific physical meaning of each vector is currently hard to explain, and only relative positions are meaningful.
Consider, for example, triples with text descriptions in the knowledge base Freebase, where the text description corresponding to each entity provides certain semantic information for the entity's representation in its triples. In many knowledge graph representation learning methods, however, symbol-based triple learning considers only the structural information of the triples themselves when processing them, while text-based representation learning methods simply concatenate the structural information vector and the text information vector; the semantic information in the text is thus not used efficiently to improve the entity's reasonable representation in the vector space. Moreover, the relative structure information of the entity in the graph is not added to the entity's representation vector, so some of the entity's information is lost.
Disclosure of Invention
To address the defect that prior knowledge graph representation learning methods consider only the symbolic representation of fact triples and ignore implicit semantic information in the knowledge graph, the technical problem to be solved by the invention is to provide a knowledge graph representation learning method based on structural information and text description.
A knowledge graph representation learning method based on structural information and text description comprises the following steps:
Step 1: acquiring triple information from a preset knowledge base, wherein each piece of triple information comprises a head entity, a relation and a tail entity; processing the acquired triple information with a TransE learning method to obtain the representation vectors of the head entity, relation and tail entity in each triple, which together form a symbol-based representation vector;
Step 2: storing each symbol-based representation vector obtained in step 1 in a database and establishing a corresponding index;
Step 3: querying each entity in the preset knowledge base in turn to obtain the entities corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
Step 4: obtaining the entity set corresponding to each queried entity from step 3, and for each queried entity: querying the ids of all entities in the corresponding entity set in the preset knowledge base, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
Step 5: learning a network-structure-based representation vector with a skip-gram model from each entity id sequence obtained in step 4;
Step 6: preprocessing the description texts of all entities in the preset knowledge base;
Step 7: generating word vectors for the preprocessed description texts of the entities with the CBOW method of word2vec, obtaining representation word vectors;
Step 8: for each representation word vector obtained in step 7: using the representation word vectors as the input of a CNN encoder with two convolution layers and two pooling layers, and learning a description-text-based representation vector for each entity;
Step 9: concatenating the symbol-based, network-structure-based and description-text-based representation vectors of the same entity to obtain the concatenated vector of each entity;
Step 10: learning the concatenated vector of each entity with a TransE learning method to obtain the final representation vector of each entity.
Further, a subset FB15k of Freebase is used as the preset knowledge base.
Further, the preprocessing in step 6 specifically includes removing stop words, and connecting entity names composed of a plurality of characters as a word.
Further, in the process of generating the word vector in step 7, the dimension size, the min-count, and the sliding window value of each word vector need to be set.
The invention proposes adding the network structure to the representation learning of the knowledge graph, aiming to effectively fuse the vectors of three information sources (triple symbolic representation, text description representation and relative network structure) so as to improve the representation quality of each entity vector and provide vector sources containing more semantic and structural information for upper-layer applications, while introducing the network structure information of the entities' relative positions in the knowledge graph into knowledge representation for the first time.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a structural diagram of a knowledge graph representation learning method based on structural information and text description according to the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
A knowledge graph representation learning method based on structural information and text description comprises the following steps:
Step 1: acquiring triple information from the preset knowledge base Freebase, wherein the triple information comprises a head entity, a relation and a tail entity; the representation vectors of the head entity, relation and tail entity are obtained with a TransE learning method, forming a symbol-based representation vector (h, r, t), where h denotes the representation vector of the head entity, r that of the relation and t that of the tail entity, and the dimension of each vector is set to 100; the loss function is as follows:
L = Σ_{(h,r,t)∈S} Σ_{(h′,r′,t′)∈S′} [ε + ‖h + r − t‖ − ‖h′ + r′ − t′‖]_+
where S denotes the set of positive-example triples in the knowledge base and S′ the set of negative-example triples.
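Step 1's objective can be sketched in NumPy. The following is a minimal illustration of the standard TransE margin-based ranking loss with L1 distance, not the patent's actual implementation; the toy embeddings are placeholders:

```python
import numpy as np

def transe_score(h, r, t):
    """Dissimilarity d(h + r, t) under the L1 norm."""
    return np.abs(h + r - t).sum()

def margin_loss(pos, neg, margin=1.0):
    """[margin + d(pos) - d(neg)]_+ for one positive/negative triple pair."""
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))

rng = np.random.default_rng(0)
h, r = rng.normal(size=100), rng.normal(size=100)  # 100-dim vectors, as in step 1
t = h + r                        # an ideal positive triple: d(h + r, t) = 0
t_neg = rng.normal(size=100)     # corrupted tail entity
loss = margin_loss((h, r, t), (h, r, t_neg))
```

In full training the loss is summed over all positive triples and their sampled negatives and minimized by stochastic gradient descent.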
Step 2: storing the symbol-based representation vectors obtained in step 1 in a database and establishing a corresponding index;
Step 3: querying each entity in the preset knowledge base in turn to obtain the entities and relations corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
Step 4: obtaining the entity set corresponding to the queried entity from step 3, querying the ids of all entities in the corresponding entity set in the database, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
Step 5: learning a network-structure-based representation vector with a skip-gram model from the entity id sequence obtained in step 4.
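Steps 4 and 5 correspond to a DeepWalk-style pipeline: uniform random walks over the entity graph produce id sequences, which a skip-gram model then embeds. A minimal sketch of the walk generation (the adjacency structure and entity ids here are hypothetical):

```python
import random

def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=42):
    """Generate entity-id sequences by uniform random walk over the graph;
    the resulting sequences are then fed to a skip-gram model."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1])
                if not neighbors:          # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# hypothetical toy graph keyed by entity id
graph = {"e1": ["e2", "e3"], "e2": ["e1"], "e3": ["e1", "e2"]}
walks = random_walks(graph)
```

Each walk is a sentence-like id sequence, so a standard skip-gram implementation can be trained on it unchanged.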
Step 6: selecting entity description texts from the preset knowledge base and preprocessing them: removing stop words, and joining entity names composed of several words so that each name is treated as a single word;
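The preprocessing in step 6 can be sketched as follows; the stop-word list, the example sentence and the entity names are illustrative only:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and"}  # illustrative subset

def preprocess(description, entity_names):
    """Lowercase the text, join multi-word entity names with '_' so each
    name becomes a single token, and drop stop words (step 6)."""
    text = description.lower()
    for name in entity_names:                              # e.g. "new york"
        text = text.replace(name.lower(), name.lower().replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("New York is a city in the United States",
                    ["New York", "United States"])
```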
Step 7: generating word vectors for the entity description texts preprocessed in step 6 with the CBOW method of word2vec; during model training, the dimension of each word vector is set to 100, and suitable min-count and sliding-window values are set to obtain the representation word vectors.
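To make the CBOW mechanics of step 7 concrete, the following sketch performs a single full-softmax CBOW update in NumPy: the context word vectors are averaged and both weight matrices are nudged toward predicting the center word. This is illustrative only; real word2vec training uses hierarchical softmax or negative sampling, and the vocabulary and word ids below are placeholders:

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.1):
    """One CBOW training step; returns the cross-entropy loss."""
    h = W_in[context_ids].mean(axis=0)        # averaged context vector
    scores = W_out @ h                        # one score per vocabulary word
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax probabilities
    err = p.copy()
    err[center_id] -= 1.0                     # gradient of loss w.r.t. scores
    grad_h = W_out.T @ err                    # gradient w.r.t. hidden vector
    W_out -= lr * np.outer(err, h)
    W_in[context_ids] -= lr * grad_h / len(context_ids)
    return -np.log(p[center_id])

rng = np.random.default_rng(0)
vocab, dim = 20, 100                          # 100-dim vectors, as in step 7
W_in = rng.normal(scale=0.1, size=(vocab, dim))
W_out = rng.normal(scale=0.1, size=(vocab, dim))
losses = [cbow_step(W_in, W_out, [1, 2, 4, 5], 3) for _ in range(50)]
```

After training, the rows of W_in serve as the representation word vectors.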
Step 8: using the representation word vectors obtained in step 7 as the input of a CNN encoder with two convolution layers and two pooling layers, and learning the description-text-based representation vector. Because the CBOW model ignores word order in the text, this patent adopts a CNN as a text encoder on top of the CBOW model to encode the description text of each entity;
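The shape flow of a two-convolution, two-pooling text encoder as in step 8 can be sketched with NumPy. The filter counts, widths and random weights below are assumptions for illustration (in practice the kernels are learned):

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution over a word-vector sequence, with ReLU.
    x: (seq_len, dim); kernels: (n_filters, width, dim) -> (out_len, n_filters)."""
    n_f, width, _ = kernels.shape
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, n_f))
    for i in range(out_len):
        window = x[i:i + width]               # (width, dim)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)               # ReLU

def max_pool(x, size=2):
    """Non-overlapping max pooling along the sequence axis."""
    trimmed = x[: (x.shape[0] // size) * size]
    return trimmed.reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
words = rng.normal(size=(12, 100))            # 12 word vectors, 100-dim
k1 = rng.normal(size=(64, 3, 100))            # layer 1: 64 filters of width 3
h = max_pool(conv1d(words, k1))               # conv + pool, layer 1
k2 = rng.normal(size=(100, 2, 64))            # layer 2: 100 filters of width 2
h = max_pool(conv1d(h, k2))                   # conv + pool, layer 2
entity_desc_vec = h.max(axis=0)               # final 100-dim description vector
```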
Step 9: concatenating the symbol-based representation vector, the network-structure-based representation vector and the description-text-based representation vector to obtain the entity's concatenated vector:
e = [e_s : e_g : e_d]
where e_s denotes the structure vector, e_g the network-structure-based representation vector, e_d the description-text-based representation vector, and e the entity's concatenated vector;
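The concatenation of step 9 is direct; a toy sketch with placeholder 100-dimensional vectors for one entity:

```python
import numpy as np

# placeholder 100-dim vectors for one entity from the three sources
e_s = np.full(100, 1.0)   # symbol-based vector (TransE, step 1)
e_g = np.full(100, 2.0)   # network-structure-based vector (steps 4-5)
e_d = np.full(100, 3.0)   # description-text-based vector (steps 6-8)

e = np.concatenate([e_s, e_g, e_d])   # e = [e_s : e_g : e_d], 300-dim
```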
step 10: by adopting a TransE learning method, a joint learning entity represents a vector based on symbols, a vector based on a graph structure and a vector based on a descriptive text, and a scoring function is as follows:
f=‖h+r-t‖
h and t respectively represent vector representations of a head entity and a tail entity, and the values of the vector representations are equal to a symbolic-based representation vector, a network structure-based representation vector and a splicing vector of a description text-based representation vector corresponding to the entities; and substituting the scoring function into a loss function of TransE to participate in model training to obtain a final expression vector of the entity.
L = Σ_{(h,r,t)∈S} Σ_{(h′,r′,t′)∈S′} [ε + ‖h + r − t‖ − ‖h′ + r′ − t′‖]_+
where [x]_+ denotes max(0, x) (the larger of 0 and x), ε is a hyperparameter (the margin), h denotes the head-entity vector of a triple, t the tail-entity vector, r the relation representation vector, and h′, r′ and t′ the head-entity, relation and tail-entity representation vectors of a negative-example triple. S′ denotes the set of negative-example triples and is obtained through the following formula:
S′={(h′,r,t)|h′∈E}∪{(h,r,t′)|t′∈E}∪{(h,r′,t)|r′∈R}
where E and R denote the sets of entities and relations in the knowledge base, respectively. The negative sampling of entities and relations for a triple randomly selects an entity or relation from the knowledge base other than the one in the current triple to replace it; for entity negative sampling, only one of the head entity or the tail entity is replaced each time to form a negative-example triple.
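The negative sampling rule above (corrupt exactly one of head, tail or relation) can be sketched as follows; the entity and relation ids are hypothetical:

```python
import random

def corrupt(triple, entities, relations, rng=None):
    """Build one negative-example triple by replacing the head, the tail, or
    the relation with a random other element; never both entities at once."""
    rng = rng or random.Random(0)
    h, r, t = triple
    slot = rng.choice(["head", "tail", "relation"])
    if slot == "head":
        h = rng.choice([e for e in entities if e != h])
    elif slot == "tail":
        t = rng.choice([e for e in entities if e != t])
    else:
        r = rng.choice([x for x in relations if x != r])
    return (h, r, t)

E = ["e1", "e2", "e3", "e4"]
R = ["r1", "r2"]
neg = corrupt(("e1", "r1", "e2"), E, R)
```

Each sampled negative differs from the positive triple in exactly one position, matching the definition of S′ above.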
Finally, the representation vectors of the entities and relations in the knowledge graph obtained in step 10 are applied to a knowledge graph completion task to verify the effect of the model.
FIG. 1 shows the overall framework for jointly combining an entity's symbol-based representation vector, graph-structure-based representation vector and description-text-based representation vector. The text description vector is obtained by feeding the word vectors of the entity's text description into a text encoder (Encoder); x_1, x_2, …, x_N denote the word vectors of the words in the entity's text description. The entity structure vector is obtained by training the symbol-based representation learning model TransE (which also yields the relation vector), and the relative-position structure information vector (the Network vector) is obtained through the DeepWalk network embedding used in step 2 of the implementation. The model takes h + r ≈ t as its learning target, the vector representation of the entity in the vector space is obtained by weighted summation of the three vectors, and the joint representation is trained with the objective function given in step 10.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A knowledge graph representation learning method based on structural information and text description is characterized by comprising the following steps:
step 1: acquiring triple information from a preset knowledge base, wherein each piece of triple information comprises a head entity, a relation and a tail entity; processing the acquired triple information with a TransE learning method to obtain the representation vectors of the head entity, relation and tail entity in each triple, which together form a symbol-based representation vector;
step 2: storing each symbol-based representation vector obtained in step 1 in a database and establishing a corresponding index;
step 3: querying each entity in the preset knowledge base in turn to obtain the entities corresponding to the queried entity when it serves as a head entity and as a tail entity, respectively;
step 4: obtaining the entity set corresponding to each queried entity from step 3, and for each queried entity: querying the ids of all entities in the corresponding entity set in the preset knowledge base, visiting the ids of all entities in the set by random walk, and connecting them to form an entity id sequence;
step 5: learning a network-structure-based representation vector with a skip-gram model from each entity id sequence obtained in step 4;
step 6: preprocessing the description texts of all entities in the preset knowledge base;
step 7: generating word vectors for the preprocessed description texts of the entities with the CBOW method of word2vec, obtaining representation word vectors;
step 8: for each representation word vector obtained in step 7: using the representation word vectors as the input of a CNN encoder with two convolution layers and two pooling layers, and learning a description-text-based representation vector for each entity;
step 9: concatenating the symbol-based, network-structure-based and description-text-based representation vectors of the same entity to obtain the concatenated vector of each entity;
step 10: learning the concatenated vector of each entity with a TransE learning method to obtain the final representation vector of each entity.
2. The method for learning representation of knowledge graph based on structural information and text description as claimed in claim 1, wherein a subset FB15k of Freebase is used as the predetermined knowledge base.
3. The method as claimed in claim 1, wherein the preprocessing in step 6 includes removing stop words, and connecting entity names composed of a plurality of characters as a word.
4. The method as claimed in claim 1, wherein in the step 7 of generating the word vectors, the dimension size, min-count and sliding window value of each word vector need to be set.
CN201811011812.0A 2018-08-31 2018-08-31 Knowledge graph representation learning method based on structural information and text description Active CN109299284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811011812.0A CN109299284B (en) 2018-08-31 2018-08-31 Knowledge graph representation learning method based on structural information and text description


Publications (2)

Publication Number Publication Date
CN109299284A CN109299284A (en) 2019-02-01
CN109299284B true CN109299284B (en) 2021-07-20

Family

ID=65165826


Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918162B (en) * 2019-02-28 2021-11-02 集智学园(北京)科技有限公司 High-dimensional graph interactive display method for learnable mass information
CN109871542B (en) * 2019-03-08 2024-03-08 广东工业大学 Text knowledge extraction method, device, equipment and storage medium
CN109992673A (en) * 2019-04-10 2019-07-09 广东工业大学 A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing
CN110119355B (en) * 2019-04-25 2022-10-28 天津大学 Knowledge graph vectorization reasoning general software defect modeling method
CN112559734B (en) * 2019-09-26 2023-10-17 中国科学技术信息研究所 Brief report generating method, brief report generating device, electronic equipment and computer readable storage medium
CN110851620B (en) * 2019-10-29 2023-07-04 天津大学 Knowledge representation method based on text embedding and structure embedding combination
CN111046187B (en) * 2019-11-13 2023-04-18 山东财经大学 Sample knowledge graph relation learning method and system based on confrontation type attention mechanism
CN110955764B (en) * 2019-11-19 2021-04-06 百度在线网络技术(北京)有限公司 Scene knowledge graph generation method, man-machine conversation method and related equipment
CN111198950B (en) * 2019-12-24 2021-10-15 浙江工业大学 Knowledge graph representation learning method based on semantic vector
CN111680163A (en) * 2020-04-21 2020-09-18 国网内蒙古东部电力有限公司 Knowledge graph visualization method for electric power scientific and technological achievements
CN111581392B (en) * 2020-04-28 2022-07-05 电子科技大学 Automatic composition scoring calculation method based on statement communication degree
CN112288091B (en) * 2020-10-30 2023-03-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Knowledge inference method based on multi-mode knowledge graph
CN112541589B (en) * 2020-12-21 2022-10-14 福州大学 Text knowledge embedding method based on AHE alignment hyperplane
CN113032415B (en) * 2021-03-03 2024-04-19 西北工业大学 Personalized product description generation method based on user preference and knowledge graph
CN112784066B (en) * 2021-03-15 2023-11-03 中国平安人寿保险股份有限公司 Knowledge graph-based information feedback method, device, terminal and storage medium
CN113792544B (en) * 2021-07-06 2023-08-29 中国地质大学(武汉) Text emotion classification method and device considering geospatial distribution
CN113488165B (en) * 2021-07-26 2023-08-22 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium based on knowledge graph
CN114329234A (en) * 2022-03-04 2022-04-12 深圳佑驾创新科技有限公司 Collaborative filtering recommendation method and system based on knowledge graph
CN114817424A (en) * 2022-05-27 2022-07-29 中译语通信息科技(上海)有限公司 Graph characterization method and system based on context information
CN115099504A (en) * 2022-06-29 2022-09-23 中南民族大学 Cultural relic security risk element identification method based on knowledge graph complement model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779085B2 (en) * 2015-05-29 2017-10-03 Oracle International Corporation Multilingual embeddings for natural language processing
CN105824802B (en) * 2016-03-31 2018-10-30 清华大学 It is a kind of to obtain the method and device that knowledge mapping vectorization indicates
JP6862914B2 (en) * 2017-02-28 2021-04-21 富士通株式会社 Analysis program, analysis method and analysis equipment
CN107818164A (en) * 2017-11-02 2018-03-20 东北师范大学 A kind of intelligent answer method and its system


Similar Documents

Publication Publication Date Title
CN109299284B (en) Knowledge graph representation learning method based on structural information and text description
CN109472033B (en) Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN106649715B (en) A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network
CN108763376B (en) Knowledge representation learning method for integrating relationship path, type and entity description information
US8612367B2 (en) Learning similarity function for rare queries
CN111581395A (en) Model fusion triple representation learning system and method based on deep learning
CN103559504A (en) Image target category identification method and device
CN111753101A (en) Knowledge graph representation learning method integrating entity description and type
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN112948546B (en) Intelligent question and answer method and device for multi-source heterogeneous data source
CN111782826A (en) Knowledge graph information processing method, device, equipment and storage medium
CN105631037A (en) Image retrieval method
CN114201684A (en) Knowledge graph-based adaptive learning resource recommendation method and system
CN113821527A (en) Hash code generation method and device, computer equipment and storage medium
CN114780723B (en) Portrayal generation method, system and medium based on guide network text classification
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
Lonij et al. Open-world visual recognition using knowledge graphs
CN114519107A (en) Knowledge graph fusion method combining entity relationship representation
CN109857892A (en) Semi-supervised cross-module state Hash search method based on category transmitting
CN111309930B (en) Medical knowledge graph entity alignment method based on representation learning
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
US20230350913A1 (en) Mapping of unlabeled data onto a target schema via semantic type detection
CN114912458A (en) Emotion analysis method and device and computer readable medium
CN114372454A (en) Text information extraction method, model training method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant