CN112765369A - Knowledge graph information representation learning method, system, equipment and terminal - Google Patents

Knowledge graph information representation learning method, system, equipment and terminal

Info

Publication number
CN112765369A
CN112765369A
Authority
CN
China
Prior art keywords
path
entity
model
reliability
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110134685.9A
Other languages
Chinese (zh)
Inventor
易运晖
周小寒
何先灯
权东晓
朱畅华
赵楠
陈南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110134685.9A
Publication of CN112765369A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of knowledge graphs and discloses a knowledge graph information representation learning method, system, device and terminal. The method comprises the following steps: preprocessing according to a path-constrained resource allocation method; calculating the reliability of all paths and outputting it to the training set and test set; initializing the model and setting parameters; generating triples from an iterator and randomly replacing head and tail entities; calculating the loss function of the triple according to the score function; calculating the loss function of the extra paths according to the path reliability; optimizing parameters with the Adam method; and performing model verification through entity prediction and relation prediction. The invention takes into account the rich path information contained in the knowledge graph, which helps improve the modeling of entities and relations; by embedding the vectors into the complex plane and representing vector operations as rotations, the modeling of relations can be optimized. The method can be used in link prediction and recommendation systems.

Description

Knowledge graph information representation learning method, system, equipment and terminal
Technical Field
The invention belongs to the technical field of knowledge graphs, and particularly relates to a knowledge graph information representation learning method, system, device and terminal.
Background
Google proposed the concept of the knowledge graph in 2012, aiming to represent unstructured or semi-structured information on the internet as structured knowledge. With its strong information-processing capacity and open organizational capacity, the knowledge graph provides opportunities for knowledge organization and intelligent applications in the internet era and is widely used in semantic search, personalized recommendation, intelligent question answering, and so on. The knowledge graph evolved from the semantic network and is essentially a directed graph consisting of entities and relations, where each entity is a node of the directed graph, each relation is an edge, and each piece of knowledge is represented as a triple (entity, relation, entity). Because of this directed-graph representation, related research and applications of traditional knowledge graphs are often carried out with graph algorithms and face two problems: on the one hand, large-scale knowledge graphs often suffer from data sparsity, so graph algorithms struggle to achieve good results; on the other hand, graph algorithms have high computational complexity and low efficiency and cannot meet the application requirements of large-scale knowledge graphs.
The emergence of knowledge graph representation learning alleviates the above problems. Its core idea is to represent the entities and relations of the knowledge graph as real-valued vectors in a low-dimensional continuous space and to measure the semantic relations between entities and relations in that space. The vector representations of entities and relations obtained in this way can be used to compute semantic similarity between entities and to predict the relation between two entities, making it easy to extend the method to various knowledge graph research tasks and applications.
In the past decade, a large number of knowledge graphs, such as Freebase, DBpedia and YAGO, have emerged; they store a large number of complex, structured facts about the real world. A typical data model of a knowledge graph is based on RDF (Resource Description Framework), which represents structured facts in the form of (head entity, relation, tail entity) triples, such as (Thomas Alva Edison, invented, electric light). Much work on knowledge graph completion has been based on symbols and logic, but for large-scale knowledge graphs such methods are neither tractable nor sufficiently convergent. With the continuous development of the internet, the scale of knowledge graphs keeps growing, and how to achieve efficient representation and computation on large-scale knowledge graphs has become a crucial problem. With the rapid development of deep learning, representation learning methods have gradually come to prominence, achieved important progress and remarkable performance in many fields, and attracted wide research interest.
In recent years, researchers have proposed a variety of approaches to automatically construct or populate knowledge bases from plain text and have done much work on knowledge graph representation learning. The common goal is to encode entities and the relations between them into a continuous low-dimensional vector space, capturing the complexity and semantics inherent in knowledge graph data. Representation learning can describe latent semantic information and predict new missing factual relations. With such representations, useful information can be extracted and semantic relevance can be computed more conveniently, which in turn supports many related applications. Research on representation learning has flourished, and researchers have proposed a variety of knowledge graph representation learning models. Bordes, Weston et al. proposed the distance-based model SE (structured embedding) in 2011, one of the earliest knowledge graph representation learning models; SE constructs two projection matrices for the head and tail entities of a specific relation, and because it uses two independent matrices it cannot accurately capture the connection between the two entities. In 2014, Bordes, Glorot et al. proposed the semantic matching energy model SME (semantic matching energy), which represents each relation by a vector and lets it interact with the entity vectors through several matrix products and Hadamard products to capture the precise relation between entities and relations, at the cost of more complex operations. Almost all such embedding models require high time complexity and large memory, so they struggle to perform well on large-scale knowledge graphs. Bordes et al. proposed the translation-based model TransE (translating embeddings for modeling multi-relational data), a representative model of knowledge representation learning that treats a relation as a translation operation from the head entity to the tail entity, similar to word-analogy tasks in word embedding and sentence-level relation classification. This idea is derived from word2vec: its authors found that the word vector space has translation-invariant features, such as v(husband) - v(wife) ≈ v(man) - v(woman), where v(x) denotes the word vector learned for word x; that is, the relation between entities corresponds to a translation in the latent feature space. TransE achieves a good balance of computational efficiency and accuracy for large-scale knowledge graph representation learning. TransE states that when (h, r, t) is a correct triple, the embedding vector of the tail entity should be close to the sum of the embedding vectors of the head entity and the relation, and the opposite should hold for negative samples. TransE uses a maximum-margin energy loss function to train the model, improving the ability of representation learning to distinguish positive from negative samples. Negative samples are constructed by randomly replacing the head entity h, the relation r, or the tail entity t of each triple in the dataset with another entity or relation, rather than being generated from scratch.
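For readers unfamiliar with the translation idea, the following minimal Python/NumPy sketch (the embedding values are purely illustrative and not part of the invention) shows how a TransE-style score is computed: the triple is scored by the distance between h + r and t, and a small distance indicates a plausible triple.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE score of a triple: the distance between h + r and t.

    Lower scores indicate more plausible triples; `norm` selects the
    L1 or L2 distance used in the original formulation.
    """
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 4-dimensional embeddings (values are arbitrary).
h = np.array([0.1, 0.3, -0.2, 0.5])
r = np.array([0.4, -0.1, 0.2, 0.0])
t = np.array([0.5, 0.2, 0.0, 0.5])
print(transe_score(h, r, t))   # a small value suggests (h, r, t) is plausible
```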
Since it was proposed, TransE has been widely recognized: it has few parameters, is simple and efficient, has strong interpretability, and overcomes the complexity of traditional training; compared with traditional models, its performance is significantly improved. However, TransE is overly simple: entities and relations are each represented by a single vector, so complex relations such as reflexive, 1-N, N-1 and N-N relations cannot be accurately modeled. Many subsequent studies therefore improve and extend TransE. For example, TransH projects the head entity h and the tail entity t onto the hyperplane corresponding to the relation r and performs the head-to-tail translation on that hyperplane. TransH allows entities to have different representations under different relations and overcomes TransE's limitation in handling reflexive, 1-N, N-1 and N-N relations. TransR projects the head entity h and the tail entity t into the relation space corresponding to the relation r and performs the translation there, modeling entities and relations in different semantic spaces so that various attributes of an entity can be described accurately. TransD captures multiple attribute types of entities and relations simultaneously by defining dynamic mapping matrices. PTransE proposes to model multi-step relation paths for representation learning; exploiting rich path information is very helpful for alleviating data sparsity.
In addition, many studies use knowledge related to the knowledge graph beyond structural information for representation learning. For example, the DKRL (description-aided knowledge representation learning) model introduces the textual descriptions of entities into representation learning, and the IKRL (image-aided knowledge representation learning) model extracts rich visual information from entity images for representation learning. The information sources for representation learning are thus no longer limited to the inherent structural information; using such multi-source heterogeneous information can effectively improve the expressive power of knowledge representation learning and alleviate data sparsity and overfitting.
However, the typical work in current knowledge graph representation learning consists of translation models represented by TransE, which learn vector representations of entities and relations from triple structure information alone. Most existing methods treat the triples as an independent set and do not consider the associations between triples from a graph perspective; on the other hand, most existing techniques model relations imperfectly.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Traditional knowledge graph tasks need to be completed by means of graph algorithms, but large-scale knowledge graphs often suffer from data sparsity, making it hard for graph algorithms to achieve good results; meanwhile, graph algorithms have high computational complexity and low efficiency and cannot meet the application requirements of large-scale knowledge graphs.
(2) Knowledge graph completion based on symbols and logic is neither tractable nor sufficiently convergent for large-scale knowledge graphs; embedding models require high time complexity and large memory, and therefore struggle to perform well on large-scale knowledge graphs.
(3) The distance-based model SE cannot accurately capture the connection between two entities because it uses two independent matrices; the operations of the semantic matching energy model SME are more complex; the complexity of the tensor-based model NTN is very high.
(4) In tensor-based models, as the scale of the knowledge graph keeps growing, the dimensionality of the tensor increases and so does the computational complexity, so tensor-based models cannot perform well in representation learning on large-scale knowledge graphs.
(5) The translation-based model TransE is too simple: entities and relations are represented by single vectors and cannot accurately model complex reflexive, 1-N, N-1 and N-N relations.
(6) At present, most methods treat triples as an independent set, do not consider the associations between triples from the graph perspective, and mostly model relations imperfectly.
The difficulty in solving the above problems and defects is that comprehensively considering every triple increases the time and space complexity of model training, so training takes longer and occupies more memory.
The significance of solving the above problems and defects is that the triples are no longer treated as an independent set, which facilitates exploiting the rich additional information in the knowledge graph during training, so that the trained model is more hierarchical and the semantics it captures are richer.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a knowledge graph information representation learning method, system, device and terminal.
The invention is realized in such a way that a knowledge-graph information representation learning method comprises the following steps: generating a reverse relation, acquiring single-step path information and multi-step path information, and preprocessing a data set; reading a training set and a test set, establishing a model, and training the model; and (5) testing the model.
Further, the knowledge-graph information representation learning method comprises the following steps:
step one, preprocessing according to the path-constrained resource allocation method;
step two, calculating the reliability of all paths and outputting it to the training set and test set;
step three, initializing the model and setting parameters;
step four, generating triples from the iterator and randomly replacing head and tail entities;
step five, calculating the loss function of the triple according to the score function;
step six, calculating the loss function of the extra paths according to the path reliability;
step seven, optimizing parameters using the Adam method;
and step eight, performing model verification using entity prediction and relation prediction.
The key steps of the method are steps one, two and six: the reliability of each path is obtained through the path-constrained resource allocation method and is fully used as a reference during model training, so that the model jointly considers rotation and the extra paths, which benefits modeling.
Further, in the first step, the preprocessing according to the path constraint resource allocation method includes:
(1) generating inverse relationships
Read the file mapping relations to ids, obtain the total number of relations n, and take the new id r + n, obtained by adding the id of relation r to the total number of relations n, as the inverse relation of relation r.
(2) Obtaining single step path information
Read train.txt and text.txt to obtain all correct forward triples, add the inverse relation to every correct triple to generate a new inverse triple, and take the forward and inverse triples together as the set of all correct triples. All head entities are obtained by traversing the triples in a loop, and then the relations of each head entity and the corresponding tail entities are traversed. Through these operations, all entity pairs in the training data are stored in a vector table. Since the single-step path between two entities is not necessarily unique, the probability of every existing path from entity X to entity Y is stored in the vector table.
(3) Obtaining multi-step path information
Through the steps above we have extracted the entity pairs with a single-step path between them. Traverse the head entity e1 in the triples to obtain the relations of every head entity e1 and its tail entity e2. Then use e2 as a head node to find the relations of e2 and tail nodes e3, connect e1 and e3 as a new path, and store the path in the table; for non-unique paths, the probability of each path is stored in the table.
(4) Data set preprocessing
Write the computed path reliability data into a confidence.txt file, and write all path reliabilities corresponding to the triples of the training set and test set into the train_pra.txt and test_pra.txt files for model training.
Further, in step (3), path-constrained resource allocation (PCRA) is adopted: assuming that several paths p lead from the head entity h to the tail entity t, the amount of resource that finally reaches the tail node t is measured to estimate the reliability of p between h and t. The mathematical expression is as follows:
$R_p(m) = \sum_{n \in S_{i-1}(\cdot, m)} \frac{1}{|S_i(n, \cdot)|} R_p(n)$
Starting from the head entity h, the resource flows along the path $S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow \dots \rightarrow S_l$, where $S_0 = \{h\}$ and $t \in S_l$ (note that each $S_i$ is a set; there may be several tail nodes, and $S_l$ is the set of tail nodes). For an entity $m \in S_i$, the set of its direct predecessor nodes is defined as $S_{i-1}(\cdot, m)$, and n is one of these nodes (entities); $S_i(n, \cdot)$ is the set of direct successor nodes of node n (entity). $R_p(n)$ is the resource obtained by entity n. We define $R_p(h) = 1$ for the head node; the value $R_p(t)$ of the final tail node t then represents how much information the path p can transfer from the head node h to the tail node t, i.e., the reliability of the path p. Given the head node h and the tail node t, $R(p \mid h, t) = R_p(t)$.
Further, the reading of the training set and the test set, the building of the model, and the training of the model includes:
(1) reading training set and test set
Read the triples in the train_pra.txt and test_pra.txt files into memory as the correct triple dataset, and at the same time read the reliability of the path corresponding to each triple into memory. Split the triples of the training set according to epochs and store them in an iterator.
(2) Establishing a rotation-based knowledge graph representation learning model that considers path information, with the score function:
$G(h, r, t) = E(h, r, t) + E(h, P, t)$,
$E(h, r, t) = \| \mathbf{h} \circ \mathbf{r} - \mathbf{t} \|$,
$E(h, P, t) = \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, E(h, p, t)$,
where $\circ$ denotes the Hadamard product, i.e., the element-wise product.
(3) Training model
1) Setting parameters: the embedding dimension is set to 1000, the batch size to 512, the learning rate to 0.001, and the number of training steps to 200000.
2) Initializing the model: initialize the entity vectors uniformly, and initialize the relation vectors uniformly between 0 and 2π.
3) Starting to train the model:
the iterator generates a set of correct triples (h, r, t), randomly replaces the head entity (h ', r, t) or the tail entity (h, r, t'), and trains the extra path (p, r) for vector update by computing a loss function l(s), as follows:
$L(S) = \sum_{(h, r, t) \in S} \Big[ L(h, r, t) + \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, L(p, r) \Big]$
L(S) consists of two parts: a rotation-based loss term and a loss term that considers the extra paths; the Adam method is used for parameter optimization and vector updates.
Further, in step (2), the score function of the model consists of two parts: one part is the rotation-based knowledge graph representation model, which considers the embedding score of the triple itself; the other part is the score of the additional path information. Z is a normalization factor and R is the reliability of the relation path p for the entity pair (h, t); the reliability is computed with the PCRA method above. For the term E(h, p, t), which considers the scores of the multi-step paths from h to t, the model tries three operations for composing a path: vector addition, vector multiplication and vector rotation. Since a rotation-based model differs from translation-based models, vector rotation is considered in addition to the traditional vector addition and multiplication; the modulus of each relation is constrained to $|r_i| = 1$, so the model composes different relations by adding or subtracting the phase θ of the relation $e^{i\theta}$, i.e., the composition of a path corresponds to a change of the rotation angle.
Further, in step eight, the performing model verification by using entity prediction and relationship prediction includes:
the entity prediction method comprises the following steps: reading triples of a test set, replacing a head entity and a tail entity for each triplet, calculating a score function for the replaced triples, ranking the obtained results in an ascending order, averaging the rankings of all the triples to obtain an entity prediction average ranking mean rank of the model, and performing frequency calculation on the times of the correct triples in the first 1, the first 5 and the first 10 to obtain hit rates Hits @1, Hits @3 and Hits @ 10. The same is true for relation prediction.
Another object of the present invention is to provide a knowledge-graph information representation learning system to which the knowledge-graph information representation learning method is applied, the knowledge-graph information representation learning system including:
the preprocessing module is used for generating a reverse relation, acquiring single-step path information and multi-step path information, and preprocessing according to a path constraint resource allocation method;
the training set and test set reading module is used for calculating the reliability of all paths and outputting the reliability to the training set and test set;
the model construction module is used for constructing a knowledge graph representation learning model considering path information based on rotation;
the model training module is used for initializing a model and setting parameters; generating a triple according to the iterator, and randomly replacing head and tail entities; calculating a loss function of the triple according to the score function, calculating a loss function of an additional path according to the path reliability, and performing parameter optimization by using an Adam method;
and the model testing module is used for verifying the accuracy of the model through entity prediction and relation prediction.
It is another object of the present invention to provide a computer program product stored on a computer-readable medium, comprising a computer-readable program which, when executed on an electronic device, provides a user input interface to implement the knowledge-graph information representation learning method.
It is another object of the present invention to provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method of knowledge-graph information representation learning.
Combining all the above technical schemes, the advantages and positive effects of the invention are as follows: the knowledge graph information representation learning method provided by the invention can be used in link prediction and recommendation systems. By considering the rich path information in the knowledge graph and embedding the vectors into the complex plane, knowledge representation learning can obtain more additional information while the modeling of complex relations is optimized and the expressive power of the model is improved. The invention considers the rich path information contained in the knowledge graph, which helps improve the modeling of entities and relations, and can optimize the modeling of relations by putting the vectors in the complex plane and representing vector operations as rotations.
The reliability of the paths is calculated with path-constrained resource allocation, so that the information of the extra paths can be incorporated when training the model; when modeling paths, the different relations of the model are composed by adding or subtracting the phase θ of the relation $e^{i\theta}$, i.e., the composition of a path corresponds to a change of the rotation angle.
The method considers the information of the extra paths during model training instead of training on single triples, so the information considered is more complete; by embedding the vectors into the complex plane and modeling with rotations, the representation of composed relations can be optimized. Meanwhile, in the processing of path composition, the invention can also model with an RNN; and in the path-constrained resource algorithm, the reliability calculation of a path can consider not only the relations traversed but also, additionally, the entities traversed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a flow chart of a knowledge-graph information representation learning method provided by an embodiment of the invention.
FIG. 2 is a schematic diagram of a knowledge-graph information representation learning method provided by an embodiment of the invention.
FIG. 3 is a block diagram of a knowledge-graph information representation learning system architecture provided by an embodiment of the present invention;
in the figure: 1. a preprocessing module; 2. a training set and test set reading module; 3. a model building module; 4. a model training module; 5. and a model testing module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a knowledge graph information representation learning method, a knowledge graph information representation learning system, knowledge graph information representation learning equipment and a knowledge graph information representation learning terminal, and the invention is described in detail below by combining the accompanying drawings.
As shown in fig. 1, the knowledge-graph information representation learning method provided by the embodiment of the present invention includes the following steps:
s101, preprocessing according to a path constraint resource allocation method;
s102, calculating the reliability of all paths, and outputting the reliability to a training set and a test set;
s103, initializing a model and setting parameters;
s104, generating a triple according to the iterator, and randomly replacing head and tail entities;
s105, calculating a loss function of the triad according to the score function;
s106, calculating a loss function of the extra path according to the path reliability;
s107, using an Adam method to optimize parameters;
and S108, performing model verification by using entity prediction and relation prediction.
Those skilled in the art can also implement other steps of the knowledge-graph information representation learning method provided by the present invention; the method shown in fig. 1 is only a specific example.
A schematic diagram of a knowledge-graph information representation learning method provided by the embodiment of the invention is shown in fig. 2.
As shown in fig. 3, the knowledge-graph information representation learning system provided by the embodiment of the present invention includes:
the preprocessing module 1 is used for generating a reverse relation, acquiring single-step path information and multi-step path information, and preprocessing according to a path constraint resource allocation method;
a training set and test set reading module 2 for calculating the reliability of all paths and outputting the reliability to a training set and a test set;
the model construction module 3 is used for constructing a knowledge graph representation learning model considering path information based on rotation;
the model training module 4 is used for initializing a model and setting parameters; generating a triple according to the iterator, and randomly replacing head and tail entities; calculating a loss function of the triple according to the score function, calculating a loss function of an additional path according to the path reliability, and performing parameter optimization by using an Adam method;
and the model testing module 5 is used for verifying the accuracy of the model through entity prediction and relation prediction.
The technical solution of the present invention is further described with reference to the following examples.
1. Preprocessing
1.1 generating inverse relationships
Read the file mapping relations to ids, obtain the total number of relations, and take the new id r + n, obtained by adding the id of relation r to the total number of relations n, as the inverse relation of relation r.
1.2 obtaining Single step Path information
Read train.txt and text.txt to obtain all correct forward triples, add the inverse relation to every correct triple to generate a new inverse triple, and take the forward and inverse triples together as the set of all correct triples. All head entities are obtained by traversing the triples in a loop, and then the relations of each head entity and the corresponding tail entities are traversed. Through these operations, all entity pairs in the training data are stored in a vector table. Since the single-step path between two entities is not necessarily unique, the probability of every existing path from entity X to entity Y is stored in the vector table.
1.3 obtaining Multi-step Path information
Through the steps above we have extracted the entity pairs with a single-step path between them. Traverse the head entity e1 in the triples to obtain the relations of every head entity e1 and its tail entity e2. Then use e2 as a head node to find the relations of e2 and tail nodes e3, connect e1 and e3 as a new path, and store the path in the table; similarly, for non-unique paths, the probability of each path is stored in the table.
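The path-collection bookkeeping described above can be sketched as follows (a simplified illustration; the function names, the id scheme for inverse relations, and the toy graph are assumptions made for the example, not the patent's own code). Single-step relations are indexed by head entity, two-step paths are formed by joining on an intermediate entity, and when several paths connect the same pair of entities each path is stored with its relative frequency.

```python
from collections import defaultdict

def build_paths(triples, num_relations):
    """Collect single-step and two-step relation paths between entity pairs.

    `triples` is an iterable of (h, r, t) id tuples; the inverse of relation
    r is represented as r + num_relations, mirroring the id scheme above.
    Returns {(h, t): {path_tuple: probability}}.
    """
    # Add inverse triples so paths can traverse edges in both directions.
    all_triples = list(triples) + [(t, r + num_relations, h) for h, r, t in triples]

    out_edges = defaultdict(list)                    # head -> [(relation, tail), ...]
    for h, r, t in all_triples:
        out_edges[h].append((r, t))

    paths = defaultdict(lambda: defaultdict(int))    # (h, t) -> {path: count}
    for h, r, t in all_triples:                      # single-step paths
        paths[(h, t)][(r,)] += 1
    for e1 in list(out_edges):                       # two-step paths e1 -> e2 -> e3
        for r1, e2 in out_edges[e1]:
            for r2, e3 in out_edges.get(e2, []):
                paths[(e1, e3)][(r1, r2)] += 1

    # Normalise counts to per-pair path probabilities.
    return {pair: {p: c / sum(cs.values()) for p, c in cs.items()}
            for pair, cs in paths.items()}

# Tiny toy graph: 0 -r0-> 1 -r1-> 2
print(build_paths([(0, 0, 1), (1, 1, 2)], num_relations=2))
```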
Some of these paths are unreliable, so the reliability of each path needs to be computed. The method adopts the idea of path-constrained resource allocation (PCRA): assuming that several paths p lead from the head entity h to the tail entity t, the amount of resource that finally reaches the tail node t is measured to estimate the reliability of p between h and t. The mathematical expression is as follows:
$R_p(m) = \sum_{n \in S_{i-1}(\cdot, m)} \frac{1}{|S_i(n, \cdot)|} R_p(n)$
Starting from the head entity h, the resource flows along the path $S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow \dots \rightarrow S_l$, where $S_0 = \{h\}$ and $t \in S_l$ (note that each $S_i$ is a set; there may be several tail nodes, and $S_l$ is the set of tail nodes). For an entity $m \in S_i$, the set of its direct predecessor nodes is defined as $S_{i-1}(\cdot, m)$, and n is one of these nodes (entities); $S_i(n, \cdot)$ is the set of direct successor nodes of node n (entity). $R_p(n)$ is the resource obtained by entity n. We define $R_p(h) = 1$ for the head node; the value $R_p(t)$ of the final tail node t then represents how much information the path p can transfer from the head node h to the tail node t, i.e., the reliability of the path p. Given the head node h and the tail node t, $R(p \mid h, t) = R_p(t)$.
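A simplified reading of this resource-allocation rule can be sketched as follows (function names and the toy graph are illustrative assumptions): the resource starts at the head entity with value 1, each node splits its resource evenly among the successors reached by the relation of the current step, and the resource arriving at t is taken as the reliability R(p | h, t).

```python
from collections import defaultdict

def pcra_reliability(head, tail, relation_path, triples):
    """Path-constrained resource allocation (simplified sketch).

    Resource starts at the head entity with value 1 and, at every step,
    each node splits its resource evenly among the successors reached by
    the current relation. The resource accumulated at `tail` is returned
    as the reliability R(p | h, t).
    """
    successors = defaultdict(list)                # (entity, relation) -> [next entities]
    for h, r, t in triples:
        successors[(h, r)].append(t)

    resource = {head: 1.0}                        # R_p on S_0 = {h}
    for rel in relation_path:
        next_resource = defaultdict(float)
        for node, res in resource.items():
            nexts = successors.get((node, rel), [])
            if nexts:                             # split evenly among |S_i(n, .)| successors
                share = res / len(nexts)
                for nxt in nexts:
                    next_resource[nxt] += share
        resource = dict(next_resource)
    return resource.get(tail, 0.0)

# Toy graph: 0 -r0-> {1, 2}, 1 -r1-> 3, 2 -r1-> 3  => path (r0, r1) from 0 to 3
triples = [(0, 0, 1), (0, 0, 2), (1, 1, 3), (2, 1, 3)]
print(pcra_reliability(0, 3, (0, 1), triples))    # 0.5 + 0.5 = 1.0
```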
1.4 dataset preprocessing
Write the computed path reliability data into a confidence.txt file, and write all path reliabilities corresponding to the triples of the training set and test set into the train_pra.txt and test_pra.txt files for model training.
2. Model training
2.1 reading training and test sets
Read the triples in the train_pra.txt and test_pra.txt files into memory as the correct triple dataset, and at the same time read the reliability of the path corresponding to each triple into memory. Split the triples of the training set according to epochs and store them in an iterator.
2.2 modeling
The model of the method is a rotation-based knowledge graph representation learning model considering path information. The scoring function is:
$G(h, r, t) = E(h, r, t) + E(h, P, t)$,
$E(h, r, t) = \| \mathbf{h} \circ \mathbf{r} - \mathbf{t} \|$,
$E(h, P, t) = \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, E(h, p, t)$,
where $\circ$ denotes the Hadamard product (element-wise product).
The score function of the model consists of two parts: one part is the rotation-based knowledge graph representation model, which considers the embedding score of the triple itself; the other part is the score of the additional path information. Z is a normalization factor and R is the reliability of the relation path p for the entity pair (h, t); the reliability is computed with the PCRA method above. For the term E(h, p, t), which considers the scores of the multi-step paths from h to t, the model tries three operations for composing a path: vector addition, vector multiplication and vector rotation. Since a rotation-based model differs from translation-based models, vector rotation is considered in addition to the traditional vector addition and multiplication; the modulus of each relation is constrained to $|r_i| = 1$, so the model composes different relations by adding or subtracting the phase θ of the relation $e^{i\theta}$, i.e., the composition of a path corresponds to a change of the rotation angle.
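As one possible concrete form of this rotation-based scoring (an interpretation of the description above, not the patent's released code; tensor shapes and values are illustrative), the following PyTorch sketch rotates the head entity by the relation phase for a single triple and composes a relation path by summing phases, so that combining paths indeed corresponds to adding or subtracting rotation angles.

```python
import math
import torch

def rotate_score(h_re, h_im, phase_r, t_re, t_im):
    """Rotation-based triple score ||h ∘ r − t|| with |r| = 1 (element-wise complex rotation)."""
    r_re, r_im = torch.cos(phase_r), torch.sin(phase_r)
    rot_re = h_re * r_re - h_im * r_im            # complex Hadamard product h ∘ r
    rot_im = h_re * r_im + h_im * r_re
    return torch.sqrt((rot_re - t_re) ** 2 + (rot_im - t_im) ** 2).sum(dim=-1)

def path_score(h_re, h_im, phases_along_path, t_re, t_im):
    """Score of a multi-step path: relations are composed by summing their phases."""
    composed_phase = torch.stack(list(phases_along_path), dim=0).sum(dim=0)
    return rotate_score(h_re, h_im, composed_phase, t_re, t_im)

# Toy example with embedding dimension 4 (all values arbitrary).
dim = 4
h_re, h_im = torch.randn(dim), torch.randn(dim)
t_re, t_im = torch.randn(dim), torch.randn(dim)
phase_r1 = torch.rand(dim) * 2 * math.pi
phase_r2 = torch.rand(dim) * 2 * math.pi
print(rotate_score(h_re, h_im, phase_r1, t_re, t_im))
print(path_score(h_re, h_im, [phase_r1, phase_r2], t_re, t_im))
```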
2.3 training models
2.3.1 setting parameters
The embedding dimension is set to 1000, the batch size to 512, the learning rate to 0.001, and the number of training steps to 200000.
2.3.2 initializing models
Initialize the entity vectors uniformly, and initialize the relation vectors uniformly between 0 and 2π.
2.3.3 Start training model
The iterator generates a batch of correct triples (h, r, t); the head entity is randomly replaced to give (h', r, t) or the tail entity to give (h, r, t'); and the extra paths (p, r) are trained by computing the loss function L(S) for vector updates, as follows:
$L(S) = \sum_{(h, r, t) \in S} \Big[ L(h, r, t) + \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, L(p, r) \Big]$
L(S) consists of two parts: a rotation-based loss term and a loss term that considers the extra paths; the Adam method is used for parameter optimization and vector updates.
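A condensed training-loop sketch under the settings listed above is given below (the embedding dimension, batch size, learning rate and dataset sizes are taken from the text; the margin value, the exact form of the reliability-weighted path term, and all helper names are assumptions made for illustration, not the patent's own code). In practice the corrupted triples would also be filtered against the known correct triples before being used as negatives, and the path term would loop over the paths stored for each (h, t) pair in train_pra.txt.

```python
import math
import torch

num_entities, num_relations, dim = 14951, 1345, 1000   # FB15k sizes from the text
gamma = 12.0                                            # margin, an assumed value

ent_re = torch.nn.Parameter(torch.empty(num_entities, dim).uniform_(-0.5, 0.5))
ent_im = torch.nn.Parameter(torch.empty(num_entities, dim).uniform_(-0.5, 0.5))
rel_phase = torch.nn.Parameter(torch.empty(num_relations, dim).uniform_(0, 2 * math.pi))

optimizer = torch.optim.Adam([ent_re, ent_im, rel_phase], lr=0.001)

def triple_score(h, r, t):
    """Rotation score ||h ∘ r − t|| for batches of entity/relation ids."""
    r_re, r_im = torch.cos(rel_phase[r]), torch.sin(rel_phase[r])
    rot_re = ent_re[h] * r_re - ent_im[h] * r_im
    rot_im = ent_re[h] * r_im + ent_im[h] * r_re
    return torch.sqrt((rot_re - ent_re[t]) ** 2 + (rot_im - ent_im[t]) ** 2).sum(-1)

def train_step(h, r, t, paths):
    """One step: margin loss on the triples plus a reliability-weighted path term."""
    h_neg = torch.randint(0, num_entities, h.shape)     # corrupt the head entity
    t_neg = torch.randint(0, num_entities, t.shape)     # corrupt the tail entity
    loss = (torch.relu(gamma + triple_score(h, r, t) - triple_score(h_neg, r, t)).mean()
            + torch.relu(gamma + triple_score(h, r, t) - triple_score(h, r, t_neg)).mean())
    for path_rels, reliability in paths:                # extra-path term, weighted by R(p|h,t)
        composed = rel_phase[path_rels].sum(0, keepdim=True).expand(len(h), -1)
        gap = torch.sqrt((torch.cos(composed) - torch.cos(rel_phase[r])) ** 2
                         + (torch.sin(composed) - torch.sin(rel_phase[r])) ** 2).sum(-1)
        loss = loss + reliability * gap.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One toy batch (batch size is 512 in the text; 4 here for brevity).
h = torch.tensor([0, 1, 2, 3]); r = torch.tensor([0, 1, 2, 3]); t = torch.tensor([4, 5, 6, 7])
print(train_step(h, r, t, paths=[(torch.tensor([0, 1]), 0.8)]))
```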
3. Model testing
The accuracy of the model is verified through entity prediction and relation prediction. The entity prediction method is as follows: read the triples of the test set; for each triple, replace the head entity and the tail entity; compute the score function for the replaced triples and rank the results in ascending order; average the ranks of all triples to obtain the entity-prediction mean rank of the model; and count how often the correct triple appears in the top 1, top 3 and top 10 to obtain the hit rates Hits@1, Hits@3 and Hits@10. Relation prediction is done in the same way.
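The entity-prediction protocol can be sketched as follows (a raw-ranking illustration; `score_fn` stands in for the trained score function, lower scores are assumed better, and the toy score function is purely illustrative):

```python
import numpy as np

def entity_prediction(test_triples, num_entities, score_fn, hits_at=(1, 3, 10)):
    """Mean rank and Hits@N for tail-entity prediction (head prediction is symmetric)."""
    ranks = []
    for h, r, t in test_triples:
        # Score every candidate tail entity and rank candidates in ascending order.
        scores = np.array([score_fn(h, r, cand) for cand in range(num_entities)])
        rank = int(np.argsort(scores).tolist().index(t)) + 1
        ranks.append(rank)
    ranks = np.array(ranks)
    return {"mean_rank": ranks.mean(),
            **{f"hits@{k}": (ranks <= k).mean() for k in hits_at}}

# Toy score function: pretend entity (h + r) mod num_entities is always correct.
toy_score = lambda h, r, cand: abs((h + r) % 5 - cand)
print(entity_prediction([(0, 1, 1), (2, 2, 4)], num_entities=5, score_fn=toy_score))
```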
4. Key points and points to be protected of the invention
(1) The reliability of the paths is calculated with path-constrained resource allocation, so that the model can incorporate the information of the extra paths during training.
(2) In the modeling of paths, the different relations of the model are composed by adding or subtracting the phase θ of the relation $e^{i\theta}$, i.e., the composition of a path corresponds to a change of the rotation angle.
5. Advantages of the invention
(1) Model training considers the information of the extra paths instead of training on single triples, so the information considered is more complete.
(2) By embedding the vectors into the complex plane and modeling with rotations, the representation of composed relations can be optimized.
6. Alternatives
(1) In the processing of path composition, modeling can also be done with an RNN.
(2) In the path-constrained resource algorithm, the reliability calculation of a path can consider not only the relations traversed but also, additionally, the entities traversed.
The technical effects of the present invention will be described in detail with reference to simulations.
1 simulation Condition
The computer used in the simulation experiments is configured as follows: the processor is an Intel(R) Core(TM) i7-7700 CPU, the graphics card is an NVIDIA GeForce GTX 1080Ti with 11 GB of video memory, the operating system is Ubuntu 20.04.1 LTS, and the PyTorch deep learning framework is used to implement the simulation experiments.
2 data set
FB15k: the Freebase FB15k dataset is a set of (head entity, relation type, tail entity) triples extracted from Freebase (http://www.freebase.com). There are 14951 entities and 1345 relation types. The training set contains 483142 triples, the validation set 50000, and the test set 59071. All triples are unique, and the entities that appear in the validation set and the test set also appear in the training set.
3, simulation content and result analysis:
The training set is fed to the model for learning and the validation set is used for verification; the samples in the test set are then fed to the trained model for testing. The final results are as follows:
[Table: link prediction results on FB15k (mean rank and Hits@N) for the proposed method; the numerical values appear only in the original figures and are not reproduced here.]
It can be seen that the method of the invention is effective to a certain extent.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used wholly or partially, the implementation may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)).
The above description is only for the purpose of illustrating the present invention and is not intended to limit its scope; the invention is intended to cover all modifications, equivalents and improvements within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A knowledge-graph information representation learning method, characterized in that the knowledge-graph information representation learning method comprises: preprocessing according to a path-constrained resource allocation method; calculating the reliability of all paths and outputting it to the training set and test set; initializing the model and setting parameters; generating triples from an iterator and randomly replacing head and tail entities; calculating the loss function of the triple according to the score function; calculating the loss function of the extra paths according to the path reliability; performing parameter optimization using the Adam method; and performing model verification using entity prediction and relation prediction.
2. The method of knowledge-graph information representation learning of claim 1 wherein said preprocessing according to a path constrained resource allocation method comprises:
(1) generating the inverse relation: reading the file mapping relations to ids, obtaining the total number of relations, and taking the new id r + n, obtained by adding the id of relation r to the total number of relations n, as the inverse relation of relation r;
(2) acquiring single-step path information: reading train.txt and text.txt to obtain all correct forward triples, adding the inverse relation to every correct triple to generate new inverse triples, and taking the forward and inverse triples together as all correct triples; obtaining all head entities by traversing the triples in a loop, then traversing the relations of each head entity and the corresponding tail entities; storing all entity pairs in the training data in a vector table through these operations; and storing the probability of every existing path from entity X to entity Y in the vector table;
(3) acquiring multi-step path information: extracting, through the above steps, the entity pairs with a single-step path between them; traversing the head entity e1 in the triples to obtain the relations of every head entity e1 and its tail entity e2; then using e2 as a head node to find the relations of e2 and tail nodes e3, connecting e1 and e3 as a new path and storing the path in the table; and, for non-unique paths, storing the probability of each path in the table;
(4) preprocessing the dataset: writing the computed path reliability data into a confidence.txt file, and writing all path reliabilities corresponding to the triples of the training set and test set into the train_pra.txt and test_pra.txt files for model training.
3. The knowledge-graph information representation learning method of claim 2, wherein in step (3), path-constrained resource allocation (PCRA) is adopted: assuming that several paths p lead from the head entity h to the tail entity t, the amount of resource that finally reaches the tail node t is measured to estimate the reliability of p between h and t, with the mathematical expression:
$R_p(m) = \sum_{n \in S_{i-1}(\cdot, m)} \frac{1}{|S_i(n, \cdot)|} R_p(n)$
starting from the head entity h, the resource flows along the path $S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow \dots \rightarrow S_l$, where $S_0 = \{h\}$ and $t \in S_l$; each $S_i$ is a set, there may be several tail nodes, and $S_l$ is the set of tail nodes; for an entity $m \in S_i$, the set of its direct predecessor nodes is defined as $S_{i-1}(\cdot, m)$, and n is one of these nodes (entities); $S_i(n, \cdot)$ is the set of direct successor nodes of node n (entity); $R_p(n)$ is the resource obtained by entity n; the head node is defined with $R_p(h) = 1$, and the value $R_p(t)$ of the final tail node t represents how much information the path p can transfer from the head node h to the tail node t, i.e., the reliability of the path p; given the head node h and the tail node t, $R(p \mid h, t) = R_p(t)$.
4. The knowledge-graph information representation learning method of claim 1, wherein reading the training set and test set, building the model and training the model comprise:
(1) reading training set and test set
reading the triples in the train_pra.txt and test_pra.txt files into memory as the correct triple dataset, and at the same time reading the reliability of the path corresponding to each triple into memory; splitting the triples of the training set according to epochs and storing them in an iterator;
(2) establishing a rotation-based knowledge graph representation learning model that considers path information, with the score function:
$G(h, r, t) = E(h, r, t) + E(h, P, t)$,
$E(h, r, t) = \| \mathbf{h} \circ \mathbf{r} - \mathbf{t} \|$,
$E(h, P, t) = \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, E(h, p, t)$,
where $\circ$ denotes the Hadamard product, i.e., the element-wise product;
(3) training model
1) setting parameters: setting the embedding dimension to 1000, the batch size to 512, the learning rate to 0.001 and the number of training steps to 200000;
2) initializing the model: initializing the entity vectors uniformly, and initializing the relation vectors uniformly between 0 and 2π;
3) starting to train the model:
the iterator generates a set of correct triples (h, r, t), randomly replaces the head entity (h ', r, t) or the tail entity (h, r, t'), and trains the extra path (p, r) for vector update by computing a loss function l(s), as follows:
$L(S) = \sum_{(h, r, t) \in S} \Big[ L(h, r, t) + \frac{1}{Z} \sum_{p \in P(h, t)} R(p \mid h, t)\, L(p, r) \Big]$,
where $L(h, r, t)$ is the rotation-based margin loss of the triple and $L(p, r)$ is the loss of the extra path (p, r);
L(S) consists of two parts: a rotation-based loss term and a loss term that considers the extra paths; the Adam method is used for parameter optimization and vector updates.
5. The knowledge-graph information representation learning method of claim 4, wherein in step (2), the score function of the model consists of two parts: one part is the rotation-based knowledge graph representation model, which considers the embedding score of the triple itself; the other part considers the score of the additional path information; Z is a normalization factor and R is the reliability of the relation path p for the entity pair (h, t), the reliability being computed with the PCRA above; for the term E(h, p, t), which considers the scores of the multi-step paths from h to t, the model tries three operations for composing a path: vector addition, vector multiplication and vector rotation; since a rotation-based model differs from translation-based models, vector rotation is considered in addition to the traditional vector addition and multiplication; the modulus of each relation is constrained to $|r_i| = 1$, so the model composes different relations by adding or subtracting the phase θ of the relation $e^{i\theta}$, i.e., the composition of a path corresponds to a change of the rotation angle.
6. The knowledge-graph information representation learning method of claim 1, wherein the model verification using entity prediction and relation prediction comprises: the entity prediction method is as follows: reading the triples of the test set; for each triple, replacing the head entity and the tail entity; computing the score function for the replaced triples and ranking the results in ascending order; averaging the ranks of all triples to obtain the entity-prediction mean rank of the model; and counting how often the correct triple appears in the top 1, top 3 and top 10 to obtain the hit rates Hits@1, Hits@3 and Hits@10; relation prediction is done in the same way.
7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of: preprocessing according to a path-constrained resource allocation method; calculating the reliability of all paths and outputting it to the training set and test set; initializing the model and setting parameters; generating triples from an iterator and randomly replacing head and tail entities; calculating the loss function of the triple according to the score function; calculating the loss function of the extra paths according to the path reliability; performing parameter optimization using the Adam method; and performing model verification using entity prediction and relation prediction.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: preprocessing according to a path-constrained resource allocation method; calculating the reliability of all paths and outputting it to the training set and test set; initializing the model and setting parameters; generating triples from an iterator and randomly replacing head and tail entities; calculating the loss function of the triple according to the score function; calculating the loss function of the extra paths according to the path reliability; performing parameter optimization using the Adam method; and performing model verification using entity prediction and relation prediction.
9. A knowledge-graph information data processing terminal, characterized in that the knowledge-graph information data processing terminal is used for implementing the knowledge-graph information representation learning method of any one of claims 1 to 6.
10. A knowledge-graph information representation learning system to which the knowledge-graph information representation learning method according to any one of claims 1 to 6 is applied, the knowledge-graph information representation learning system comprising:
the preprocessing module is used for generating a reverse relation, acquiring single-step path information and multi-step path information, and preprocessing according to a path constraint resource allocation method;
the training set and test set reading module is used for calculating the reliability of all paths and outputting the reliability to the training set and test set;
the model construction module is used for constructing a knowledge graph representation learning model considering path information based on rotation;
the model training module is used for initializing a model and setting parameters; generating a triple according to the iterator, and randomly replacing head and tail entities; calculating a loss function of the triple according to the score function, calculating a loss function of an additional path according to the path reliability, and performing parameter optimization by using an Adam method;
and the model testing module is used for verifying the accuracy of the model through entity prediction and relation prediction.
CN202110134685.9A 2021-01-31 2021-01-31 Knowledge graph information representation learning method, system, equipment and terminal Pending CN112765369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110134685.9A CN112765369A (en) 2021-01-31 2021-01-31 Knowledge graph information representation learning method, system, equipment and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110134685.9A CN112765369A (en) 2021-01-31 2021-01-31 Knowledge graph information representation learning method, system, equipment and terminal

Publications (1)

Publication Number Publication Date
CN112765369A 2021-05-07

Family

ID=75704531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110134685.9A Pending CN112765369A (en) 2021-01-31 2021-01-31 Knowledge graph information representation learning method, system, equipment and terminal

Country Status (1)

Country Link
CN (1) CN112765369A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190691A (en) * 2021-05-28 2021-07-30 齐鲁工业大学 Link prediction method and system of knowledge graph
CN113190691B (en) * 2021-05-28 2022-09-23 齐鲁工业大学 Link prediction method and system of knowledge graph
CN113360670A (en) * 2021-06-09 2021-09-07 山东大学 Knowledge graph completion method and system based on fact context
CN113360670B (en) * 2021-06-09 2022-06-17 山东大学 Knowledge graph completion method and system based on fact context
CN113312854A (en) * 2021-07-19 2021-08-27 成都数之联科技有限公司 Type selection recommendation method and device, electronic equipment and readable storage medium
CN113901151A (en) * 2021-09-30 2022-01-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for relationship extraction
CN113836321A (en) * 2021-11-30 2021-12-24 北京富通东方科技有限公司 Method and device for generating medical knowledge representation
WO2023130960A1 (en) * 2022-01-07 2023-07-13 中国电信股份有限公司 Service resource determination method and apparatus, and service resource determination system

Similar Documents

Publication Publication Date Title
CN111506714B (en) Question answering based on knowledge graph embedding
CN112765369A (en) Knowledge graph information representation learning method, system, equipment and terminal
Cifariello et al. Wiser: A semantic approach for expert finding in academia based on entity linking
CN113535984B (en) Knowledge graph relation prediction method and device based on attention mechanism
US20190354878A1 (en) Concept Analysis Operations Utilizing Accelerators
Bordes et al. A semantic matching energy function for learning with multi-relational data: Application to word-sense disambiguation
Souravlas et al. A classification of community detection methods in social networks: a survey
Qiang et al. Short text clustering based on Pitman-Yor process mixture model
WO2020198855A1 (en) Method and system for mapping text phrases to a taxonomy
JP5881048B2 (en) Information processing system and information processing method
Liu High performance latent dirichlet allocation for text mining
Wu et al. Online fast adaptive low-rank similarity learning for cross-modal retrieval
Chen et al. Conna: Addressing name disambiguation on the fly
Cao et al. Relmkg: reasoning with pre-trained language models and knowledge graphs for complex question answering
Qiu et al. Chinese Microblog Sentiment Detection Based on CNN‐BiGRU and Multihead Attention Mechanism
Chen et al. Affinity regularized non-negative matrix factorization for lifelong topic modeling
Lu et al. Sentiment analysis method of network text based on improved AT-BiGRU model
Fan et al. A heterogeneous graph neural network with attribute enhancement and structure-aware attention
CN115878761B (en) Event context generation method, device and medium
Wang et al. More: A metric learning based framework for open-domain relation extraction
CN114365122A (en) Learning interpretable relationships between entities, relational terms, and concepts through bayesian structure learning of open domain facts
Jian et al. Retrieval Contrastive Learning for Aspect-Level Sentiment Classification
CN110275957B (en) Name disambiguation method and device, electronic equipment and computer readable storage medium
Sun et al. PSLDA: a novel supervised pseudo document-based topic model for short texts
Zhang et al. Zero-shot fine-grained entity typing in information security based on ontology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210507)