CN116186278A

CN116186278A - Knowledge graph completion method based on hyperplane projection and relational path neighborhood

Info

Publication number: CN116186278A
Application number: CN202211648882.3A
Authority: CN
Inventors: 韩亚丹; 陆广泉
Original assignee: Guangxi Normal University
Current assignee: Guangxi Normal University
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-05-30

Abstract

The invention discloses a knowledge graph completion method based on hyperplane projection and relation path neighborhood, which comprises the following steps: 1) embedding the knowledge graph using the structural information of the triplets; 2) adding neighborhood information of the paths; 3) adding mapping attributes of the relations; 4) designing the scoring function of TransH-RPN; 5) during model training, replacing head and tail entities with a probability method and selecting the replacement entities according to entity similarity; 6) link prediction for knowledge graph completion based on hyperplane projection and relation path neighborhood; 7) triplet classification for knowledge graph completion based on hyperplane projection and relation path neighborhood. The method adds mapping attributes of relations on the basis of the TransH model and, by incorporating the neighborhood information of paths, models the path neighborhood of a large-scale knowledge graph, thereby improving the representation learning capability of the model and the effect of knowledge graph completion.

Description

Knowledge graph completion method based on hyperplane projection and relational path neighborhood

Technical Field

The invention belongs to the technical field of knowledge representation learning and knowledge graph completion, and particularly relates to a knowledge graph completion method based on a relational path neighborhood.

Background

A knowledge graph stores a large number of real-world facts. It is a multi-relational graph consisting of entities (nodes) and relations (edges of different types), and is usually expressed in the form of triplets (head entity, relation, tail entity), denoted (h, r, t). Many knowledge graphs have been constructed, such as WordNet, Freebase and YAGO, and they are widely used in fields such as knowledge reasoning, question answering and recommendation systems.

Because knowledge bases keep growing and their data update cycles keep shortening, a knowledge graph cannot contain all real-world knowledge, so missing knowledge must be predicted from the knowledge already in the graph. This task is called Knowledge Graph Completion (KGC) and comprises the link prediction and triplet classification tasks.

Knowledge representation learning has been proposed to complete knowledge graphs. Its main idea is to first embed the entities and relations of the triplets in the knowledge graph with a knowledge representation learning model, then score the triplets with a scoring function, and finally rank the scoring results from high to low, thereby completing the knowledge graph.

Traditional knowledge representation learning methods have attracted wide attention from researchers because of their strong ability to model knowledge graphs. However, these traditional models have some drawbacks. On the one hand, the more typical models are limited by the translation rule and therefore cannot model complex and diverse entities; on the other hand, when embedding a knowledge graph, these models attend only to the structural information of the triplets and take the fact of a single triplet as input, so the information available about each entity is very limited and the expressive power of the vectors is weak. As a result, these models cannot represent the entities and relations in a knowledge graph well, and their performance on knowledge graph completion is still unsatisfactory. In recent years, to enhance the knowledge representation learning capability of models, various kinds of multi-source information have been used, such as text descriptions, type constraints, visual information, entity attributes, logical rules and relation paths. Combining this auxiliary information with the structural information of the triplets can significantly improve the knowledge representation capability of a model. However, the use of multi-source information also has several problems: (1) the quality of multi-source information is uneven, and existing models lack an effective method for extracting the useful information from it; (2) multi-source information is very rich in variety, but this rich information is not fully utilized; (3) the heterogeneity of head and tail entities in triplets is ignored (that is, the numbers of head entities and tail entities linked by the same relation in a knowledge graph can sometimes differ greatly, yet current models do not take the effect of such differences on entity modeling into account).

Disclosure of Invention

To address the problems of existing knowledge representation models, the invention provides a knowledge graph completion method based on hyperplane projection and relation path neighborhood. The method adds a relation mapping attribute on the basis of the TransH model and, by incorporating the neighborhood information of paths, models the path neighborhood of a large-scale knowledge graph, thereby improving the representation learning capability of the model and the effect of knowledge graph completion.

The technical scheme for realizing the aim of the invention is as follows:

A knowledge graph completion method based on hyperplane projection and relation path neighborhood, comprising the following steps:

1) Embedding the knowledge graph using the structural information of the triplets: given a triplet (h, r, t), the hyperplane projection idea of TransH is used to project the entities onto a relation-specific hyperplane, and the projected head and tail entities are represented as:

$$h_\perp = h - w_r^\top h\, w_r, \qquad t_\perp = t - w_r^\top t\, w_r$$

where $w_r$ is the normal vector of the hyperplane and $d_r$ is the translation vector corresponding to the relation; the scoring function of TransH is defined as $f_r(h, t) = \lVert h_\perp + d_r - t_\perp \rVert$;
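For illustration, the projection and scoring of step 1) can be sketched in plain Python as below. This is a minimal sketch rather than the patented implementation: the helper names (`dot`, `project`, `transh_score`) are hypothetical, vectors are plain lists, the standard TransH projection h - (w_r·h) w_r is used with w_r assumed to be unit length, and the norm is assumed to be the L2 norm because the text does not state which norm is meant.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(e, w_r):
    """Project entity vector e onto the hyperplane whose unit normal is w_r."""
    scale = dot(w_r, e)
    return [x - scale * w for x, w in zip(e, w_r)]

def transh_score(h, t, w_r, d_r):
    """Return ||h_perp + d_r - t_perp|| (L2 norm assumed); lower means more plausible."""
    h_p = project(h, w_r)
    t_p = project(t, w_r)
    return math.sqrt(sum((hp + d - tp) ** 2 for hp, d, tp in zip(h_p, d_r, t_p)))

# Toy example: the hyperplane is the x-y plane and the relation translates by +1 along x.
w_r = [0.0, 0.0, 1.0]
d_r = [1.0, 0.0, 0.0]
h = [0.0, 0.0, 5.0]
t = [1.0, 0.0, -3.0]
score = transh_score(h, t, w_r, d_r)
```

In the toy example the components of h and t along the normal are discarded by the projection, so the translated head lands exactly on the projected tail.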

2) Adding neighborhood information of the paths: many paths surround the head entity or the tail entity of a triplet, and for the model to use the most valuable path neighborhood information, the weight of each path needs to be calculated; the greater the weight of a path, the more valuable the information carried by that path; the head entity and the tail entity of a triplet can be connected in two ways: first, they are directly connected, forming a direct path; second, they are indirectly connected, forming an indirect path, that is, they cannot directly form a triplet because the relation is missing; the same applies when starting from the tail entity; the influence of paths on entity embedding then has to be considered when embedding entities and relations, and this influence is mainly reflected in a secondary embedding, which is a computation over the entities and relations; accordingly, the weights are calculated in two cases: for a direct path, the shortest path is selected and the reciprocal of its path value is taken as the weight; for an indirect path, the number of nodes between the two ends is limited to no more than five (more nodes are unnecessary, since they make the path long, which consumes a large amount of time and memory during training), the relations along each path connecting these nodes are accumulated, the path with the smallest value is selected, and the reciprocal of that smallest value is taken as the weight;
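One possible reading of the weighting rule in step 2) is sketched below. It rests on assumptions the text does not confirm: the "path value" is taken to be the number of relation edges on the path, edges are followed in the head-to-tail direction only, and the limit of five is applied to the number of intermediate nodes. The function name `path_weight` and its parameters are hypothetical.

```python
from collections import deque

def path_weight(triplets, head, tail, max_intermediate=5):
    """Reciprocal of the fewest relation edges linking head to tail.

    A direct edge gives weight 1.0; an indirect path with k edges gives 1/k.
    Returns 0.0 when no path with at most `max_intermediate` intermediate
    nodes exists.
    """
    adjacency = {}
    for h, _rel, t in triplets:
        adjacency.setdefault(h, []).append(t)

    max_hops = max_intermediate + 1
    queue = deque([(head, 0)])
    visited = {head}
    while queue:  # breadth-first search visits shorter paths first
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt == tail:
                return 1.0 / (hops + 1)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return 0.0

kg = [("a", "r1", "b"), ("b", "r2", "c")]
direct = path_weight(kg, "a", "b")    # one edge
indirect = path_weight(kg, "a", "c")  # two edges via b
```

Breadth-first search is used here only because it finds the path with the fewest edges; a weighted graph would need a different shortest-path routine.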

3) Adding mapping attributes of the relations: the idea of TransM is borrowed, which holds that each training triplet is associated with a weight representing its degree of mapping, and that the mapping property of a triplet depends largely on the relation between the head entity and the tail entity, so the weight is relation-specific; to improve the model's ability to handle complex relations, different relations are given different weights so that the model can distinguish them; to calculate the weights, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are first calculated, and the weight of each relation is then calculated according to equation (1);
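The two per-relation averages used in step 3) can be computed as in the following sketch. The function name `relation_stats` and the short names `tph` (tails per head) and `hpt` (heads per tail) are hypothetical labels for the two averages described in the text; the relation weight of equation (1) itself is not computed here because the equation is not reproduced in the text.

```python
from collections import defaultdict

def relation_stats(triplets):
    """Return {relation: (tph, hpt)} for a list of (h, r, t) triplets.

    tph: average number of tail entities per distinct head entity
    hpt: average number of head entities per distinct tail entity
    """
    pairs = defaultdict(set)
    for h, r, t in triplets:
        pairs[r].add((h, t))  # duplicates of the same fact count once

    stats = {}
    for r, head_tail_pairs in pairs.items():
        heads = {h for h, _ in head_tail_pairs}
        tails = {t for _, t in head_tail_pairs}
        n = len(head_tail_pairs)
        stats[r] = (n / len(heads), n / len(tails))
    return stats

facts = [
    ("alice", "lives_in", "paris"),
    ("bob", "lives_in", "paris"),
    ("carol", "lives_in", "rome"),
]
stats = relation_stats(facts)
```

For the toy relation, three people map to two cities, so each head has one tail on average while each tail has 1.5 heads on average.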

4) Designing the scoring function of TransH-RPN, the knowledge graph completion model based on hyperplane projection and relation path neighborhood;

5) During model training, head and tail entities are replaced using a probability method, and the replacement entities are selected according to entity similarity;

5.1) Replacing head and tail entities using a probability method: to reduce the generation of false negative triplets, the tail entity is replaced with higher probability for many-to-one relations, and the head entity is replaced with higher probability for one-to-many relations; given a relation and all positive triplets (h, r, t) involving it, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are first calculated, and sampling then follows a Bernoulli distribution parameterized by these two averages; when a negative triplet is constructed from a positive triplet, the head entity is replaced with probability $q$ and the tail entity with probability $1 - q$, so that the total probability is 1 and the sampling follows a Bernoulli distribution;

for each relation r, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are calculated; when $t_r qh_r < 1.5$ and $h_r qt_r < 1.5$, the relation r is one-to-one; when $t_r qh_r \geq 1.5$ and $h_r qt_r \geq 1.5$, the relation r is many-to-many; when $t_r qh_r \geq 1.5$ and $h_r qt_r < 1.5$, the relation r is one-to-many; when $t_r qh_r < 1.5$ and $h_r qt_r \geq 1.5$, the relation r is many-to-one;
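The relation categories and the probabilistic replacement of step 5.1) might be sketched as follows. Two points are assumptions rather than statements of the text: the Bernoulli parameter is taken to be tph / (tph + hpt), the choice used by the original TransH sampling strategy, and a value of exactly 1.5 is treated as "many". All function names are hypothetical, and `tph` and `hpt` stand for the two per-relation averages described above.

```python
import random

def relation_category(tph, hpt, threshold=1.5):
    """Classify a relation from its tails-per-head and heads-per-tail averages."""
    if tph < threshold and hpt < threshold:
        return "one-to-one"
    if tph >= threshold and hpt >= threshold:
        return "many-to-many"
    if tph >= threshold:
        return "one-to-many"
    return "many-to-one"

def head_replacement_probability(tph, hpt):
    """Assumed Bernoulli parameter q: probability of replacing the head."""
    return tph / (tph + hpt)

def corrupt(triplet, entities, tph, hpt, rng):
    """Build one negative triplet by replacing the head (prob. q) or the tail (prob. 1 - q)."""
    h, r, t = triplet
    q = head_replacement_probability(tph, hpt)
    if rng.random() < q:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

rng = random.Random(0)  # seeded only to make the sketch repeatable
negative = corrupt(("a", "r", "b"), ["a", "b", "c"], tph=3.0, hpt=1.0, rng=rng)
```

With tph = 3 and hpt = 1 the relation is one-to-many, so the head is replaced three times out of four on average, which matches the intent of lowering the chance of producing a false negative.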

5.2) Selecting entities based on similarity:

when entities are of similar types, they are distributed within a relatively close range in the vector space; for example, for the relation "lives in", the corresponding head entities are person names and the corresponding tail entities are places, so the person names are concentrated in one region and the places in another; when similarity between entities is judged, the semantic similarity between entities or relations is used and reflected in the vector space, that is, the similarity between vectors is calculated by a distance function dis;

thus, given a positive triplet (h, r, t), when the head entity is replaced to generate a negative triplet (h', r, t), h' is chosen such that dis(h, h') is minimal; when the tail entity is replaced to generate a negative triplet (h, r, t'), t' is chosen such that dis(t, t') is minimal;
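The similarity-based choice of the replacement entity can be sketched as a nearest-neighbour lookup. Because the distance formula is not reproduced in the text, the Euclidean distance is assumed here; the names `euclidean` and `nearest_replacement` and the `exclude` parameter are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed form of dis)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_replacement(entity, embeddings, exclude=()):
    """Pick the other entity whose embedding is closest to `entity`.

    `exclude` lists entities that must not be chosen, for example ones that
    would recreate a known positive triplet (a hypothetical safeguard).
    """
    candidates = [e for e in embeddings if e != entity and e not in exclude]
    return min(candidates, key=lambda e: euclidean(embeddings[entity], embeddings[e]))

embeddings = {
    "paris": [0.0, 0.0],
    "rome": [1.0, 0.0],
    "alice": [10.0, 10.0],
}
replacement = nearest_replacement("paris", embeddings)
```

In the toy embedding the two places lie close together and the person name lies far away, so a place is replaced by another place, which yields a harder negative than a random entity would.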

during model training, to distinguish correct triplets from incorrect ones, a margin-based loss function is adopted as the optimization objective of the model;

in the loss function, S denotes the set of correct triplets, S' denotes the set of incorrect triplets, max(x, y) returns the larger of x and y, and γ denotes the margin separating the scores of correct triplets from the scores of incorrect triplets; the objective function is therefore optimized to separate correct triplets from incorrect triplets as far as possible;
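A margin-based objective of this kind is commonly written as a sum of hinge terms. The sketch below assumes the form max(0, f(correct) + γ - f(incorrect)) with lower scores meaning more plausible triplets, which matches the thresholding rule of step 7); the exact formula of the invention is not reproduced in the text, and `margin_loss` is a hypothetical name.

```python
def margin_loss(pos_scores, neg_scores, gamma):
    """Sum of hinge terms over paired correct/incorrect triplet scores.

    A pair contributes nothing once the incorrect triplet scores at least
    `gamma` higher (worse) than the correct one.
    """
    return sum(max(0.0, p + gamma - n) for p, n in zip(pos_scores, neg_scores))

# First pair is already separated by more than the margin; the second is not.
loss = margin_loss([0.5, 2.0], [3.0, 1.0], gamma=1.0)
```

Only the second pair contributes here, since its incorrect triplet scores better than the correct one.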

when minimizing the objective function L, the following constraints are mainly considered:

formula (2) ensures that the length of every entity vector is less than or equal to 1, formula (3) ensures that the translation vector of the relation r lies in the projection hyperplane, and formula (4) ensures that the normal vector of the hyperplane is a unit vector;
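The three constraints can be checked numerically as in the sketch below. Formulas (2) to (4) are not reproduced in the text, so the forms used in the original TransH model are assumed: ||e|| <= 1, |w_r·d_r| / ||d_r|| <= eps, and ||w_r|| = 1. The tolerance `eps` and the function names are hypothetical.

```python
import math

def norm(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def constraint_violations(entity, w_r, d_r, eps=1e-3):
    """Return how far each assumed constraint is violated (0.0 means satisfied)."""
    inner = sum(w * d for w, d in zip(w_r, d_r))
    return {
        "entity_norm": max(0.0, norm(entity) - 1.0),                # formula (2)
        "orthogonality": max(0.0, abs(inner) / norm(d_r) - eps),    # formula (3)
        "unit_normal": abs(norm(w_r) - 1.0),                        # formula (4)
    }

violations = constraint_violations(entity=[3.0, 4.0], w_r=[0.0, 1.0], d_r=[2.0, 0.0])
```

In the toy call the entity vector is too long by 4, while the translation vector is orthogonal to a unit normal, so only the first constraint is violated. Such violation terms could be added to the loss as soft penalties, or the vectors could be renormalized after each update; the text does not say which is used.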

in actual training, stochastic gradient descent is used to optimize the objective function, and the best experimental results are obtained by tuning the learning rate, the margin, the embedding dimension, the batch size and the norm type;

6) Link prediction for knowledge graph completion based on hyperplane projection and relation path neighborhood: the goal of link prediction is to predict the missing h or t of a triplet (h, r, t) from the knowledge already in the knowledge base; first, negative triplets are constructed: for each positive triplet (h, r, t) in the test set, the head entity or the tail entity is removed and replaced in turn by every entity in the entity set; the scores of these corrupted triplets are then calculated and arranged in descending order; finally, the rank of the correct entity is recorded; the task emphasizes the rank of the correct entity rather than finding only the single best entity;
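The ranking step of link prediction can be sketched as follows. The function `rank_of_correct_entity` and its `higher_is_better` flag are hypothetical; the flag exists because the ordering direction depends on whether the score is read as a distance (lower is better, the default assumed here) or as a plausibility (higher is better).

```python
def rank_of_correct_entity(candidates, correct, score_fn, higher_is_better=False):
    """Return the 1-based rank of `correct` among all candidate entities.

    `score_fn(entity)` is expected to return the score of the triplet obtained
    by putting `entity` into the missing head or tail position.
    """
    ordered = sorted(candidates, key=score_fn, reverse=higher_is_better)
    return ordered.index(correct) + 1

# Toy scores for the corrupted triplets (alice, lives_in, ?).
toy_scores = {"paris": 0.3, "rome": 0.1, "berlin": 0.9}
rank = rank_of_correct_entity(list(toy_scores), "paris", toy_scores.get)
```

Recording the rank for every test triplet, rather than only the top candidate, is what allows rank-based evaluation of the model.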

7) Triplet classification for knowledge graph completion based on hyperplane projection and relation path neighborhood: the purpose of triplet classification is to judge whether a given triplet (h, r, t) is correct, which is a binary classification task; this evaluation requires negative samples, but all data in an existing knowledge graph is regarded as correct, so a negative sample set must be constructed such that the ratio of positive to negative samples is 1:1; once the negative sample set is constructed, the scoring function is applied to the entity and relation vectors learned by the model to obtain the score of every triplet; during training, a threshold $\sigma_r$ is determined as the value that yields the highest classification accuracy on the validation set; this threshold is closely tied to the relation, so a different threshold is determined for each relation; for a triplet (h, r, t), if its score is less than the threshold $\sigma_r$, the triplet is predicted to be correct, and otherwise it is predicted to be incorrect.
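The per-relation threshold selection of step 7) might look like the sketch below, which would be called once per relation on that relation's validation triplets. The candidate thresholds (every observed score plus one value above the maximum) and the name `best_threshold` are assumptions made for illustration.

```python
def best_threshold(scores, labels):
    """Return (sigma, accuracy) maximizing validation accuracy for one relation.

    A triplet is predicted correct when its score is strictly below sigma.
    `labels` holds True for correct triplets and False for incorrect ones.
    """
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)  # allows "everything is correct"
    best_sigma, best_acc = None, -1.0
    for sigma in candidates:
        hits = sum((s < sigma) == y for s, y in zip(scores, labels))
        acc = hits / len(scores)
        if acc > best_acc:  # ties keep the smallest threshold
            best_sigma, best_acc = sigma, acc
    return best_sigma, best_acc

val_scores = [0.2, 0.4, 1.5, 2.0]
val_labels = [True, True, False, False]
sigma_r, accuracy = best_threshold(val_scores, val_labels)
```

At test time a triplet of the same relation would then be classified by comparing its score with `sigma_r`.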

Compared with the prior knowledge representation model, the technical scheme has the beneficial effects that:

1) The neighborhood information of the path is fully utilized, and the representation learning capacity of the model is improved;

2) The mapping attribute of the relation is added, so that the model handles the complex relations in the triplets better;

3) The probability method is used for replacing the head entity and the tail entity in the triples, so that the quality of the generated negative triples is improved.

Taken together, these points improve the results of link prediction and triplet classification, and the technical scheme outperforms the traditional baseline models.

The method adds a relation mapping attribute on the basis of the TransH model and, by incorporating the neighborhood information of paths, models the path neighborhood of a large-scale knowledge graph, thereby improving the representation learning capability of the model and the effect of knowledge graph completion.

Drawings

FIG. 1 is a diagram of a TransH-RPN model in an embodiment;

FIG. 2 is an exemplary diagram of an indistinguishable entity.

Detailed Description

The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.

Examples:

Referring to FIG. 1, a knowledge graph completion method based on hyperplane projection and relation path neighborhood includes the following steps:

2) Adding neighborhood information of the paths: many paths surround the head entity and the tail entity of a triplet, and for the model to use the most valuable path neighborhood information, the weight of each path connecting the head entity and the tail entity needs to be calculated; the greater the weight of a path, the more valuable the information carried by that path; the head entity and the tail entity of a triplet in the knowledge graph can be connected in two ways: first, they are directly connected, forming a direct path; second, they are indirectly connected, forming an indirect path, that is, they cannot directly form a triplet because the relation is missing; the same applies when starting from the tail entity; the influence of paths on entity embedding then has to be considered when embedding entities and relations, and this influence is mainly reflected in a secondary embedding, which is a computation over the entities and relations; accordingly, the weights are calculated in two cases: for a direct path, the shortest path is selected and the reciprocal of its path value is taken as the weight; for an indirect path, the number of nodes between the two ends is limited to no more than five, the relations along each path connecting these nodes are accumulated, the path with the smallest value is selected, and the reciprocal of that smallest value is taken as the weight;

3) Adding mapping attributes of the relations: there are four types of relations between the head and tail entities of a triplet: 1-1, 1-N, N-1 and N-N; however, when TransH projects entities and relations, it does not distinguish the different relations in the triplets, so a relation mapping attribute is added on top of the hyperplane projection of TransH, borrowing the idea of TransM; in this example, each training triplet is associated with a weight representing its degree of mapping, and the mapping property of a triplet depends largely on the relation between the head entity and the tail entity, so the weight is relation-specific; to improve the model's ability to handle complex relations, different relations are given different weights so that the model can distinguish them; to calculate the weights, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are first calculated, and the weight of each relation is then calculated according to equation (1);

By calculating weights for different relations, the model can distinguish different relations in the triples, so that the capability of the model for processing complex relations is improved;

4) Based on steps 1) to 3), the scoring function of TransH-RPN, the knowledge graph completion model based on hyperplane projection and relation path neighborhood, is designed; it includes a term that represents the weights of the relation paths;

5.1) Replacing head and tail entities using a probability method: to reduce the generation of false negative triplets, the tail entity is replaced with higher probability for many-to-one relations, and the head entity is replaced with higher probability for one-to-many relations; given a relation and all positive triplets (h, r, t) involving it, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are first calculated, and sampling then follows a Bernoulli distribution parameterized by these two averages; when a negative triplet is constructed from a positive triplet, the head entity is replaced with probability $q$ and the tail entity with probability $1 - q$, so that the total probability is 1 and the sampling follows a Bernoulli distribution;

5.2) Selecting entities based on similarity:

when entities are of similar types, they are distributed within a relatively close range in the vector space; for example, for the relation "lives in", the corresponding head entities are person names and the corresponding tail entities are places, so the person names are concentrated in one region and the places in another, as shown in FIG. 2; when similarity between entities is judged, the semantic similarity between entities or relations is used and reflected in the vector space, that is, the similarity between vectors is calculated by a distance function;

in the margin-based loss function used as the training objective, S denotes the set of correct triplets, S' denotes the set of incorrect triplets, max(x, y) returns the larger of x and y, and γ denotes the margin separating the scores of correct triplets from the scores of incorrect triplets, so the optimization goal of the objective function is to separate correct triplets from incorrect triplets as far as possible;

6) Link prediction for knowledge graph completion based on hyperplane projection and relation path neighborhood: the goal of link prediction is to predict the missing h or t of a triplet (h, r, t) from the knowledge already in the knowledge base; first, negative triplets are constructed: for each positive triplet (h, r, t) in the test set, the head entity or the tail entity is removed and replaced in turn by every entity in the entity set; the scores of these corrupted triplets are then calculated and arranged in descending order; finally, the rank of the correct entity is recorded; the task emphasizes the rank of the correct entity rather than finding only the single best entity;

7) Triplet classification for knowledge graph completion based on hyperplane projection and relation path neighborhood: the purpose of triplet classification is to judge whether a given triplet (h, r, t) is correct, which is a binary classification task; this evaluation requires negative samples, but all data in an existing knowledge graph is regarded as correct, so a negative sample set must be constructed such that the ratio of positive to negative samples is 1:1; the scoring function is then applied to the entity and relation vectors learned by the model to obtain the score of every triplet; during training, a threshold $\sigma_r$ is determined as the value that yields the highest classification accuracy on the validation set; this threshold is closely tied to the relation, so a different threshold is determined for each relation; for a triplet (h, r, t), if its score is less than the threshold $\sigma_r$, the triplet is predicted to be correct, and otherwise it is predicted to be incorrect.

The method serves as a knowledge representation learning model and is applied to knowledge graph completion. A knowledge representation learning model maps entities and relations into a low-dimensional continuous space and predicts missing links in the knowledge graph through computations on numerical vectors, thereby completing the knowledge graph. In this example, the proposed TransH-RPN model first projects the entities in the knowledge graph onto relation-specific hyperplanes to obtain the vector representations of the head and tail entities in the hyperplane; the scoring function f(h, t) is then computed on these vectors to judge the probability that a candidate triplet holds, and the objective function is optimized so that the score of a positive triplet is greater than that of a negative triplet. After scoring, the scores of all triplets are ranked from high to low; the higher the score, the greater the probability that the triplet holds, and the highest-scoring triplets are added to the knowledge graph, thereby completing it.

Claims

1. A knowledge graph completion method based on hyperplane projection and relation path neighborhood is characterized by comprising the following steps:

2) Adding neighborhood information of the paths: to improve the representation capability of the model, neighborhood information of the paths is added; many paths surround the head entity and the tail entity of a triplet, and for the model to use the most valuable path neighborhood information, the weight of each path needs to be calculated; the greater the weight of a path, the more valuable the information carried by that path; the head entity and the tail entity of a triplet can be connected in two ways: first, they are directly connected, forming a direct path; second, they are indirectly connected, forming an indirect path, that is, they cannot directly form a triplet because the relation is missing; the same applies when starting from the tail entity; the influence of paths on entity embedding then has to be considered when embedding entities and relations, and this influence is mainly reflected in a secondary embedding, which is a computation over the entities and relations; accordingly, the weights are calculated in two cases: for a direct path, the shortest path is selected and the reciprocal of its path value is taken as the weight; for an indirect path, the number of nodes between the two ends is limited to no more than five, the relations along each path connecting these nodes are accumulated, the path with the smallest value is selected, and the reciprocal of that smallest value is taken as the weight;

3) Adding mapping attributes of the relations: the idea of TransM is borrowed, which holds that each training triplet is associated with a weight representing its degree of mapping, and that the mapping property of a triplet depends on the relation between the head entity and the tail entity in the triplet, so the weight is relation-specific; to improve the model's ability to handle complex relations, different relations are given different weights so that the model can distinguish them; to calculate the weights, the average number of tail entities per head entity, $t_r qh_r$, and the average number of head entities per tail entity, $h_r qt_r$, are first calculated, and the weight of each relation is then calculated according to equation (1);

when a negative triplet is constructed from a positive triplet, the head entity is replaced with probability $q$ and the tail entity with probability $1 - q$, so that the total probability is 1 and the sampling follows a Bernoulli distribution;

5.2) Selecting entities based on similarity:

when entities are of similar types, they are distributed within a relatively close range in the vector space; for example, for the relation "lives in", the corresponding head entities are person names and the corresponding tail entities are places, so the person names are concentrated in one region and the places in another; when similarity between entities is judged, the semantic similarity between entities or relations is used and reflected in the vector space, that is, the similarity between vectors is calculated by a distance function;

6) Link prediction for knowledge graph completion based on hyperplane projection and relation path neighborhood: the goal of link prediction is to predict the missing h or t of a triplet (h, r, t) from the knowledge already in the knowledge base; first, negative triplets are constructed: for each positive triplet (h, r, t) in the test set, the head entity or the tail entity is removed and replaced in turn by every entity in the entity set; the scores of these corrupted triplets are then calculated and arranged in descending order; finally, the rank of the correct entity is recorded; the task emphasizes the rank of the correct entity rather than finding only the single best entity;

7) Triplet classification for knowledge graph completion based on hyperplane projection and relation path neighborhood: the purpose of triplet classification is to judge whether a given triplet (h, r, t) is correct, which is a binary classification task; this evaluation requires negative samples, but all data in an existing knowledge graph is regarded as correct, so a negative sample set must be constructed such that the ratio of positive to negative samples is 1:1; once the negative sample set is constructed, the scoring function is applied to the entity and relation vectors learned by the model to obtain the score of every triplet; during training, a threshold $\sigma_r$ is determined as the value that yields the highest classification accuracy on the validation set; this threshold is closely tied to the relation, so a different threshold is determined for each relation; for a triplet (h, r, t), if its score is less than the threshold $\sigma_r$, the triplet is predicted to be correct, and otherwise it is predicted to be incorrect.