CN116701647A - Knowledge graph completion method and device based on fusion of embedded vector and transfer learning - Google Patents

Knowledge graph completion method and device based on fusion of embedded vector and transfer learning

Info

Publication number
CN116701647A
CN116701647A
Authority
CN
China
Prior art keywords
knowledge graph
triples
fusion
knowledge
transfer learning
Prior art date
Legal status
Pending
Application number
CN202310575362.2A
Other languages
Chinese (zh)
Inventor
崇庆魏
傅薛林
蔡炎松
窦辰晓
Current Assignee
Nanhu Research Institute Of Electronic Technology Of China
Original Assignee
Nanhu Research Institute Of Electronic Technology Of China
Priority date
Filing date
Publication date
Application filed by Nanhu Research Institute Of Electronic Technology Of China filed Critical Nanhu Research Institute Of Electronic Technology Of China
Priority to CN202310575362.2A
Publication of CN116701647A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a knowledge graph completion method and device based on the fusion of embedded vectors and transfer learning, comprising the steps of: performing coarse-ranking scoring on the predicted triples in the knowledge graph to be completed using the embedded vectors of a shallow network, and selecting the Top-k results; performing fine-ranking scoring on the predicted triples in the Top-k results using a pre-trained language model, and selecting the Top-N results; and completing the knowledge graph to be completed using the predicted triples in the Top-N results. The application solves the problems of poor knowledge graph completion quality, time-consuming completion inference, and excessive use of computational resources.

Description

Knowledge graph completion method and device based on fusion of embedded vector and transfer learning
Technical Field
The application belongs to the technical field of knowledge graphs, and particularly relates to a knowledge graph completion method and device based on the fusion of embedded vectors and transfer learning.
Background
With the advent of the artificial intelligence and data age, information overload has become one of the most pressing problems, and the knowledge graph is one of the most important tools for combating information overload and accumulation. A knowledge graph is a graph formed from countless entity nodes and the relationship edges linking those entities, obtained by filtering and extracting a large amount of information and distilling the most essential knowledge. Knowledge graphs are therefore widely applied, for example in the recall and ranking stages of e-commerce recommendation and search systems, in question-answering systems, and so on. However, as research on knowledge graphs has deepened, researchers have found that they suffer from serious incompleteness: for example, 71% of the person entities in the Freebase data lack birthplace information, and 58% of the scientific research directions in DBpedia are missing. This incompleteness affects the practical effectiveness of knowledge graphs, so knowledge graph completion is an important research direction. The common knowledge graph completion algorithms at present are as follows:
1. Rule-based knowledge graph completion algorithms: these are among the earliest knowledge graph completion algorithms. They infer new information in the knowledge graph from manually written rules; for example, rules may be used to infer a relation or attribute between two entities. The method is limited, however, in that writing and maintaining rules is labor intensive, and complex semantic relations are difficult to handle.
2. Knowledge graph completion algorithms based on statistical learning: with the rise of machine learning and statistical learning, statistical methods were gradually applied to knowledge graph completion. These methods use large-scale corpora and statistical models for reasoning and prediction. For example, relation-classification approaches predict relations between entities from their semantic features, and graph-model approaches, such as conditional random fields and Markov logic networks, model and reason over the knowledge graph. However, these methods require large-scale corpora, data acquisition and processing are difficult, the inference quality depends too strongly on corpus quality, and a stable completion effect is hard to guarantee.
3. Knowledge graph completion algorithms based on embedded vectors: vector embedding is a technique that maps entities and relations into a continuous vector space and can capture the semantic associations between them. Embedding-based completion algorithms learn embedded vectors for entities and relations and predict missing facts by computing similarities in the vector space. These methods include shallow networks such as TransE, TransR, and TransH. The embedding approach overcomes the limitations of the rule-based and statistical methods to some extent, but negative-sample sampling and model parameter tuning remain challenging.
4. Knowledge graph completion algorithms based on graph neural networks: graph neural networks are a class of neural network models, prominent in recent years, for processing graph-structured data. They represent the knowledge graph as a graph and use node and edge information for reasoning and prediction. A graph neural network learns representations of nodes and edges, and its multi-layer structure lets it propagate information across the hierarchy of the graph, which is an advantage for knowledge graph completion. For example, Graph Convolutional Networks (GCN) update a node's representation at each layer using the features of the node and its neighbors; Graph Attention Networks (GAT) introduce an attention mechanism that dynamically focuses on important neighbors; and GraphSAGE trains efficiently by sampling neighbor nodes. These methods perform well in knowledge graph completion and can capture more complex semantic relations and graph structures, but they are relatively complex to train and apply.
5. Knowledge graph completion algorithms based on transfer learning: transfer learning migrates existing knowledge and models to a new task or field, alleviating data scarcity and the difficulty of labeling. In knowledge graph completion, transfer learning can exploit existing knowledge graphs and data from related fields. For example, a pre-trained language model can be used to obtain embedded representations of entities and relations and apply them to the completion task. Transfer learning can also exploit cross-lingual and cross-domain knowledge graphs, migrating their information into the target knowledge graph to improve the completion effect.
Among all knowledge graph completion algorithms, those based on transfer learning perform best. A major problem with transfer learning methods, however, is that the models are very complex, and the computation and inference time they require are enormous. Moreover, such algorithms have some probability of ignoring the information originally present in the knowledge graph being reasoned over: by over-relying on transferred knowledge, their inference is in some respects inferior to embedding-based completion methods such as TransE. Existing knowledge graph completion algorithms can thus complete the knowledge graph, but they all have shortcomings in one aspect or another and do not truly achieve the expected effect.
Disclosure of Invention
The application aims to provide a knowledge graph completion method based on the fusion of embedded vectors and transfer learning, which solves the problems of poor knowledge graph completion quality, time-consuming completion inference, and excessive use of computational resources.
In order to achieve the above purpose, the technical scheme adopted by the application is as follows:
A knowledge graph completion method based on the fusion of embedded vectors and transfer learning, comprising the following steps:
performing coarse-ranking scoring on the predicted triples in the knowledge graph to be completed using the embedded vectors of a shallow network, and selecting the Top-k results;
performing fine-ranking scoring on the predicted triples in the Top-k results using a pre-trained language model obtained by transfer learning, and selecting the Top-N results;
and completing the knowledge graph to be completed using the predicted triples in the Top-N results.
The following provides several optional implementations. They are not additional limitations on the overall scheme above, only further additions or preferences; each option may be combined with the overall scheme on its own, and multiple options may be combined with one another, provided no technical or logical contradiction arises.
Preferably, the input of the pre-trained language model is a sequence constructed from CLS, Head Description, SEP, Relation Description, SEP, Tail Description, and SEP, where CLS is a classification token, SEP is a segmentation token, Head Description is the head entity in the predicted triple, Relation Description is the relation in the predicted triple, and Tail Description is the tail entity in the predicted triple.
Preferably, the backbone model in the pre-trained language model is a BERT model, and the BERT model is connected to a classifier to output the fine-ranking score.
Preferably, when training the pre-trained language model, the triples in a known knowledge graph are taken as real triples; for each real triple, the head entity or the tail entity is replaced with another entity of the known knowledge graph to obtain virtual triples, and the pre-trained language model is trained by contrastive learning with the real triples as positive samples and the virtual triples as negative samples.
Preferably, the total loss when training the pre-trained language model is the sum of the structural loss and the contrastive learning loss.
Preferably, the structural loss is calculated as:

l_s = f(h, r, t) = ||h + r - t||_2

where l_s is the structural loss, f(·) is the structural loss function, h is the Embedding representation of the head entity in the predicted triple, r is the Embedding representation of the relation in the predicted triple, and t is the Embedding representation of the tail entity in the predicted triple.
Preferably, the contrastive learning loss is calculated as:

l_c = -log [ exp((φ(h, r, t) - γ)/τ) / ( exp((φ(h, r, t) - γ)/τ) + Σ_{i=1}^{N} exp(φ(h'_i, r, t'_i)/τ) ) ]

where l_c is the contrastive learning loss, φ(h, r, t) is the score of the positive sample output by the pre-trained language model, γ is the margin hyperparameter, τ is the temperature coefficient, N is the number of negative samples corresponding to the positive sample, and φ(h'_i, r, t'_i) is the score of the i-th negative sample output by the pre-trained language model.
Preferably, the number k of the Top-k results is determined from all_node_num, the number of nodes in the knowledge graph to be completed, and a preset number threshold a.
Preferably, completing the knowledge graph to be completed using the predicted triples in the Top-N results comprises: adding all or part of the predicted triples in the Top-N results to the knowledge graph to be completed as real triples.
In the knowledge graph completion method based on the fusion of embedded vectors and transfer learning according to the application, a shallow network is introduced to filter the candidate set for inference, ensuring that the correct solutions survive into the Top-k candidate set while a large number of invalid solutions are filtered out; this addresses the time-consuming completion inference and excessive use of computational resources in the prior art. A pre-trained language model is then used for fine-ranking screening on that basis, addressing the poor completion quality of existing knowledge graph methods.
A second object of the application is to provide a knowledge graph completion device based on the fusion of embedded vectors and transfer learning, comprising a processor and a memory storing a number of computer instructions which, when executed by the processor, implement the steps of the knowledge graph completion method based on the fusion of embedded vectors and transfer learning.
Drawings
FIG. 1 is a flow chart of a knowledge graph completion method based on the fusion of embedded vectors and transfer learning;
FIG. 2 is a logic diagram of data transfer according to an embodiment of a knowledge graph completion method based on fusion of embedded vectors and transfer learning;
FIG. 3 is a schematic diagram of the structure of the pre-training language model of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the protection scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Knowledge graph completion aims to predict the missing parts of triples so that the knowledge graph becomes more complete. The basic components of a knowledge graph are an entity set, a relation set, and the corresponding triples. Because the number of entities and relations recorded in a knowledge graph is finite, some entities and relations may be absent from it.
According to the content to be completed, knowledge graph completion can be divided into three subtasks: (1) predicting the head entity, given a triple containing only a relation and a tail entity; (2) predicting the tail entity, given a triple containing only a head entity and a relation; (3) predicting the relation between the head entity and the tail entity, given a triple containing only those two entities.
According to whether the entities and relations in the triples already belong to the knowledge graph, knowledge graph completion can be divided into two major categories: static knowledge graph completion (Static KGC) and dynamic knowledge graph completion (Dynamic KGC). Static completion involves only entities and relations already present in the original knowledge graph, while dynamic completion involves relations and entities not found there, and so can expand the entity and relation sets of the original graph. The knowledge graph completion method of the application is mainly applied to static knowledge graph completion.
As shown in FIG. 1, this embodiment provides a knowledge graph completion method based on the fusion of embedded vectors and transfer learning, comprising the following steps:
Step 1, perform coarse-ranking scoring on the predicted triples in the knowledge graph to be completed using the embedded vectors of a shallow network, and select the Top-k results.
This embodiment introduces a shallow network to filter the candidate set for inference, ensuring that the correct solutions survive into the Top-k candidate set while a large number of invalid solutions are filtered out. The embedded vectors of the shallow network may come from a RotatE network model, which may also be replaced by a network model such as TransE, TransH, TransR, ConvKB, or ConvE.
As shown in FIG. 2, this embodiment adopts a RotatE network model. Its advantage is that inference over the prediction data can be computed rapidly through a vector Hadamard (element-wise) product; although RotatE ranks the very top results imprecisely, it distinguishes well whether a head entity and a tail entity are plausibly related under a given relation, so most irrelevant entities can be filtered out.
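For concreteness, a minimal sketch of this kind of RotatE-style scoring is given below. It is an assumed implementation, not code from the patent: entity embeddings are complex vectors, the relation acts as a unit-modulus rotation, and the score is the negative distance between the rotated head and the tail, computed with an element-wise (Hadamard) product.

```python
# Minimal RotatE-style scoring sketch (assumed implementation; tensor shapes
# and names are illustrative, not taken from the patent).
import torch

def rotate_score(h: torch.Tensor, r_phase: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """h, t: complex entity embeddings of shape (..., dim);
    r_phase: relation rotation angles of shape (..., dim).
    Returns a plausibility score (larger = more plausible)."""
    r = torch.polar(torch.ones_like(r_phase), r_phase)      # unit-modulus complex rotation
    rotated = h * r                                         # element-wise (Hadamard) product
    return -torch.linalg.vector_norm(rotated - t, dim=-1)   # negative head-to-tail distance
```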
The k in Top-k is a hyperparameter and may be chosen according to the node scale of the knowledge graph. In this embodiment, k is determined from all_node_num, the number of nodes in the knowledge graph to be completed, and a preset number threshold a, which is set empirically, for example to 100, 200, or 300.
It should be noted that the predicted triples here are understood as follows: a to-be-completed triple is one missing its head entity, relation, or tail entity, and a predicted triple is formed by filling the missing position with an entity or relation from the knowledge graph to be completed. For example, if the knowledge graph contains a triple missing its head entity, then each entity in the graph's entity set is tried as the head entity, forming a number of predicted triples from that one incomplete triple. The shallow network performs coarse-ranking scoring over these predicted triples.
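As an illustration, a hedged sketch of this candidate construction and coarse filtering follows; all names (entity_emb, rel_phase, tail_emb) are assumptions for the example, and the missing-head case is shown.

```python
# Sketch of coarse ranking for a triple with a missing head: every entity is
# tried as the head, scored with the shallow network, and only Top-k are kept.
import torch

def coarse_rank_missing_head(entity_emb: torch.Tensor,   # (num_entities, dim), complex
                             rel_phase: torch.Tensor,    # (dim,) rotation angles
                             tail_emb: torch.Tensor,     # (dim,), complex
                             k: int):
    r = torch.polar(torch.ones_like(rel_phase), rel_phase)
    scores = -torch.linalg.vector_norm(entity_emb * r - tail_emb, dim=-1)
    topk = torch.topk(scores, k=min(k, entity_emb.shape[0]))
    return topk.indices, topk.values  # candidate head ids and their coarse scores
```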
Before the RotatE network model is used, it must be trained. In this embodiment, the RotatE network model is trained on an existing knowledge graph: part of the real relations are extracted from it as a test set, and the rest serve as the training set for training and testing. Note that the knowledge graph in FIG. 2 only indicates that the training data are taken from a knowledge graph and that the object of completion is a knowledge graph; it does not require the training data and the knowledge graph to be completed to come from the same knowledge graph.
To improve the completion effect and efficiency, this embodiment introduces the shallow network not to output the final completion predictions directly from its scores, but mainly to rapidly filter out large numbers of irrelevant nodes, performing one effective filtering pass over the entire candidate set.
Step 2, perform fine-ranking scoring on the predicted triples in the Top-k results using a pre-trained language model obtained by transfer learning, and select the Top-N results.
Transfer learning alleviates data scarcity and the difficulty of labeling by migrating existing knowledge graphs and models to a new task or field. The unsupervised nature of pre-trained language model training makes massive training samples easy to obtain, and the trained language model contains a great deal of semantic and grammatical knowledge, which markedly improves downstream tasks. In this embodiment, the backbone model of the pre-trained language model is a BERT model, and the BERT model is connected to a classifier to output the fine-ranking score.
This embodiment introduces the pre-trained BERT model as the backbone to improve the quality of the model's Embedding representations. In other embodiments, the backbone is not limited to BERT; a similar pre-trained model, such as NEZHA, RoBERTa, MacBERT, ALBERT, XLNet, ERNIE, or SpanBERT, may serve as the pre-trained model.
To improve the training effect, this embodiment introduces the idea of contrastive learning; the pre-trained language model is a BERT model based on contrastive learning (the C-BERT model for short). During training, the complete triples of a known knowledge graph are taken as real triples; for each real triple, the head entity or the tail entity is replaced with another entity of the known knowledge graph to obtain virtual triples, and the model is trained by contrastive learning with the real triples as positive samples and the virtual triples as negative samples. To ensure the effect of contrastive training, a ratio range is usually set for the numbers of positive and negative samples, for example a positive-to-negative ratio of 1:10, which can be adjusted according to empirical values.
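The negative-sample construction described here can be sketched as follows (illustrative code; entities are plain ids, and the 1:10 ratio appears as the default num_neg):

```python
# Illustrative negative sampling for contrastive training: corrupt either the
# head or the tail of a real triple (never both), skipping the original triple.
import random

def make_negatives(triple, entity_ids, num_neg=10):
    h, r, t = triple
    negatives = []
    while len(negatives) < num_neg:
        e = random.choice(entity_ids)
        corrupted = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if corrupted != triple:          # keep only genuinely corrupted triples
            negatives.append(corrupted)
    return negatives
```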
As shown in FIG. 3, the input to the pre-trained language model is a sequence constructed from CLS, Head Description, SEP, Relation Description, SEP, Tail Description, and SEP, where CLS is a classification token, SEP is a segmentation token, Head Description is the head entity in the predicted triple, Relation Description is the relation in the predicted triple, and Tail Description is the tail entity in the predicted triple.
The input passes through the BERT model (the Language model in FIG. 3), which outputs the Embedding representation of the head entity, the Embedding representation of the relation, and the Embedding representation of the tail entity; these three representations then pass through the classifier, which finally produces the score of the corresponding triple.
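A minimal sketch of this fine-ranking scorer is given below, assuming a HuggingFace-style BERT backbone; the single-logit linear head stands in for the unspecified classifier and is an assumption.

```python
# Fine-ranking scorer sketch: [CLS] head [SEP] relation [SEP] tail [SEP] is fed
# to a BERT backbone; a linear head turns the [CLS] representation into a score.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
backbone = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(backbone.config.hidden_size, 1)

def fine_score(head_desc: str, rel_desc: str, tail_desc: str) -> torch.Tensor:
    text = f"[CLS] {head_desc} [SEP] {rel_desc} [SEP] {tail_desc} [SEP]"
    inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
    hidden = backbone(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
    return classifier(hidden[:, 0]).squeeze(-1)         # score read off the CLS position
```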
After the contrastive learning idea is introduced, the total loss (Final Loss) when training the C-BERT model is the sum of the structural loss (Struct Loss) and the contrastive learning loss (Contrast Loss). In this way the whole model fully learns the semantics, and comparing negative samples against positive samples further improves the model's representation ability.
The structural loss is calculated as:

l_s = f(h, r, t) = ||h + r - t||_2

where l_s is the structural loss, f(·) is the structural loss function (an L2 loss), h is the Embedding representation of the head entity in the predicted triple, r is the Embedding representation of the relation, and t is the Embedding representation of the tail entity. That is, the vector of the head entity plus the relation vector, minus the vector of the tail entity; the smaller this value, the closer the three are.
In addition, the contrastive learning loss is calculated as:

l_c = -log [ exp((φ(h, r, t) - γ)/τ) / ( exp((φ(h, r, t) - γ)/τ) + Σ_{i=1}^{N} exp(φ(h'_i, r, t'_i)/τ) ) ]

where l_c is the contrastive learning loss, φ(h, r, t) is the score of the positive sample (h, r, t) output by the pre-trained language model, γ is the margin hyperparameter, τ is the temperature coefficient (a hyperparameter whose empirical value is adjusted according to the training effect), N is the number of negative samples corresponding to the positive sample, and φ(h'_i, r, t'_i) is the score of the i-th negative sample (h'_i, r, t'_i) output by the pre-trained language model.

Here h'_i is an entity of the knowledge graph other than h, and t'_i is an entity of the knowledge graph other than t. The notation (h'_i, r, t'_i) denotes a negative sample derived from the positive sample (h, r, t); it does not mean that h and t are replaced simultaneously. When constructing a negative sample, only h or only t is randomly replaced. Contrastive learning between the negative and positive samples ensures that the score of the positive sample is far greater than the scores of all the negatives.
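Putting the two losses together, a hedged sketch follows, matching the formulas as reconstructed above; the exact functional forms in the patent's formula images are not reproduced in the text, so these are assumptions.

```python
# Sketch of the training losses: L2 structural loss over (h, r, t) embeddings,
# plus a margin-and-temperature contrastive loss over one positive and N negatives.
import torch

def structural_loss(h, r, t):
    return torch.linalg.vector_norm(h + r - t, dim=-1)   # l_s = ||h + r - t||_2

def contrastive_loss(pos_score, neg_scores, gamma=1.0, tau=0.05):
    """pos_score: scalar tensor phi(h, r, t); neg_scores: (N,) tensor of negatives."""
    logits = torch.cat([((pos_score - gamma) / tau).reshape(1), neg_scores / tau])
    return -torch.log_softmax(logits, dim=0)[0]          # positive against all negatives

# Total loss during C-BERT training, as stated above:
# final_loss = structural_loss(h, r, t) + contrastive_loss(pos_score, neg_scores)
```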
Step 3, complete the knowledge graph to be completed using the predicted triples in the Top-N results.
The number N of the Top-N results is set according to actual needs, for example to 5, 10, or 15. After the Top-N results are obtained, all the predicted triples in them may be added directly to the knowledge graph to be completed as real triples, or a further selected part (such as the first, or the first and second) may be added as real triples.
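A trivial sketch of this final merge step (the graph is modeled as a set of (head, relation, tail) tuples, which is an assumption for the example):

```python
# Add all Top-N predicted triples, or only the first few, to the graph as real triples.
def complete_graph(graph: set, top_n_predictions: list, take=None) -> set:
    for triple in top_n_predictions[:take]:   # take=None keeps the full Top-N list
        graph.add(triple)
    return graph
```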
To verify the advantages of the knowledge graph completion method of this embodiment, a comparative experiment is provided as follows:
The public WN18RR dataset is taken as the experimental data basis, the algorithms are run on a single A100 GPU, and the knowledge graph completion method of this embodiment and an existing SOTA model are the experimental subjects. The completion time, the proportion of correct results in the top 10, and the MR metric are recorded. The experimental results are shown in Table 1.
Table 1 Experimental results

Model                       hits@10    MR      TIME USED
SOTA model                  0.817      35      360H
Method of the application   0.865      6.62    17H
In the table, hits@10 denotes the proportion of correct results within the top 10 (larger is better); MR (Mean Rank) is a ranking metric (smaller is better); TIME USED is the time consumed.
As can be seen from Table 1, the amount of inference computation of the method is greatly reduced compared with existing SOTA models (such as LASS and SimKGC), and depending on the dataset size the computation time can be reduced by 95%. In addition, the application applies a random-mask contrastive learning method and introduces the CLS/SEP sequence format into the inference input, further improving ranking over the Top candidate set; as the table shows, the final inference effect of the contrastively trained C-BERT is far better than that of the current SOTA model, so the method also surpasses the current SOTA model on ranking effect alone. In summary, the application shortens inference time by 95% and improves hits@10 accuracy by 5.8%, greatly improving inference efficiency, saving substantial computational resources, and improving inference accuracy.
In another embodiment, the application further provides a knowledge graph completion device based on the fusion of embedded vectors and transfer learning, comprising a processor and a memory storing a number of computer instructions which, when executed by the processor, implement the steps of the knowledge graph completion method based on the fusion of embedded vectors and transfer learning.
For specific limitations of the knowledge graph completion device based on the fusion of embedded vectors and transfer learning, reference may be made to the limitations of the corresponding method above, which are not repeated here.
The memory and the processor are electrically connected, directly or indirectly, for data transmission or interaction; for example, they may be electrically connected through one or more communication buses or signal lines. The memory stores a computer program executable on the processor, and the processor implements the method of the embodiments of the present application by running the computer program stored in the memory.
The memory may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), or Electrically Erasable Programmable Read-Only Memory (EEPROM). The memory is used to store a program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip with data processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, which can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.
The above examples represent only several embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, and these all fall within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A knowledge graph completion method based on the fusion of embedded vectors and transfer learning, characterized by comprising the following steps:
performing coarse-ranking scoring on the predicted triples in the knowledge graph to be completed using the embedded vectors of a shallow network, and selecting the Top-k results;
performing fine-ranking scoring on the predicted triples in the Top-k results using a pre-trained language model obtained by transfer learning, and selecting the Top-N results;
and completing the knowledge graph to be completed using the predicted triples in the Top-N results.
2. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 1, wherein the input of the pre-trained language model is a sequence constructed from CLS, Head Description, SEP, Relation Description, SEP, Tail Description, and SEP, where CLS is a classification token, SEP is a segmentation token, Head Description is the head entity in the predicted triple, Relation Description is the relation in the predicted triple, and Tail Description is the tail entity in the predicted triple.
3. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 1, wherein the backbone model in the pre-trained language model is a BERT model, and the BERT model is connected to a classifier to output the fine-ranking score.
4. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 1, wherein, when training the pre-trained language model, the triples in a known knowledge graph are taken as real triples; for each real triple, the head entity or the tail entity is replaced with another entity of the known knowledge graph to obtain virtual triples, and the pre-trained language model is trained by contrastive learning with the real triples as positive samples and the virtual triples as negative samples.
5. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 4, wherein the total loss when training the pre-trained language model is the sum of the structural loss and the contrastive learning loss.
6. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 5, wherein the structural loss is calculated as:

l_s = f(h, r, t) = ||h + r - t||_2

where l_s is the structural loss, f(·) is the structural loss function, h is the Embedding representation of the head entity in the predicted triple, r is the Embedding representation of the relation in the predicted triple, and t is the Embedding representation of the tail entity in the predicted triple.
7. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 5, wherein the contrastive learning loss is calculated as:

l_c = -log [ exp((φ(h, r, t) - γ)/τ) / ( exp((φ(h, r, t) - γ)/τ) + Σ_{i=1}^{N} exp(φ(h'_i, r, t'_i)/τ) ) ]

where l_c is the contrastive learning loss, φ(h, r, t) is the score of the positive sample output by the pre-trained language model, γ is the margin hyperparameter, τ is the temperature coefficient, N is the number of negative samples corresponding to the positive sample, and φ(h'_i, r, t'_i) is the score of the i-th negative sample output by the pre-trained language model.
8. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 1, wherein the number k of the Top-k results is determined from all_node_num, the number of nodes in the knowledge graph to be completed, and a preset number threshold a.
9. The knowledge graph completion method based on the fusion of embedded vectors and transfer learning of claim 1, wherein completing the knowledge graph to be completed using the predicted triples in the Top-N results comprises: adding all or part of the predicted triples in the Top-N results to the knowledge graph to be completed as real triples.
10. A knowledge graph completion device based on the fusion of embedded vectors and transfer learning, comprising a processor and a memory storing a number of computer instructions, characterized in that the computer instructions, when executed by the processor, implement the steps of the knowledge graph completion method based on the fusion of embedded vectors and transfer learning according to any one of claims 1 to 9.
CN202310575362.2A 2023-05-17 2023-05-17 Knowledge graph completion method and device based on fusion of embedded vector and transfer learning Pending CN116701647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310575362.2A CN116701647A (en) 2023-05-17 2023-05-17 Knowledge graph completion method and device based on fusion of embedded vector and transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310575362.2A CN116701647A (en) 2023-05-17 2023-05-17 Knowledge graph completion method and device based on fusion of embedded vector and transfer learning

Publications (1)

Publication Number Publication Date
CN116701647A true CN116701647A (en) 2023-09-05

Family

ID=87830336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310575362.2A Pending CN116701647A (en) 2023-05-17 2023-05-17 Knowledge graph completion method and device based on fusion of embedded vector and transfer learning

Country Status (1)

Country Link
CN (1) CN116701647A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094395A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Method, device and computer storage medium for complementing knowledge graph
CN117094395B (en) * 2023-10-19 2024-02-09 腾讯科技(深圳)有限公司 Method, device and computer storage medium for complementing knowledge graph


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination