CN114077676B

CN114077676B - Knowledge graph noise detection method based on path confidence

Info

Publication number: CN114077676B
Application number: CN202111393836.9A
Authority: CN
Inventors: 马江涛; 周辰宇; 王艳军; 李端阳; 贾泽臣; 马宇科; 李霆; 卢威光; 张蓓蕾; 李清扬; 赵一帆
Original assignee: Henan Tupu Information Technology Co ltd; Zhengzhou University of Light Industry
Current assignee: Henan Tupu Information Technology Co ltd; Zhengzhou University of Light Industry
Priority date: 2021-11-23
Filing date: 2021-11-23
Publication date: 2022-09-30
Anticipated expiration: 2041-11-23
Also published as: CN114077676A

Abstract

The invention provides a knowledge graph noise detection method based on path confidence, comprising the following steps: first, initializing the triples, finding all paths of all triples, embedding each triple of each path with the TransE translation model, and representing all paths of the triples as path embedding sequences, wherein every two adjacent triples in a path embedding sequence form a node; second, sequentially inputting the nodes into the CPLL to calculate the confidence score of each node in each path; third, inputting the node confidence scores of each path into a Bi-GRU to obtain a score matrix of each path; and finally, taking the L2 norm of the score matrix of each path as the path confidence, and taking the score matrix with the highest path confidence as the optimal embedding matrix of the triple. The invention combines the path-based method with the rule-based method and improves the efficiency of detecting noise in the knowledge graph, thereby improving the quality of the knowledge graph.

Description

Knowledge graph noise detection method based on path confidence

Technical Field

The invention relates to the technical field of knowledge graphs, in particular to a knowledge graph noise detection method based on path confidence.

Background

Knowledge graphs now play an important role in artificial intelligence tasks. However, manually or automatically constructed knowledge graphs have a number of quality issues and often contain erroneous or missing triples. Noise in a knowledge graph may be caused by human error or by errors in the data, and most noise appears as erroneous entities or relationships in triples. A growing number of researchers are focusing on knowledge graph noise and have proposed many solutions.

Noise detection methods for knowledge graphs can be broadly divided into path-based methods and rule-based methods. Path-based methods start from translation models such as TransE, TransH, and TransR, which are mostly used for knowledge graph embedding and completion but can also be used to detect noise in a knowledge graph. The PaTyBRED model proposed by Melo et al. incorporates type and path features into a local relationship classifier and keeps a specific path for each relationship to indicate whether a triple is erroneous. Xie et al. propose the CKRL model, which uses the local and global information of triples to represent the probability that a triple is erroneous. However, path-based methods are weak at finding noise and are not suitable for knowledge graphs containing complex relationships. Rule-based methods generally have a stronger noise detection capability than path-based methods. Brocheler et al. propose the PSL model, which extracts the most likely correct triples from ambiguous triples using first-order predicate logic and weighted rules. Abedini et al. propose Correction Tower, which identifies discrete, inconsistent, and erroneous relationships in triples in three steps. However, rule-based methods lack the ability to represent the knowledge graph: after a rule-based method detects and removes noise, the knowledge graph still has to be mapped into a continuous vector space before it can be conveniently used in downstream tasks.

If the path-based approach and the rule-based approach are combined, noise can be found and a noise-free knowledge graph representation can be constructed at the same time. Specifically, rules are first applied along the paths of a triple to screen out effective features. These features must distinguish noise from correct information, where the correct information includes global triple information and local triple information. The features are then used to complete both noise detection and triple representation, which improves the quality of the knowledge graph and the user experience.

Disclosure of Invention

The invention provides a method for detecting noise of a knowledge graph based on path confidence, which is used for solving the technical problems that the existing method based on the path is weak in noise finding capability and is not suitable for processing the knowledge graph containing complex relationships and the rule-based method lacks the capability of knowledge representation.

The technical scheme of the invention is realized as follows:

A knowledge graph noise detection method based on path confidence includes the following steps:

Step one: initializing the number of triples, finding all paths of all triples, embedding each triple of each path with the TransE translation model, and representing all paths of the triples as path embedding sequences, wherein every two adjacent triples in a path embedding sequence form a node, and the number of nodes is n;

Step two: sequentially inputting the nodes into a confidence- and correlation-based probabilistic logic layer (CPLL), and calculating a confidence score matrix of each node in each path;

Step three: inputting the confidence score matrices of all nodes in each path into the Bi-GRU to obtain a score matrix of each path;

Step four: taking the L2 norm of the score matrix of each path as the path confidence, and taking the score matrix with the highest path confidence as the optimal embedding matrix of the triple.
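The node construction in step one can be sketched as follows. This is only an illustration in plain Python: the helper name `path_nodes` and the sample entity and relation names are hypothetical, and a path is assumed to be an ordered list of (head, relation, tail) triples in which the tail of each triple is the head of the next.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def path_nodes(path: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Pair every two adjacent triples on a path into one node."""
    # Adjacent triples must chain: the tail of one is the head of the next.
    for left, right in zip(path, path[1:]):
        if left[2] != right[0]:
            raise ValueError(f"path is not connected at {left} -> {right}")
    # A path of m triples yields m - 1 adjacent pairs. A path made of a
    # single triple yields no pair here; that special case is not modelled.
    return list(zip(path, path[1:]))


# Hypothetical path; the names are illustrative only.
path = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
nodes = path_nodes(path)
print(len(nodes))
```

Each returned pair corresponds to one node T that is later fed to the CPLL.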

Preferably, in the second step, the specific method is as follows:

S21, initializing the input node T:

$$T = N'_i \cdot (N'_{i+1})^{\top} \tag{1}$$

$$N'_i = (x'_i, r'_i, x'_{i+1}) \tag{2}$$

$$N'_{i+1} = (x'_{i+1}, r'_{i+1}, x'_{i+2}) \tag{3}$$

wherein $N'_i$ denotes the embedding matrix of the i-th triple on the path, $N'_{i+1}$ denotes the embedding matrix of the (i+1)-th triple on the path, $(N'_{i+1})^{\top}$ denotes the transpose of $N'_{i+1}$, $x'_i$, $x'_{i+1}$, and $x'_{i+2}$ all denote entities, and $r'_i$ and $r'_{i+1}$ both denote relationships;

S22, multiplying the node T by the parameter matrix $W_0$ to obtain the global confidence between the triples, namely the global triple confidence:

$$GTT(i, i+1) = T \cdot W_0 \tag{4}$$

wherein GTT(i, i+1) is the global triple confidence;

S23, the node T enters the Separate & Padding layer, where the sub-matrix blocks $T_1$, $T_2$, $T_3$ on the diagonal of T are separated out; $T_1$, $T_2$, $T_3$ are then multiplied by the parameter matrices $W_1$, $W_2$, $W_3$ respectively to obtain D, E, and F; correlation-based logic operations are performed on D, E, and F, and the results are added to obtain the local confidence between the triples, namely the local triple confidence:

$$T_1 = x'_i \cdot x'_{i+1}, \quad T_2 = r'_i \cdot r'_{i+1}, \quad T_3 = x'_{i+1} \cdot x'_{i+2} \tag{5}$$

$$D = T_1 \cdot W_1, \quad E = T_2 \cdot W_2, \quad F = T_3 \cdot W_3 \tag{6}$$

wherein MIN(·) denotes the minimum value of a matrix, MAX(·) denotes the maximum value of a matrix, 1 denotes a matrix whose elements are all 1, and -1 denotes a matrix whose elements are all -1; the operator symbols respectively represent different logic operations, and LTT(i, j) is the local triple confidence;

S24, multiplying the global triple confidence by the local triple confidence to obtain the confidence score $G_i$ of the node T:

$$G_i = GTT(i, i+1) \cdot LTT(i, i+1) \tag{12}$$
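Equations (1) to (6) can be sketched in plain Python as follows. The shapes are an assumption that the text does not state: each entity or relation embedding is taken to be a d-dimensional vector, N' is taken to be the concatenation of its three vectors, and T is their 3d×3d outer product, so that the diagonal d×d blocks of T are exactly T1, T2, and T3 of equation (5). The helper names and the identity parameter matrices are placeholders. The logic operations that produce LTT are not sketched, because their formulas are not given here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def outer(u: Vector, v: Vector) -> Matrix:
    """Outer product u * v^T."""
    return [[a * b for b in v] for a in u]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product A * B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def diag_block(T: Matrix, k: int, d: int) -> Matrix:
    """k-th d x d block on the diagonal of T."""
    return [row[k * d:(k + 1) * d] for row in T[k * d:(k + 1) * d]]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def node_terms(x_i, r_i, x_i1, r_i1, x_i2, W0, W1, W2, W3):
    d = len(x_i)
    n_i = x_i + r_i + x_i1        # N'_i, equation (2), concatenated (assumed)
    n_i1 = x_i1 + r_i1 + x_i2     # N'_{i+1}, equation (3)
    T = outer(n_i, n_i1)          # equation (1)
    GTT = matmul(T, W0)           # equation (4)
    T1, T2, T3 = (diag_block(T, k, d) for k in range(3))  # equation (5)
    D, E, F = matmul(T1, W1), matmul(T2, W2), matmul(T3, W3)  # equation (6)
    return T, GTT, D, E, F


# Illustrative 2-dimensional embeddings; identity matrices stand in for the
# learned parameters W0..W3.
x_i, r_i, x_i1 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
r_i1, x_i2 = [2.0, 0.0], [0.0, 3.0]
T, GTT, D, E, F = node_terms(x_i, r_i, x_i1, r_i1, x_i2,
                             identity(6), identity(2), identity(2), identity(2))
print(D, E, F)
```

With identity parameters, GTT equals T and D, E, F equal the diagonal blocks, which makes the block extraction easy to check by hand.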

Preferably, in step three, the specific method is as follows:

S31, taking the confidence score $G_i$ of each node and the confidence scores $G_{i+1}$ and $G_{i-1}$ of its neighboring nodes as the input of the bidirectional GRU, the i-th forward GRU and backward GRU are calculated respectively as follows:

wherein the first quantity represents the output result of the forward GRU, the second quantity represents the output result of the backward GRU, and GRU(·) denotes a gated recurrent unit network.

S32, performing concatenation, linear, and normalization operations on the final outputs of the forward GRU and the backward GRU to obtain the path score matrix:

wherein h(p) represents the output result of the gated recurrent network, i.e., the path score matrix; the two GRU terms represent the final output result of the forward GRU and the final output result of the backward GRU, respectively; concat(·) represents the concatenation function, line(·) represents the linear function, and softmax(·) represents the normalization function.
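The concatenation, linear, and normalization operations of S32 can be illustrated with a small sketch. It assumes the final forward and backward GRU outputs are already available as plain vectors and treats the path score as a vector rather than a matrix for brevity; the names `path_score`, `weight`, and `bias` are hypothetical, and the GRU itself is not implemented here.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def path_score(h_fwd: List[float], h_bwd: List[float],
               weight: List[List[float]], bias: List[float]) -> List[float]:
    """softmax(line(concat(h_fwd, h_bwd))) with a hypothetical linear layer."""
    joined = h_fwd + h_bwd  # concat
    linear = [sum(w * v for w, v in zip(row, joined)) + b
              for row, b in zip(weight, bias)]  # line
    return softmax(linear)  # normalization


# Illustrative values only.
h_fwd, h_bwd = [0.2, -0.1], [0.4, 0.3]
weight = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0]
scores = path_score(h_fwd, h_bwd, weight, bias)
print(scores)
```

The softmax makes the entries of the score non-negative and sum to 1, which is what "normalization" is taken to mean in this sketch.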

Preferably, in step four, the path confidence and the optimal triple are calculated as follows:

When the path confidence of path $p_j$ is the highest,

$$h(f_k) = h(p_j) \tag{17}$$

wherein g(p) represents the path confidence, $h(p_j)$ represents the path score matrix, the norm symbol denotes the L2 norm of a matrix, $g(f_k)$ represents the maximum path confidence, and $h(f_k)$ represents the optimal path score matrix of the triple, which is also the optimal embedding matrix of the triple.
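Step four can be sketched as below. The sketch interprets "the L2 norm of a matrix" as the entrywise (Frobenius) norm, which is an assumption; a spectral norm would be another reading. The function names and the sample score matrices are illustrative.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def l2_norm(matrix: Matrix) -> float:
    """Entrywise L2 (Frobenius) norm, assumed here for g(p)."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def best_path(score_matrices: List[Matrix]) -> Tuple[int, float, Matrix]:
    """Return the index, confidence, and score matrix of the most confident path."""
    confidences = [l2_norm(m) for m in score_matrices]
    k = max(range(len(confidences)), key=confidences.__getitem__)
    return k, confidences[k], score_matrices[k]


# Three hypothetical path score matrices h(p_1), h(p_2), h(p_3) of one triple.
h_p = [[[0.6, 0.4]], [[0.9, 0.1]], [[0.5, 0.5]]]
k, g_fk, h_fk = best_path(h_p)
print(k, g_fk)
```

Here `h_fk` plays the role of h(f_k), the score matrix chosen to represent the triple.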

Preferably, the designed loss function is as follows:

$$L = \sum_{(h,r,t) \in \{T' \cup T''\}} \log\left[1 + \exp\left(l_{(h,r,t)} \cdot P(h,r,t)\right)\right] \tag{18}$$

wherein exp(·) denotes the exponential function with the natural constant e as its base, log(·) denotes the logarithmic function, L denotes the loss function, P(h, r, t) denotes the path from the head entity h to the tail entity t, r denotes the relation, T' denotes the set of valid triples, and T'' denotes the set of invalid triples; an invalid triple is a triple formed by randomly replacing the head entity or the tail entity of an original triple, and a valid triple is an original triple.
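Equation (18) is a sum of softplus terms and can be evaluated as sketched below. The text does not define the coefficient l or how P(h, r, t) is reduced to a number, so the sketch simply takes both as given scalars per triple; the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def softplus(z: float) -> float:
    """log(1 + exp(z)), written to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def path_loss(samples: Iterable[Tuple[float, float]]) -> float:
    """Sum of log[1 + exp(l * P)] over (l, P) pairs, one pair per triple.

    Both l and P are assumed to be supplied as scalars; their exact
    definitions are not given in the text.
    """
    return sum(softplus(l * p) for l, p in samples)


# Illustrative values only.
print(path_loss([(1.0, 0.0)]))
print(path_loss([(-1.0, 2.0), (1.0, -2.0)]))
```

Writing the term as `max(z, 0) + log1p(exp(-|z|))` gives the same value as `log(1 + exp(z))` without overflowing when `z` is large.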

Compared with the prior art, the invention has the following beneficial effects:

1) Building on the path-based internal structure information of the knowledge graph, the invention introduces a correlation-based probabilistic model and fuses it into a neural network structure to detect noise in the knowledge graph and to perform knowledge graph representation.

2) The invention constructs a path confidence network to calculate the global triple confidence and the local triple confidence, and obtains the path confidence and the path score matrix of a triple by combining a bidirectional gated recurrent network; the path confidence is used to determine whether the triple is correct, and the path score matrix is used to represent the triple.

3) The invention solves the problem of knowledge graph noise, completes the representation of the knowledge graph and obtains good effect in the detection test of the knowledge graph noise.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a sub-graph of all paths from entity "champions" to entity "teams";

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a flow chart of the proposed model of the present invention;

FIG. 4 is a block diagram of a correlation-based probabilistic logic model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art based on the embodiments of the present invention without inventive step, are within the scope of the present invention.

In general, a relationship between triples in a knowledge graph can be expressed in the form of a path. When a triple f is expressed as (h, r, t), the path P = (h, R, t) from the head entity h to the tail entity t cannot be ignored. Here R includes at least one relationship and possibly several entities; these entities and relationships may form several triples N, which the present invention calls path triples. Every two adjacent triples constitute a node. Moreover, R ≥ r; when R = r, the path P is equal to f, which indicates that f is the shortest path.

There may be multiple paths from the head entity to the tail entity, but some paths are incorrect, some are incomplete, and the information in some paths is not suitable for representing the triple. FIG. 1 shows the set of all paths of f₁ = ("champion", "joining", "team"), i.e., the set of paths from the entity "champion" to the entity "team". In FIG. 1, f₁ is the shortest path and is also the triple itself, while f₂ and f₃ are erroneous triples; the correct triple is ("basketball game", "equals", "match"). Thus, any path that combines f₂ or f₃ with other triples is noisy. These noisy paths must undergo some processing before their path score matrices can be used to represent the triples.

However, most path-based knowledge graph representation methods do not exclude noise contained in the path. But the rule-based approach is well suited to solve the problem of noise contained in the path. Specifically, a confidence level is given to each node in the path to indicate how likely the node is correct, and then a path confidence level is obtained by probability combination, and the path confidence level indicates how likely the path is correct. If the path from the head entity to the tail entity only has the triplet itself, then the triplet is the only node in the path. At this time, the triple confidence, the node confidence and the path confidence are equal. In fact, there may be multiple paths, and it is most appropriate to take the path with the highest path confidence to represent the triplet. If the triples are represented in the form of a matrix, the path score matrix is obtained by the probability combination between the node confidence degrees, and the L2 norm of the path score matrix is used as the confidence degree of the path.

As shown in fig. 2, an embodiment of the present invention provides a method for detecting noise in a knowledge-graph based on path confidence, which includes the following specific steps:

Step one: for a knowledge graph containing E triples, find all paths of all triples; initialize the number of triples as E and traverse all triples; then traverse all paths of each triple, with the number of paths initialized as P. Embed each triple of each path with the TransE translation model, and represent all paths of the triples as path embedding sequences; every two adjacent triples in a path embedding sequence form a node, the number of nodes is n, and the number of nodes is initialized as N. The structure of the model is shown in FIG. 3.
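As an illustration of the embedding part of step one, the sketch below turns a path into a path embedding sequence and computes the standard TransE distance ‖h + r − t‖ for each triple. The embedding tables `entity_vec` and `relation_vec` are made-up values chosen so that h + r = t holds exactly; in practice they would be learned by TransE training, which is not shown.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
entity_vec: Dict[str, Vector] = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
relation_vec: Dict[str, Vector] = {"r1": [0.0, 1.0], "r2": [-1.0, 0.0]}


def transe_distance(h: Vector, r: Vector, t: Vector) -> float:
    """TransE models a triple as h + r ~ t; smaller distance means a better fit."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))


def embed_path(path: List[Triple]) -> List[Tuple[Vector, Vector, Vector]]:
    """Replace every (head, relation, tail) on the path by its embeddings."""
    return [(entity_vec[h], relation_vec[r], entity_vec[t]) for h, r, t in path]


sequence = embed_path([("a", "r1", "b"), ("b", "r2", "c")])
distances = [transe_distance(h, r, t) for h, r, t in sequence]
print(distances)
```

Each element of `sequence` corresponds to one triple embedding N' on the path, and adjacent elements are what the method pairs into nodes.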

Step two: as shown in FIG. 4, the nodes are sequentially input to the correlation-based probabilistic logic layer (CPLL), and the confidence score of each node in each path is calculated;

In step two, the specific method is as follows:

S21, initializing the input node T:

$$T = N'_i \cdot (N'_{i+1})^{\top} \tag{1}$$

$$N'_i = (x'_i, r'_i, x'_{i+1}) \tag{2}$$

$$N'_{i+1} = (x'_{i+1}, r'_{i+1}, x'_{i+2}) \tag{3}$$

wherein $N'_i$ and $N'_{i+1}$ denote the embedding matrices of the i-th and (i+1)-th triples on the path, respectively, $(N'_{i+1})^{\top}$ denotes the transpose of $N'_{i+1}$, $x'_i$, $x'_{i+1}$, and $x'_{i+2}$ denote entities, and $r'_i$ and $r'_{i+1}$ denote relationships.

S22, multiplying the node T by the parameter matrix $W_0$ to obtain the global confidence between the triples, namely the global triple confidence:

$$GTT(i, i+1) = T \cdot W_0 \tag{4}$$

where GTT(i, i+1) is the global triple confidence.

S23, the node T enters the Separate & Padding layer, where the sub-matrix blocks $T_1$, $T_2$, $T_3$ on the diagonal of T are separated out; $T_1$, $T_2$, $T_3$ are then multiplied by the parameter matrices $W_1$, $W_2$, $W_3$ respectively to obtain D, E, and F; correlation-based logic operations are performed on D, E, and F, and the results are added to obtain the local confidence between the triples, namely the local triple confidence:

$$D = T_1 \cdot W_1, \quad E = T_2 \cdot W_2, \quad F = T_3 \cdot W_3 \tag{6}$$

wherein the operator symbols respectively represent different logic operations, and LTT(i, j) is the local triple confidence.

S24, multiplying the global triple confidence by the local triple confidence to obtain the confidence score $G_i$ of the node T:

$$G_i = GTT(i, i+1) \cdot LTT(i, i+1) \tag{12}$$

Step three: inputting the confidence scores of all nodes in each path, in order, into a Bi-GRU (bidirectional gated recurrent unit network) to obtain a score matrix of each path;

In step three, the specific method is as follows:

wherein the first quantity represents the output result of the forward GRU.

S32, in order to retain as much effective information as possible, concatenation, linear, and normalization operations are performed on the final outputs of the forward GRU and the backward GRU to obtain the path score matrix:

wherein the first quantity represents the final output result of the forward GRU.

Step four: taking the L2 norm of the score matrix of each path as the path confidence, and taking the score matrix with the highest path confidence as the optimal triple.

In step four, the path confidence and the optimal triple are calculated as follows:

When the path confidence of path $p_j$ is the highest,

$$h(f_k) = h(p_j) \tag{17}$$

To train the model proposed by the present invention, the loss function is designed as follows:

$$L = \sum_{(h,r,t) \in \{T' \cup T''\}} \log\left[1 + \exp\left(l_{(h,r,t)} \cdot P(h,r,t)\right)\right] \tag{18}$$

wherein exp(·) denotes the exponential function with the natural constant e as its base, log(·) denotes the logarithmic function, L denotes the loss function, P(h, r, t) denotes the path from the head entity h to the tail entity t, r denotes the relation, T' denotes the set of valid triples, and T'' denotes the set of invalid triples; an invalid triple is a triple formed by randomly replacing the head entity or the tail entity of an original triple, and a valid triple is an original triple.

The present invention uses three benchmark datasets for knowledge graph noise detection, FB15K, WN18, and NELL995, which are constructed from information extracted from the Freebase, WordNet, and NELL knowledge bases, respectively. Their statistics are listed in Table 1.

Table 1. Statistics of the benchmark datasets FB15K, WN18, and NELL995

To evaluate the performance of the model, noise needs to be added to the data set described above. The basic method is as follows: for a given positive triplet (h, r, t), one of the head or tail entities is randomly switched to form a negative triplet (h ', r, t) or (h, r, t') as noise. In this way, a data set containing 10%, 20%, 40% noise is constructed for each reference data set. These noisy data sets share the same entity, relationship, validation, and test sets as the original data set, and all the noise generated is fused into the original training set.
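The noise construction described above can be sketched as follows. Two details are assumptions rather than statements from the text: the noise ratio is taken relative to the size of the original training set, and corrupted triples that already exist in the data are discarded. The function names and the toy data are hypothetical.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]


def corrupt(triple: Triple, entities: List[str], rng: random.Random) -> Triple:
    """Replace either the head or the tail with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))


def add_noise(train: List[Triple], entities: List[str],
              ratio: float, seed: int = 0) -> List[Triple]:
    """Return the training set with ratio * len(train) negative triples added."""
    rng = random.Random(seed)
    known = set(train)
    n_noise = int(len(train) * ratio)
    noise: List[Triple] = []
    while len(noise) < n_noise:
        candidate = corrupt(rng.choice(train), entities, rng)
        if candidate not in known:  # skip triples that already exist (assumed rule)
            known.add(candidate)
            noise.append(candidate)
    return train + noise


# Toy data for illustration only.
entities = ["a", "b", "c", "d", "e"]
train = [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d"),
         ("d", "r", "e"), ("e", "r", "a")]
noisy = add_noise(train, entities, ratio=0.4)
print(len(noisy))
```

Calling `add_noise` with ratios 0.1, 0.2, and 0.4 would give the three noise levels mentioned above, while the validation and test sets stay untouched.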

The invention takes the L2 norm of the path score matrix as the path confidence, and all triples in the training set are then ranked according to these path confidences. The greater the path confidence of a triple, the more effective the representation of the triple.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A knowledge graph noise detection method based on path confidence is characterized by comprising the following steps:

in step two, the specific method is as follows:

S21, initializing the input node T:

$$GTT(i, i+1) = T \cdot W_0 \tag{4}$$

wherein GTT(i, i+1) is the global triple confidence;

S23, the node T enters the Separate & Padding layer, where the sub-matrix blocks $T_1$, $T_2$, $T_3$ on the diagonal of T are separated out; $T_1$, $T_2$, $T_3$ are then multiplied by the parameter matrices $W_1$, $W_2$, $W_3$ respectively to obtain D, E, and F; correlation-based logic operations are performed on D, E, and F, and the results are added to obtain the local confidence between the triples, namely the local triple confidence:

$$D = T_1 \cdot W_1, \quad E = T_2 \cdot W_2, \quad F = T_3 \cdot W_3 \tag{6}$$

wherein the operator symbols respectively represent different logic operations, and LTT(i, j) is the local triple confidence;

$$G_i = GTT(i, i+1) \cdot LTT(i, i+1) \tag{12}$$

2. The knowledge graph noise detection method based on path confidence as claimed in claim 1, wherein in step three, the specific method is as follows:

wherein the first quantity represents the output result of the forward GRU, the second quantity represents the output result of the backward GRU, and GRU(·) denotes a gated recurrent unit network;

wherein h(p) represents the output result of the gated recurrent network, i.e., the path score matrix, and the first GRU term represents the final output result of the forward GRU,

3. The knowledge graph noise detection method based on path confidence as claimed in claim 2, wherein in step four, the path confidence and the optimal triple are calculated as follows:

When the path confidence of path $p_j$ is the highest,

$$h(f_k) = h(p_j) \tag{17}$$

wherein g(p) represents the path confidence, and $h(p_j)$ represents the path score matrix,

4. The knowledge graph noise detection method based on path confidence as claimed in claim 3, wherein the designed loss function is as follows:

$$L = \sum_{(h,r,t) \in \{T' \cup T''\}} \log\left[1 + \exp\left(l_{(h,r,t)} \cdot P(h,r,t)\right)\right] \tag{18}$$