CN111078896A - Knowledge base completion method based on PRMATC algorithm - Google Patents

Knowledge base completion method based on PRMATC algorithm

Info

Publication number
CN111078896A
CN111078896A
Authority
CN
China
Prior art keywords
item
knowledge base
algorithm
domain
relation
Prior art date
Legal status
Granted
Application number
CN201911308709.7A
Other languages
Chinese (zh)
Other versions
CN111078896B (en)
Inventor
汪璟玢
张梨贤
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911308709.7A priority Critical patent/CN111078896B/en
Publication of CN111078896A publication Critical patent/CN111078896A/en
Application granted granted Critical
Publication of CN111078896B publication Critical patent/CN111078896B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools, e.g. ontology or thesauri; ontology
    • G06F16/2465 — Query processing support for facilitating data mining operations in structured databases
    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/284 — Relational databases
    • G06F16/285 — Clustering or classification
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a knowledge base completion method based on a PRMATC algorithm, which comprises the following steps: step S1, importing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed-cluster Neo4j database; step S2, constructing and training a BILSTM-CRF model; step S3, identifying and classifying the entities on both sides of each relation with the trained BILSTM-CRF model, and converting the results to obtain the definition domain and value domain of the relation; step S4, improving the FP-Growth algorithm; step S5, mining the implicit strong association rules among the transactions; step S6, converting the obtained definition domains and value domains of the relations, together with the strong association rules, into Horn logic rules; and step S7, deriving new knowledge from the obtained Horn logic rules and adding the new knowledge to the knowledge base KB. The method can efficiently find Horn rules that represent the knowledge base, outperforms other rule mining systems in both the number and the accuracy of mined rules, and completes the knowledge base more effectively.

Description

Knowledge base completion method based on PRMATC algorithm
Technical Field
The invention relates to the field of mass data storage and reasoning under a knowledge graph, in particular to a knowledge base completion method based on a PRMATC algorithm.
Background
Mining Horn rules from a large-scale semantic network knowledge base and then using those rules to help infer and add the knowledge the base lacks is one of the most effective means of realizing dynamic growth of a knowledge base. Association rule mining is one of the important algorithms in the field of data mining; its aim is to mine the implicit relationships that exist among transactions. Conventional algorithms include the Apriori algorithm [1] and the FP-Growth algorithm [2]. Traditional association rule mining algorithms work well on small-scale data sets, but with the rapid development of internet technology in recent years, network data has grown explosively, and traditional algorithms suffer from problems such as a single node being unable to complete the computation and insufficient memory, so they cannot meet the requirements of large-scale network data.
Disclosure of Invention
In view of this, the present invention provides a knowledge base completion method based on the PRMATC algorithm, which can efficiently mine a set of Horn logic rules that represent the semantic information of a knowledge base and thus complete the knowledge base more effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
a knowledge base completion method based on a PRMATC algorithm comprises the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing the data balance grouping and the FP tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the definition domain and value domain of each obtained relation, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
Further, the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
Further, the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, wherein the forgetting, memorizing and outputting are from a previous hidden layer state ht-1And current input XtThe specific calculation formula 4 is determined.
Figure BDA0002323913680000021
In the formula (4), Xt、Ct、ht、ft、it、OtRespectively corresponding to the input, cell state, hidden layer state, forgetting gate and output gate of the model at the moment t; the word vector is input to the BILSTM layer, and the output values are the predicted scores for each label corresponding to each word in a sentence, which are input to the CRF layer.
Further, the CRF layer employs a linear conditional random field P (y | x) as shown in the following formula:
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )    (5)

In formula (5), λ_k and μ_l are weight coefficients, t_k and s_l are feature functions, and Z(x) is the normalization factor:

Z(x) = Σ_y exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )
The output of the BILSTM layer is used as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, the legal predicted label of each word is output.
Further, the step S3 is specifically:
step S31, each input triple is set to X = (x_1, x_2, ..., x_i, ..., x_n), and all possible prediction sequences y = (y_1, y_2, ..., y_i, ..., y_n) are obtained through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
S(X|y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}    (7)

In formula (7), P_{i, y_i} is the predicted score of output y_i at the i-th position and A is the transition probability matrix.
and step S32, computing the sequence y* with the maximum score, as shown in the following formula:

y* = argmax_{y∈Y} S(X|y)
step S33, converting through a relation type constraint conversion function to obtain a definition domain and a value domain of each relation in the knowledge base, wherein the relation type constraint conversion function f is as follows:
f({t_1, t_2, ..., t_i, ..., t_n}) = (p_d, p, p_r)
in the formula, t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j) denote fact triples of the relation p;

the entity classes on the two sides of the relation are converted according to the following to obtain the definition domain and value domain of the relation:

s_i SubClassOf El_si, o_i SubClassOf El_oi, s_j SubClassOf El_sj, o_j SubClassOf El_oj,
El_si SubClassOf Cf_si, El_oi SubClassOf Cf_oi, El_sj SubClassOf Cf_sj, El_oj SubClassOf Cf_oj,
p_d = Cf_si ∪ Cf_sj, p_r = Cf_oi ∪ Cf_oj

where El_si, El_oi, El_sj and El_oj respectively denote the sub-classes to which the entities s_i, o_i, s_j and o_j currently belong, and Cf_si, Cf_oi, Cf_sj and Cf_oj respectively denote the broad classes to which they belong.
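As a toy illustration of the sequence scoring in steps S31–S32, the following sketch scores every candidate label sequence with S(X|y) and picks the argmax. All emission and transition values below are made-up, and a real CRF decodes with Viterbi rather than brute-force enumeration:

```python
from itertools import product

def score(emissions, transitions, y):
    """S(X|y): sum of emission scores P[i][y_i] plus transition scores
    A[y_i][y_{i+1}], as in formula (7); START/STOP terms are omitted
    for simplicity (an assumption)."""
    s = sum(emissions[i][tag] for i, tag in enumerate(y))
    s += sum(transitions[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s

def best_sequence(emissions, transitions, labels):
    """y* = argmax over all label sequences (brute force for clarity)."""
    n = len(emissions)
    return max(product(labels, repeat=n),
               key=lambda y: score(emissions, transitions, y))

labels = ["B", "I", "O"]
emissions = [{"B": 2.0, "I": 0.1, "O": 0.5},   # per-position label scores
             {"B": 0.1, "I": 1.5, "O": 0.4}]
transitions = {"B": {"B": -1.0, "I": 1.0, "O": 0.0},
               "I": {"B": -1.0, "I": 0.5, "O": 0.0},
               "O": {"B": 0.2, "I": -5.0, "O": 0.1}}
y_star = best_sequence(emissions, transitions, labels)
```

With these values the highest-scoring sequence is ("B", "I"): its score 2.0 + 1.5 + 1.0 = 4.5 beats every alternative.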
Further, the optimized data balance grouping automatically discovers highly relevant relations through a clustering algorithm and then divides the relation paths related to each relation into the same partition, thereby achieving balanced and independent grouping of the data.
The step S4 specifically includes:
step S41: traverse the transactions T_i one by one, scanning each T_i from front to back;
step S42: according to the first item a_1, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it;
step S43: according to the block number and item a_i, first check whether an item exists that is identical to a_i and has the same ancestor nodes; if so, increase that item's count by 1, otherwise add the item to the specified block;
step S44: find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these together form the conditional pattern base of item m;
step S45: the conditional pattern base of m is <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>; similarly, the conditional pattern base of p is <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. The conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP tree is created, and the frequent item set of the item is mined.
Further, the step S6 is specifically:
step S61, taking the strong association rules mined through steps S3 and S4,

r_i ∧ r_j ⇒ r_z

together with the definition domains and value domains of the relations: r_i domain El_id, r_i range El_ir, r_j domain El_jd, r_j range El_jr, r_z domain El_zd, r_z range El_zr;

step S62, converting each strong association rule into a Horn rule according to the following formula:

r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z), where El_ir = El_jd, El_id = El_zd and El_jr = El_zr,

where El_id and El_ir respectively denote the definition domain and value domain of relation r_i, El_jd and El_jr those of relation r_j, and El_zd and El_zr those of relation r_z.
Compared with the prior art, the invention has the following beneficial effects:
the method can efficiently find the horns rule of the representative knowledge base, is better than other rule mining systems in terms of the quantity and accuracy of mining rules, and can better complement the knowledge base.
Drawings
FIG. 1 is a flow chart of a method in one embodiment of the present invention;
FIG. 2 is an exemplary diagram illustrating knowledge base completion using Horn logic rules in an embodiment of the present invention;
FIG. 3 is a block diagram of the PRMATC algorithm in accordance with one embodiment of the present invention;
FIG. 4 is a schematic diagram of the BILSTM-CRF model in accordance with an embodiment of the present invention;
FIG. 5 is a graph of inter-cluster overlap in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the optimized header table structure in an embodiment of the present invention;
FIG. 7 is a modified frequent pattern tree in an embodiment of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a knowledge base completion method based on the PRMATC algorithm, which includes the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing the data balance grouping and the FP tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the definition domain and value domain of each obtained relation, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
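Step S1 stores every fact triple in Neo4j. The patent does not show the import code; as a minimal hedged sketch, each triple <s, p, o> could be rendered as a Cypher MERGE statement so that repeated imports stay idempotent (the `Entity` node label and the statement shape are assumptions, not the patent's actual schema):

```python
def triple_to_cypher(s, p, o):
    """Build one idempotent Cypher MERGE statement that upserts a fact
    triple as two entity nodes joined by a relationship edge.
    Illustrative only: the relationship type is derived from the
    predicate by upper-casing and replacing non-alphanumerics."""
    rel = "".join(ch if ch.isalnum() else "_" for ch in p).upper()
    return (
        f"MERGE (s:Entity {{name: '{s}'}}) "
        f"MERGE (o:Entity {{name: '{o}'}}) "
        f"MERGE (s)-[:{rel}]->(o)"
    )

stmt = triple_to_cypher("YaoMing", "nationality", "China")
```

In practice each generated statement would be sent to the Neo4j cluster through a session, and string interpolation would be replaced by query parameters to avoid injection issues.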
In the present embodiment, let t = <s, p, o> denote an instance triple, where s denotes the Subject, p denotes the Predicate and o denotes the Object. An RDF data graph is composed of a number of instance triples.
A directed graph formed by a series of interconnected RDF instance triples is called an RDF data graph rg, rg = {t_1, t_2, ..., t_i, ..., t_n}, t_i = <s_i, p_i, o_i>. In each t_i, the nodes s_i and o_i are vertices of the graph and p_i is a directed edge whose start node is s_i and whose end node is o_i.
Given triples t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j), if (s_i = s_j && o_i ≠ o_j) or (s_i = o_j && o_i ≠ s_j) or (o_i = s_j && s_i ≠ o_j) or (o_i = o_j && s_i ≠ s_j), then t_i and t_j are said to be adjacent and the triples can be connected.
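The adjacency condition above can be checked directly; a small sketch:

```python
def adjacent(ti, tj):
    """Return True when two triples (s, p, o) are adjacent per the
    definition above: they share exactly one endpoint entity."""
    si, _, oi = ti
    sj, _, oj = tj
    return ((si == sj and oi != oj) or (si == oj and oi != sj)
            or (oi == sj and si != oj) or (oi == oj and si != sj))
```

For example, ("a", "p", "b") and ("b", "q", "c") are adjacent (they share entity b), while two triples with identical endpoints are not.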
Knowledge base KB = <E, R, F, P, V>, where E denotes the set of entities, R denotes the set of relations, F denotes the set of facts in the knowledge base, P denotes the set of properties, and V denotes the set of values.
Entity set E = {e_1, e_2, ..., e_n} = Subject(KB) ∪ Object(KB), which describes all entities in the semantic network knowledge base data layer and corresponds to the set of instances in RDF.
Relation set R = {r_1, r_2, ..., r_n} = Relation(KB), which represents the relations between entities.
Fact set F ⊆ E × R × E, which represents the set of all instance triples in the knowledge base.
The attribute set P = {p_1, p_2, ..., p_n} represents the set of global attributes, which associates E with the attribute values V.
The attribute value set V = {v_1, v_2, ..., v_n} represents the set of all attribute values; its elements are nodes such as text literals.
Let the entity tag set EL = {El_1, El_2, ..., El_n} denote a set of labels that can represent all entity classes in the knowledge base. For commonly used data sets such as YAGO and DBpedia, this embodiment expands PER, LOC and ORG respectively and defines 39 types as the entity tag set, denoted EL, where Cf = {PER | ORG | LOC} denotes the set of the three broad classes, as shown in Table 1.
TABLE 1 entity tag set
In this embodiment, the common BIO sequence labeling scheme is adopted, where B denotes the beginning (Begin) of an entity, I denotes its middle (Intermediate), and O (Other) labels unrelated characters.
In this embodiment, a Redis distributed in-memory database cluster stores the definition domain and value domain of each relation in the knowledge base and the Horn logic rules mined by the algorithm. The specific tables and stored contents are shown in Table 2.
TABLE 2 Redis table design and storage description
The BILSTM-CRF model in this embodiment is composed of a bidirectional LSTM and a CRF, where the bidirectional LSTM consists of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, where forgetting, memorizing and outputting are determined by the previous hidden-layer state h_{t-1} and the current input x_t, as given by formula (1):
ft=σ(Wf·[ht-1,xt]+bf)
it=σ(Wi·[ht-1,xt]+bi)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Ot=σ(Wo·[ht-1,xt]+bo)
ht=Ot*tanh(Ct)
In the formulas, x_t, C_t, h_t, f_t, i_t and O_t respectively correspond to the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time t. The word vectors are input to the BILSTM layer, whose output values are the predicted scores of each label for each word in a sentence; these scores are input to the CRF layer. This embodiment uses the BIO tagging mode, so each word corresponds to 79 label scores (a B and an I variant for each of the 39 types, plus O).
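A scalar toy version of the LSTM equations can make the gate interactions concrete. Hidden size and input size are 1, and all weights are arbitrarily set to 0.5 — purely illustrative, not the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step with scalar input and hidden state, following the
    formulas above: each gate reads [h_{t-1}, x_t] via its own weights."""
    f_t = sigmoid(w["wf_h"] * h_prev + w["wf_x"] * x_t + w["bf"])       # forget gate
    i_t = sigmoid(w["wi_h"] * h_prev + w["wi_x"] * x_t + w["bi"])       # input gate
    c_tilde = math.tanh(w["wc_h"] * h_prev + w["wc_x"] * x_t + w["bc"]) # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    o_t = sigmoid(w["wo_h"] * h_prev + w["wo_x"] * x_t + w["bo"])       # output gate
    h_t = o_t * math.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# arbitrary illustrative weights
w = {k: 0.5 for k in ["wf_h", "wf_x", "bf", "wi_h", "wi_x", "bi",
                      "wc_h", "wc_x", "bc", "wo_h", "wo_x", "bo"]}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

With zero initial state and positive weights, the gates sit strictly between 0 and 1 and the new hidden state is bounded by tanh.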
The bidirectional LSTM can effectively combine the context on both sides of a word and thus better identify entities and predict their type labels. For example, when encoding "Yao Ming nationality China", the forward LSTM receives "Yao Ming", "nationality" and "China" in order and produces three vectors hl0, hl1 and hl2, while the backward LSTM receives "China", "nationality" and "Yao Ming" in order and produces hr0, hr1 and hr2; each final vector is the concatenation of the forward and backward vectors, so every word vector carries richer contextual information and the entity recognition accuracy is higher.
CRF layer: conditional Random Field (CRF) [9] is a conditional probability distribution model for a given set of input sequences for another set of output sequences. It can be easily found that even if no CRF layer is provided, named entity recognition and prediction can be completed only through the BILSTM model, because the output of the BILSTM layer is the prediction score of each label corresponding to each word, and the label with the highest score of each word can be selected to be combined into the best prediction label. However, in many cases the highest scoring sequence is not legal, e.g., "B-PER I-PER" is valid, but "B-PER I-ORG" is not, the role of the CRF layer may add some constraints to the last predicted tag to guarantee the validity of the predicted tag. For named entity recognition sequence tagging problems, linear conditional random fields (linear-CRF) are typically employed.
The linear conditional random field P (y | x), is given by:
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )

In the formula, λ_k and μ_l are weight coefficients, t_k and s_l are feature functions, and Z(x) is the normalization factor:

Z(x) = Σ_y exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )
The output of the BILSTM layer is used as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, the legal predicted label of each word is output.
In this embodiment, prediction can be performed once model training is completed. Each RDF triple <s, p, o> in the knowledge base is taken as input, e.g. "Yao Ming nationality China". At prediction time, the scores of all possible prediction sequences for the input sentence are computed from the trained model parameters, and the maximum is taken. The step S3 specifically includes:
step S31, each input triple is set to X = (x_1, x_2, ..., x_i, ..., x_n), and all possible prediction sequences y = (y_1, y_2, ..., y_i, ..., y_n) are obtained through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
S(X|y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

In the formula, P_{i, y_i} is the predicted score of output y_i at the i-th position and A is the transition probability matrix.
and step S32, computing the sequence y* with the maximum score, as shown in the following formula:

y* = argmax_{y∈Y} S(X|y)
step S33, converting through a relation type constraint conversion function to obtain a definition domain and a value domain of each relation in the knowledge base, wherein the relation type constraint conversion function f is as follows:
f({t_1, t_2, ..., t_i, ..., t_n}) = (p_d, p, p_r)
in the formula, t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j) denote fact triples of the relation p;

the entity classes on the two sides of the relation are converted according to the following to obtain the definition domain and value domain of the relation:

s_i SubClassOf El_si, o_i SubClassOf El_oi, s_j SubClassOf El_sj, o_j SubClassOf El_oj,
El_si SubClassOf Cf_si, El_oi SubClassOf Cf_oi, El_sj SubClassOf Cf_sj, El_oj SubClassOf Cf_oj,
p_d = Cf_si ∪ Cf_sj, p_r = Cf_oi ∪ Cf_oj

where El_si, El_oi, El_sj and El_oj respectively denote the sub-classes to which the entities s_i, o_i, s_j and o_j currently belong, and Cf_si, Cf_oi, Cf_sj and Cf_oj respectively denote the broad classes to which they belong.
In this embodiment, the optimized data balance grouping automatically discovers highly relevant relations through a clustering algorithm and then divides the relation paths related to each relation into the same partition, thereby achieving balanced and independent grouping of the data.
In this embodiment, a balanced grouping strategy is obtained by jointly considering time and space complexity: highly relevant relations are automatically discovered by a clustering algorithm, and the relation paths (transactions) related to each relation are divided into the same partition, so that the data are grouped truly uniformly and independently. Relations that share more common similar paths are more strongly coupled. Specifically, we start with |R| clusters, each cluster representing a relation r ∈ R and each point within a cluster representing a relation path associated with that relation; we then iteratively compute the distance d between each cluster and the remaining clusters. This distance essentially measures the degree of overlap between two clusters: the greater the overlap, the higher the similarity. Relations sharing a large number of common similar paths are partitioned into the same partition. The similarity of two relations is measured by the distance d between their cluster centers; the smaller d is, the higher the similarity, as shown in FIG. 5.
The center distance d between two clusters in the two-dimensional space is calculated according to the following formula:

d = sqrt((x_1 − x_2)² + (y_1 − y_2)²)

where (x_1, y_1) and (x_2, y_2) are the center coordinates of the two clusters.
As shown in fig. 5, if d satisfies case (a) or case (b) of fig. 5, the two relations are considered similar to some degree, and the smaller d is, the higher the similarity; otherwise they are treated as discrete and independent. The specific steps of the clustering algorithm are as follows.
[clustering algorithm pseudocode provided as an image in the source]
Line 3 converts each relation path into a 100-dimensional vector through word2vec; lines 4-5 reduce the high-dimensional data to 2 dimensions with t-SNE; line 6 returns each relation r and its corresponding relation path set p; lines 7-11 first perform outlier detection on each relation's path set and then determine the center coordinate O and radius m of the cluster represented by the relation through the distance function.
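Under the reading that two relation clusters are similar when their circles overlap — one interpretation of cases (a)/(b) in FIG. 5, stated here as an assumption — the center-distance test can be sketched as:

```python
import math

def center_distance(c1, c2):
    """Euclidean distance between two cluster centers in 2-D."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def similar(center1, r1, center2, r2):
    """Treat two relation clusters as similar when their circles
    overlap, i.e. the center distance is smaller than the sum of the
    radii (an assumed reading of FIG. 5)."""
    return center_distance(center1, center2) < r1 + r2

# e.g. two clusters of t-SNE-projected relation paths (made-up values)
a, b = (0.0, 0.0), (3.0, 4.0)   # center distance is 5.0
```

Clusters a (radius 3.0) and b (radius 2.5) would be grouped together, since 5.0 < 5.5; shrinking a's radius to 2.0 breaks the overlap.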
The basic process of mining frequent item sets with the FP-Growth algorithm is divided into two parts: constructing the FP tree and mining the frequent item sets from the FP tree.
(1) Building FP Tree
When the existing FP-Growth algorithm constructs a frequent pattern tree, after any transaction is inserted the transaction data set has to be updated with a sorting step, where the sort is based on the specific position of each item of the transaction in the header table. To reduce this time complexity, an algorithm that constructs the frequent pattern tree in an optimized way is proposed. The storage structure used by the algorithm is defined as follows:
linkList = {<root_i, block_i, itemSet_i>},
block_i = {<item_ij, {(frequencyItem_ijk, ancestorNode_ijk)}>},
root_i = item_i1,
itemSet_i = {item_i1, ..., item_ij}
the present embodiment is described by taking a transaction data set D as an example, and the detailed information of the data set is shown in table 4.
Table 4 transaction data D
Setting the minimum support of the data set to 3 and sorting items by support in descending order gives: f:4, c:3, a:3, b:3, m:3 and p:3. The original data set is then re-sorted according to this descending-support order; the result is shown in the rightmost column of Table 4. The SFP algorithm uses two data structures: a header table and a frequent pattern tree. The principle of the optimized header table structure is given below, as shown in fig. 6.
The pseudo code of the specific steps for constructing the frequent pattern tree is as follows:
[frequent-pattern-tree construction pseudocode provided as an image in the source]
Starting at line 3 of the code, the transactions T_i are traversed one by one, each T_i being scanned from front to back. Lines 4-7 make a judgment: according to item a_1, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it (corresponding to ① in fig. 5). Line 8, according to the block number and item a_i, first checks whether an item exists that is identical to a_i and has the same ancestor nodes; if so, that item's count is increased by 1, otherwise the item is added to the specified block (corresponding to ② in fig. 5).
In terms of time complexity, assume that each transaction in the transaction database contains k items, that the frequent item set has m elements, and that there are n transactions in total. With the original header table structure, the time complexity of inserting one transaction into the frequent pattern tree is O(m²), so constructing the whole frequent pattern tree costs O(m² · n); with the improved linked-list structure, inserting one transaction costs O(k) and constructing the whole tree costs O(k · n). As shown in fig. 5, the left graph is the frequent pattern tree before the improvement and the right graph the frequent pattern tree after it.
Before the improvement of the frequent pattern tree, the time complexity of searching for a child node is O(m); with the improved header table structure it is reduced to O(1).
Although recursion makes the code easier to understand and simpler, the time and space overhead it causes makes the algorithm inefficient to execute, so the mining efficiency of frequent item sets can be improved by reducing recursive operations.
(2) Mining frequent item sets
Taking the frequent pattern tree constructed from the transaction database D in Table 4 as an example, suppose the conditional pattern base of item m is to be found. First find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these form the conditional pattern base of item m. As shown in fig. 4, the conditional pattern bases of m are <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>, and the conditional pattern bases of p are <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. Then the conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP tree is created, and the frequent item set of the item is mined.
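The conditional-pattern-base step can be reproduced concretely. Since the contents of Table 4 are only available as an image, the classic FP-Growth example transactions are assumed here; they yield exactly the conditional pattern bases stated above:

```python
from collections import Counter

# Classic FP-Growth example transactions, already sorted in descending
# support order (f, c, a, b, m, p). This dataset is an assumption: the
# actual Table 4 is shown only as an image, and this one reproduces
# the conditional pattern bases quoted in the text.
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item, transactions):
    """For every transaction containing `item`, take the prefix path
    preceding it and merge identical prefixes with their counts —
    the same ancestor-node paths an FP tree stores for that item."""
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

cpb_m = conditional_pattern_base("m", transactions)
cpb_p = conditional_pattern_base("p", transactions)
```

Here cpb_m is {(f, c, a): 2, (f, c, a, b): 1} and cpb_p is {(f, c, a, m): 2, (c, b): 1}, matching <(f:2), (c:2), (a:2)>, <(f:1), (c:1), (a:1), (b:1)> and <(f:2), (c:2), (a:2), (m:2)>, <(c:1), (b:1)> in the text.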
Further, the step S6 is specifically:
step S61, taking the strong association rules mined through steps S3 and S4,

r_i ∧ r_j ⇒ r_z

together with the definition domains and value domains of the relations: r_i domain El_id, r_i range El_ir, r_j domain El_jd, r_j range El_jr, r_z domain El_zd, r_z range El_zr;

step S62, converting each strong association rule into a Horn rule according to the following formula:

r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z), where El_ir = El_jd, El_id = El_zd and El_jr = El_zr,

where El_id and El_ir respectively denote the definition domain and value domain of relation r_i, El_jd and El_jr those of relation r_j, and El_zd and El_zr those of relation r_z.
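The source gives the conversion formula of step S62 only as an image; as an illustrative assumption, a chain-shaped typed Horn rule r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z) can be rendered as a string as follows (relation names and classes below are made-up):

```python
def to_horn_rule(ri, rj, rz, domains, ranges):
    """Render a typed chain Horn rule for a mined strong association
    rule r_i ∧ r_j ⇒ r_z. The chain shape and the variable typing are
    assumptions, since the conversion formula is only an image in the
    source; shared variables are typed by the relations' domains/ranges."""
    return (f"{domains[ri]}(x) ∧ {ranges[ri]}(y) ∧ {ri}(x, y) ∧ "
            f"{ranges[rj]}(z) ∧ {rj}(y, z) ⇒ {rz}(x, z)")

# hypothetical relations with their definition domains and value domains
domains = {"bornIn": "PER", "locatedIn": "LOC", "nationality": "PER"}
ranges = {"bornIn": "LOC", "locatedIn": "LOC", "nationality": "LOC"}
rule = to_horn_rule("bornIn", "locatedIn", "nationality", domains, ranges)
```

The produced rule reads "PER(x) ∧ LOC(y) ∧ bornIn(x, y) ∧ LOC(z) ∧ locatedIn(y, z) ⇒ nationality(x, z)": typing the shared variables is what fixes the direction of each relation, as discussed below.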
In this embodiment, two strong association rules mined by the SFP algorithm are taken as examples to illustrate, by comparison, the advantage of converting strong association rules into Horn logic rules through relation type constraints.
(1)
[strong association rule shown as an image in the source]
The generated Horn logic rule is as follows:
[Horn logic rule shown as an image in the source]
(2)
[strong association rule shown as an image in the source]
The generated Horn logic rule is as follows:
[Horn logic rule shown as an image in the source]
It is easy to see that although a reasonable Horn logic rule is generated by using the relation type constraint in expression (15), the directions of the relations are not always consistent, and many cases resemble expression (16). Using relation type constraints can fix the direction of each relation, because connecting entities that share a variable should belong to the same label type; this makes the converted Horn logic rules more complete.
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (8)

1. A knowledge base completion method based on a PRMATC algorithm is characterized by comprising the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing data-balanced grouping and FP-tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the obtained definition domains and value domains of the relations, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
2. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
3. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, wherein the forgetting, memorizing and outputting are from a previous hidden layer state ht-1And current input XtThe specific calculation formula 4 is determined.
Figure FDA0002323913670000021
In formula (4), Xt, Ct, ht, ft, it and Ot respectively denote the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time t; the word vectors are input to the BILSTM layer, whose output values are the predicted scores of each label for each word in a sentence, and these scores are input to the CRF layer.
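The gate update summarized above can be sketched for a single scalar unit as follows (the standard LSTM equations with sigmoid gates and a tanh candidate; the weights are toy values, not trained parameters):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One timestep; w maps gate name -> (input weight, hidden weight, bias)."""
    f_t = sigmoid(w["f"][0] * x_t + w["f"][1] * h_prev + w["f"][2])  # forget gate
    i_t = sigmoid(w["i"][0] * x_t + w["i"][1] * h_prev + w["i"][2])  # input (memory) gate
    o_t = sigmoid(w["o"][0] * x_t + w["o"][1] * h_prev + w["o"][2])  # output gate
    c_hat = math.tanh(w["c"][0] * x_t + w["c"][1] * h_prev + w["c"][2])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat      # forget part of the old cell, memorize the new
    h_t = o_t * math.tanh(c_t)            # expose a gated view of the cell state
    return h_t, c_t

w = {g: (0.5, 0.3, 0.0) for g in "fioc"}  # toy shared weights per gate
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.2]:
    h, c = lstm_step(x, h, c, w)
```

A bidirectional LSTM runs this recurrence once left-to-right and once right-to-left and concatenates the two hidden states per position.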
4. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the CRF layer employs a linear conditional random field P (y | x) as shown below:
Figure FDA0002323913670000022
in formula (5), λk and μl are weight coefficients, tk and sl are feature functions, and Z(x) is the normalization factor
Figure FDA0002323913670000023
And the output of the BILSTM layer is used as the input of a CRF layer, and a legal prediction label of each word is output after the CRF layer characteristic function operation and the normalization operation.
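A toy sketch of this linear-chain CRF: the score of a label sequence sums transition features (weights λk) and state features (weights μl), and P(y|x) normalizes exp(score) over all label sequences. Brute-force enumeration stands in for the forward algorithm, and the feature weights are illustrative values only:

```python
import itertools, math

LABELS = ["B", "I", "O"]
trans = {("B", "I"): 1.0, ("I", "I"): 0.5, ("O", "B"): 0.8}  # transition weights (lambda_k)
emit = {("B", 0): 1.2, ("I", 1): 0.9, ("O", 2): 1.1}         # state weights (mu_l)

def score(y):
    s = sum(trans.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))  # sum of lambda_k t_k
    s += sum(emit.get((y[i], i), 0.0) for i in range(len(y)))            # sum of mu_l s_l
    return s

def prob(y):
    # Z(x): sum of exp(score) over every possible label sequence of the same length
    Z = sum(math.exp(score(q)) for q in itertools.product(LABELS, repeat=len(y)))
    return math.exp(score(y)) / Z

p = prob(("B", "I", "O"))
```

Because Z(x) ranges over all sequences, the probabilities of all label sequences of a given length sum to one.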
5. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S3 specifically comprises:
step S31, for each input sequence X = (x1, x2, ...xi, ...xn), obtaining all possible prediction sequences y = (y1, y2, ..., yi, ...yn) through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
Figure FDA0002323913670000024
in formula (7),
Figure FDA0002323913670000031
is the score of the output yi at the i-th position, and A is the transition probability matrix;
and step S32, calculating the sequence y* with the maximum score, as shown in the following formula:
y* = argmaxy∈Y S(X|y);
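Formula (7) and the argmax above can be sketched as follows (exhaustive search over label sequences stands in for Viterbi decoding; the emission scores P and the transition matrix A hold toy values, not learned parameters):

```python
import itertools

P = [  # P[i][label]: BiLSTM emission score of each label at position i (toy values)
    {"B": 2.0, "I": 0.1, "O": 0.5},
    {"B": 0.2, "I": 1.8, "O": 0.4},
    {"B": 0.3, "I": 0.2, "O": 1.5},
]
A = {  # A[prev][next]: transition score from one label to the next (toy values)
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": 0.0, "I": 0.5, "O": 0.5},
    "O": {"B": 0.5, "I": -1.0, "O": 0.2},
}

def S(y):
    # emission part plus transition part, mirroring the score of a prediction sequence
    return sum(P[i][y[i]] for i in range(len(y))) + \
           sum(A[y[i - 1]][y[i]] for i in range(1, len(y)))

# y* = argmax over all candidate label sequences
y_star = max(itertools.product("BIO", repeat=3), key=S)
```

A real tagger replaces the exhaustive `max` with Viterbi dynamic programming, which finds the same y* in linear time per position.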
step S33, converting through a relation type constraint conversion function, wherein the relation type constraint conversion function f is as follows:
f({t1,t2,...ti,...,tn})=(pd,p,pr)
in the formula, ti = (si, pi, oi) and tj = (sj, pj, oj) denote fact triples of the relation p;
the entity classes on the two sides of the relation are converted according to the following formulas to obtain the definition domain and value domain of the relation:
si SubClassOf Elsi, oi SubClassOf Eloi, sj SubClassOf Elsj, oj SubClassOf Eloj, Elsi, Eloi, Elsj,
Figure FDA0002323913670000032
Figure FDA0002323913670000033
Figure FDA0002323913670000034
wherein Elsi, Eloi, Elsj and Eloj respectively denote the subclasses to which the entities si, oi, sj and oj belong, and
Figure FDA0002323913670000035
respectively denote the broad classes to which the entities si, oi, sj and oj belong.
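A simplified sketch of the conversion function f (the entity and class names are hypothetical): collect the classes of all subjects and all objects of a relation p and generalize each side to its most frequent class, yielding the signature (pd, p, pr):

```python
from collections import Counter

def relation_signature(triples, entity_class):
    """triples: (s, p, o) facts of a single relation; entity_class: entity -> class label."""
    p = triples[0][1]
    dom = Counter(entity_class[s] for s, _, o in triples)  # classes on the subject side
    rng = Counter(entity_class[o] for s, _, o in triples)  # classes on the object side
    # generalize each side to its most frequent class: (p_d, p, p_r)
    return dom.most_common(1)[0][0], p, rng.most_common(1)[0][0]

facts = [("paris", "cityOf", "france"), ("lyon", "cityOf", "france")]
classes = {"paris": "City", "lyon": "City", "france": "Country"}
signature = relation_signature(facts, classes)
```

In the patent the entity classes come from the trained BILSTM-CRF labels; here a plain lookup table stands in for that model.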
6. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the optimized data-balanced grouping automatically discovers highly related relations through a clustering algorithm and then assigns the relation paths related to each relation to the same partition, thereby realizing balanced and independent grouping of the data.
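The balanced-assignment half of this idea can be sketched as follows. The clustering step is abstracted away: clusters of related relation paths are assumed given, and a greedy least-loaded heuristic (an assumption, not the patent's exact procedure) assigns each cluster to one partition so partitions stay balanced and independent:

```python
def group_relations(cluster_sizes, n_partitions):
    """cluster_sizes: cluster id -> number of relation paths in the cluster.
    Returns (cluster -> partition assignment, per-partition loads)."""
    loads = [0] * n_partitions
    assignment = {}
    # place larger clusters first, always into the currently least-loaded partition
    for cid, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[cid] = target
        loads[target] += size
    return assignment, loads

assignment, loads = group_relations({"c1": 8, "c2": 5, "c3": 4, "c4": 3}, 2)
```

Keeping a whole cluster in one partition is what makes the groups independent: all paths that mention a relation end up on the same worker.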
7. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the step S4 specifically includes:
step S41: traversing the transactions Ti one by one, each Ti being traversed from front to back;
step S42: according to the first item a1, judging whether a partition with this item as the root node exists; if so, returning the partition number, otherwise adding partition information with this item as the root node and returning;
step S43: according to the block number and the item ai, first searching whether an identical item with the same ancestor nodes already exists; if so, adding 1 to the count of that item, otherwise adding the item to the specified block;
step S44: finding the block number of the item m in the owned-item set, then searching all ancestor nodes of the item m in the corresponding block, which form the conditional pattern base of the item m;
step S45: the conditional pattern bases of m are < (f:2), (c:2), (a:2) > and < (f:1), (c:1), (a:1), (b:1) >, and similarly the conditional pattern bases of p are < (f:2), (c:2), (a:2), (m:2) > and < (c:1), (b:1) >; taking the conditional pattern base of each item as the input of that item's mapper stage, creating a conditional FP tree, and mining the frequent item set of the item.
8. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S6 specifically comprises:
step S61, mining the strong association rules obtained through steps S3 and S4
Figure FDA0002323913670000041
together with the definition domains and value domains of the relations: ri domain Elid, ri range Elir; rj domain Eljd, rj range Eljr; rz domain Elzd, rz range Elzr;
Step S62, converting the strong association rules into Horn rules according to the following formula
Figure FDA0002323913670000042
wherein Elid and Elir respectively denote the definition domain and value domain of the relation ri, Eljd and Eljr denote those of the relation rj, and Elzd and Elzr denote those of the relation rz.
CN201911308709.7A 2019-12-18 2019-12-18 Knowledge base completion method based on parallel rule mining algorithm PRMATC Active CN111078896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911308709.7A CN111078896B (en) 2019-12-18 2019-12-18 Knowledge base completion method based on parallel rule mining algorithm PRMATC


Publications (2)

Publication Number Publication Date
CN111078896A true CN111078896A (en) 2020-04-28
CN111078896B CN111078896B (en) 2022-06-21

Family

ID=70315444


Country Status (1)

Country Link
CN (1) CN111078896B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190122111A1 (en) * 2017-10-24 2019-04-25 Nec Laboratories America, Inc. Adaptive Convolutional Neural Knowledge Graph Learning System Leveraging Entity Descriptions
CN110347847A (en) * 2019-07-22 2019-10-18 西南交通大学 Knowledge mapping complementing method neural network based


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Liping, ZHANG Xinyou, NIU Xiaolu, GUO Yongkun, DING Liang: "A Survey of Parallel Association Rule Mining Algorithms Based on Spark", Computer Engineering and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048447A (en) * 2022-06-27 2022-09-13 华中科技大学 Database natural language interface system based on intelligent semantic completion
CN115048447B (en) * 2022-06-27 2023-06-16 华中科技大学 Database natural language interface system based on intelligent semantic completion
CN115952361A (en) * 2023-03-15 2023-04-11 中国科学院大学 Dynamic recommendation system and method based on LSTM network and PPR algorithm


Similar Documents

Publication Publication Date Title
Zhou et al. A learned query rewrite system using monte carlo tree search
WO2022205833A1 (en) Method and system for constructing and analyzing knowledge graph of wireless network protocol, and device and medium
CN110347847A (en) Knowledge mapping complementing method neural network based
CN111611274A (en) Database query optimization method and system
Halim et al. On the efficient representation of datasets as graphs to mine maximal frequent itemsets
Wu et al. Generalized association rule mining using an efficient data structure
CN111078896B (en) Knowledge base completion method based on parallel rule mining algorithm PRMATC
CN104137095A (en) System for evolutionary analytics
CN107656978B (en) Function dependence-based diverse data restoration method
Gan et al. Explainable fuzzy utility mining on sequences
CN116127084A (en) Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method
CN109885694B (en) Document selection and learning sequence determination method
Yang et al. A novel evolutionary method to search interesting association rules by keywords
CN113361279A (en) Medical entity alignment method and system based on double neighborhood map neural network
CN111444316B (en) Knowledge graph question-answering-oriented compound question analysis method
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
CN113836174B (en) Asynchronous SQL (structured query language) connection query optimization method based on reinforcement learning DQN (direct-to-inverse) algorithm
CN114662012A (en) Community query analysis method oriented to gene regulation network
Lin et al. Efficient mining of high average-utility sequential patterns from uncertain databases
CN110991186A (en) Entity analysis method based on probability soft logic model
CN112487015B (en) Distributed RDF system based on incremental repartitioning and query optimization method thereof
CN113139657B (en) Machine thinking realization method and device
Liu et al. Clumppling: cluster matching and permutation program with integer linear programming
Cai et al. An improved knowledge graph model based on fuzzy theory and TransR
Xu et al. Joint Entity Relation Extraction based on Graph Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant