CN111078896A - Knowledge base completion method based on PRMATC algorithm - Google Patents
Knowledge base completion method based on PRMATC algorithm
- Publication number: CN111078896A
- Application number: CN201911308709.7A
- Authority: CN (China)
- Prior art keywords: item, knowledge base, algorithm, domain, relation
- Legal status: Granted
Classifications
- G06F16/367—Ontology
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/285—Clustering or classification
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention relates to a knowledge base completion method based on a PRMATC algorithm, which comprises the following steps. Step S1: import and store all fact triples and entities of a large-scale semantic network knowledge base KB in a distributed-cluster Neo4j database. Step S2: construct and train a BILSTM-CRF model. Step S3: identify and classify the entities on both sides of each relation with the trained BILSTM-CRF model, and convert the results to obtain the definition domain and value domain of the relation. Step S4: improve the FP-Growth algorithm. Step S5: mine the implicit strong association rules among transactions. Step S6: convert the obtained relation definition domains and the strong association rules into Horn logic rules. Step S7: derive new knowledge from the obtained Horn logic rules and add it to the knowledge base KB. The method can efficiently discover Horn rules that represent the knowledge base, outperforms other rule mining systems in both the number and the accuracy of mined rules, and completes the knowledge base more effectively.
Description
Technical Field
The invention relates to the field of mass data storage and reasoning under a knowledge graph, in particular to a knowledge base completion method based on a PRMATC algorithm.
Background
Mining Horn rules from a large-scale semantic network knowledge base and then using those rules to infer and add the knowledge the base lacks is one of the most effective means of achieving dynamic growth of a knowledge base. Association rule mining is one of the important algorithms in the field of data mining; its aim is to mine the implicit relationships that exist among transactions. Conventional algorithms include the Apriori algorithm [1] and the FP-Growth algorithm [2]. Traditional association rule mining algorithms work well on small-scale data sets, but with the rapid development of internet technology in recent years network data has grown explosively, and traditional algorithms suffer from problems such as a single node being unable to complete the computation and insufficient working memory, so they cannot meet the requirements of big network data.
Disclosure of Invention
In view of this, the present invention provides a knowledge base completion method based on the PRMATC algorithm, which can efficiently mine a set of Horn logic rules that represent the semantic information of a knowledge base and thereby better complete the knowledge base.
In order to achieve the purpose, the invention adopts the following technical scheme:
a knowledge base completion method based on a PRMATC algorithm comprises the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing data balance grouping and FP tree construction and excavation on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, digging out implicit strong association rules among the transactions according to the improved FP-Growth algorithm;
step S6, converting the obtained relation definition domains and the strong association rules into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
Further, the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
Further, the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM computation is realized by forgetting and memorizing information in the cell state; what is forgotten, memorized and output is determined by the previous hidden-layer state $h_{t-1}$ and the current input $x_t$, as given in formula (4).
In formula (4), $x_t$, $C_t$, $h_t$, $f_t$, $i_t$ and $O_t$ are respectively the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time $t$; the word vectors are input to the BILSTM layer, whose outputs are the predicted scores of every label for each word in a sentence, and these scores are input to the CRF layer.
Further, the CRF layer employs a linear-chain conditional random field $P(y|x)$, as shown in formula (5):

$$P(y|x)=\frac{1}{Z(x)}\exp\Big(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\Big)$$

In formula (5), $\lambda_k$ and $\mu_l$ are weight coefficients, $t_k$ and $s_l$ are feature functions, and $Z(x)$ is the normalization factor. The output of the BILSTM layer serves as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, a legal predicted label is output for each word.
Further, step S3 specifically comprises:

Step S31: for each input triple $X=(x_1,x_2,\ldots,x_i,\ldots,x_n)$, obtain all possible predicted sequences $y=(y_1,y_2,\ldots,y_i,\ldots,y_n)$ through the BILSTM layer and the CRF layer, and compute the score $S(X|y)$ of each predicted sequence $y$.

Step S32: take the sequence with the maximum score, $y^{*}=\arg\max_{y\in Y}S(X|y)$.

Step S33: obtain the definition domain and value domain of every relation in the knowledge base through the relation-type-constraint conversion function $f$:

$$f(\{t_1,t_2,\ldots,t_i,\ldots,t_n\})=(p_d,\,p,\,p_r)$$

where $t_i=(s_i,p_i,o_i)$ and $t_j=(s_j,p_j,o_j)$ are fact triples of the relation $p$. The entity classes on the two sides of the relation are then converted to yield the relation's definition domain and value domain, where $El_{s_i}$, $El_{o_i}$, $El_{s_j}$, $El_{o_j}$ denote the sub-classes to which the entities $s_i$, $o_i$, $s_j$, $o_j$ currently belong, and the corresponding broad-class labels denote the major classes to which these entities belong.
Further, the optimized data-balanced grouping automatically discovers highly related relations through a clustering algorithm and then assigns the relation paths related to each relation to the same partition, achieving balanced and independent grouping of the data.
The step S4 specifically includes:

Step S41: traverse the transactions $T_i$ one by one, scanning each $T_i$ from front to back.

Step S42: according to the first item $a_1$, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it.

Step S43: according to the block number and item $a_i$, first search whether an item exists that is identical to $a_i$ and has the same ancestor nodes; if so, increase that item's count by 1, otherwise add the item to the specified block.

Step S44: to mine an item m, find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these form the conditional pattern base of item m.

Step S45: for example, the conditional pattern bases of m are <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>, and similarly the conditional pattern bases of p are <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. The conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP tree is created, and the item's frequent item set is mined.
Further, the step S6 is specifically:

Step S61: take the strong association rules mined via steps S3 and S4 together with the definition domains and value domains of the relations involved: $El_{id}$ and $El_{ir}$ for $r_i$, $El_{jd}$ and $El_{jr}$ for $r_j$, and $El_{zd}$ and $El_{zr}$ for $r_z$.

Step S62: convert each strong association rule into a Horn rule according to the conversion formula, wherein $El_{id}$ and $El_{ir}$ respectively denote the definition domain and value domain of relation $r_i$, $El_{jd}$ and $El_{jr}$ those of relation $r_j$, and $El_{zd}$ and $El_{zr}$ those of relation $r_z$.
Compared with the prior art, the invention has the following beneficial effects:
The method can efficiently discover Horn rules that represent the knowledge base, outperforms other rule mining systems in both the number and the accuracy of mined rules, and completes the knowledge base more effectively.
Drawings
FIG. 1 is a flow chart of a method in one embodiment of the present invention;
FIG. 2 is an exemplary diagram illustrating knowledge base completion using Horn logic rules in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of the PRMATC algorithm in accordance with one embodiment of the present invention;
FIG. 4 is a schematic diagram of the BILSTM-CRF model in accordance with an embodiment of the present invention;
FIG. 5 is a graph of inter-cluster overlap in an embodiment of the present invention;
FIG. 6 is a schematic diagram of an optimized chain head table structure according to an embodiment of the present invention
FIG. 7 is a modified frequent pattern tree in an embodiment of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a method for complementing a knowledge base based on a PRMATC algorithm, which includes the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing data balance grouping and FP tree construction and excavation on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, digging out implicit strong association rules among the transactions according to the improved FP-Growth algorithm;
step S6, converting the domain and the strong association rule into a horny logic rule according to the obtained relationship;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
In the present embodiment, $t=\langle s,p,o\rangle$ denotes an instance triple, where $s$ denotes the Subject, $p$ the Predicate and $o$ the Object. An RDF data graph is composed of a number of instance triples.

The directed graph formed by a series of interconnected RDF instance triples is called the RDF data graph $rg=\{t_1,t_2,\ldots,t_i,\ldots,t_n\}$; in each $t_i$, the nodes $s_i$ and $o_i$ are vertices of the graph and $p_i$ is a directed edge whose start node is $s_i$ and whose end node is $o_i$.

Given triples $t_i=(s_i,p_i,o_i)$ and $t_j=(s_j,p_j,o_j)$, if $(s_i=s_j \wedge o_i\neq o_j)$ or $(s_i=o_j \wedge o_i\neq s_j)$ or $(o_i=s_j \wedge s_i\neq o_j)$ or $(o_i=o_j \wedge s_i\neq s_j)$, then $t_i$ and $t_j$ are said to be adjacent, and the triples can be connected.
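As a minimal illustration, the adjacency condition above can be sketched in Python; the entity and relation names used in the example call are placeholders, not data from the patent.

```python
def adjacent(t1, t2):
    """Return True if triples t1=(s1,p1,o1) and t2=(s2,p2,o2) are adjacent,
    i.e. they share one endpoint entity without being the same edge."""
    s1, _, o1 = t1
    s2, _, o2 = t2
    return ((s1 == s2 and o1 != o2) or
            (s1 == o2 and o1 != s2) or
            (o1 == s2 and s1 != o2) or
            (o1 == o2 and s1 != s2))

# Illustrative triples: they share the node "China", so they are adjacent.
print(adjacent(("YaoMing", "nationality", "China"),
               ("China", "capital", "Beijing")))  # prints True
```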
Knowledge base KB = <E, R, F, P, V>, where E denotes the set of entities, R the set of relations, F the set of facts in the knowledge base, P the set of properties, and V the set of property values.

Entity set E = {e1, e2, ..., en} = Subject(KB) ∪ Object(KB); it describes all entities in the semantic network knowledge base data layer and corresponds to the set of instances in RDF.

Relation set R = {r1, r2, ..., rn} = Relation(KB); it represents the relations between entities.

The attribute set P represents the set of global attributes, P = {p1, p2, ..., pn}, which associates E with the attribute values V.

The attribute value set V = {v1, v2, ..., vn} represents nodes such as text literals.
Let the entity tag set EL = {El1, El2, ..., Eln}; it is the set of labels that can represent all entity classes in the knowledge base. For commonly used data sets such as YAGO and DBpedia, this embodiment expands PER, LOC and ORG respectively and defines 39 types as the entity tag set, denoted EL, where Cf = {PER | ORG | LOC} denotes the set of the three major classes, as shown in Table 1.
TABLE 1 entity tag set
In this embodiment, a more common sequence labeling mode BIO is adopted, where B denotes start (Begin), I denotes middle (Intermediate), and O denotes Other (Other) for labeling unrelated characters.
In this embodiment, a Redis distributed in-memory database cluster stores the definition domain and value domain of each relation in the knowledge base as well as the Horn logic rules mined by the algorithm. The specific tables and stored contents are shown in Table 2.
TABLE 2 Redis table design and storage description
The BILSTM-CRF model in the embodiment is composed of two-way LSTM and CRF, wherein the two-way LSTM is composed of forward LSTM and backward LSTM;
The LSTM computation is realized by forgetting and memorizing information in the cell state; what is forgotten, memorized and output is determined by the previous hidden-layer state $h_{t-1}$ and the current input $x_t$, as shown in formula (1):

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$
$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$
$$\tilde{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$$
$$C_t=f_t*C_{t-1}+i_t*\tilde{C}_t$$
$$O_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$
$$h_t=O_t*\tanh(C_t)$$

In the formulas, $x_t$, $C_t$, $h_t$, $f_t$, $i_t$ and $O_t$ are respectively the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time $t$. The word vectors are input to the BILSTM layer, whose outputs are the predicted scores of every label for each word in a sentence; these scores are input to the CRF layer. This embodiment employs the BIO tagging mode, so each word corresponds to 79 label scores.
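The gate equations can be sketched as a single NumPy step; the weight shapes and the random initialization below are illustrative assumptions, not the trained model.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations of formula (1);
    W[k] maps the concatenation [h_prev; x_t] to gate k's pre-activation."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ hx + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])        # input gate
    o_t = sigmoid(W["o"] @ hx + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ hx + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # updated cell state
    h_t = o_t * np.tanh(c_t)                   # new hidden-layer state
    return h_t, c_t

rng = np.random.default_rng(0)                 # illustrative random weights
d_h, d_x = 4, 3
W = {k: rng.normal(size=(d_h, d_h + d_x)) for k in "fioc"}
b = {k: np.zeros(d_h) for k in "fioc"}
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
```

A bidirectional LSTM would run this step once left-to-right and once right-to-left and concatenate the two hidden states per word.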
The bidirectional LSTM effectively combines the context on both sides of each word, so it identifies entities and predicts their type labels better. For example, when encoding "Yao Ming nationality China", the forward LSTM receives "Yao Ming", "nationality" and "China" in order and produces the three vectors hl0, hl1 and hl2, while the backward LSTM receives "China", "nationality" and "Yao Ming" in order and produces hr0, hr1 and hr2; the final vector of each word is the concatenation of its forward and backward vectors. Each word vector therefore contains richer corpus information, and the entity recognition accuracy is higher.
CRF layer: conditional Random Field (CRF) [9] is a conditional probability distribution model for a given set of input sequences for another set of output sequences. It can be easily found that even if no CRF layer is provided, named entity recognition and prediction can be completed only through the BILSTM model, because the output of the BILSTM layer is the prediction score of each label corresponding to each word, and the label with the highest score of each word can be selected to be combined into the best prediction label. However, in many cases the highest scoring sequence is not legal, e.g., "B-PER I-PER" is valid, but "B-PER I-ORG" is not, the role of the CRF layer may add some constraints to the last predicted tag to guarantee the validity of the predicted tag. For named entity recognition sequence tagging problems, linear conditional random fields (linear-CRF) are typically employed.
The linear-chain conditional random field $P(y|x)$ is given by:

$$P(y|x)=\frac{1}{Z(x)}\exp\Big(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\Big)$$

where $\lambda_k$ and $\mu_l$ are weight coefficients, $t_k$ and $s_l$ are feature functions, and $Z(x)$ is the normalization factor. The output of the BILSTM layer serves as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, a legal predicted label is output for each word.
In this embodiment, prediction can be performed after model training is completed. Each RDF triple <s, p, o> in the knowledge base is taken as input, e.g. "Yao Ming nationality China". At prediction time, the scores of all possible predicted sequences of the input sentence are obtained from the trained model parameters and the maximum is taken. Step S3 specifically comprises:
Step S31: for each input triple $X=(x_1,x_2,\ldots,x_i,\ldots,x_n)$, obtain all possible predicted sequences $y=(y_1,y_2,\ldots,y_i,\ldots,y_n)$ through the BILSTM layer and the CRF layer, and compute the score $S(X|y)$ of each predicted sequence $y$.

Step S32: take the sequence with the maximum score, $y^{*}=\arg\max_{y\in Y}S(X|y)$.
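The text does not reproduce the score formula; in standard BiLSTM-CRF practice, $S(X|y)$ sums per-word emission scores with label-transition scores. A brute-force sketch under that assumption follows (a real decoder would use Viterbi; all scores below are made up for illustration).

```python
import itertools

# Hypothetical emission scores P[i][label] (from the BiLSTM layer) and
# transition scores A[(prev, cur)] (learned by the CRF layer).
def seq_score(P, A, y):
    s = sum(P[i][lab] for i, lab in enumerate(y))            # emissions
    s += sum(A[(y[i - 1], y[i])] for i in range(1, len(y)))  # transitions
    return s

def best_sequence(P, A, labels):
    """Brute-force y* = argmax_y S(X|y); feasible only for tiny label sets."""
    return max(itertools.product(labels, repeat=len(P)),
               key=lambda y: seq_score(P, A, y))

labels = ["B-PER", "I-PER", "O"]
P = [{"B-PER": 2.0, "I-PER": 0.1, "O": 0.3},
     {"B-PER": 0.2, "I-PER": 1.5, "O": 0.4}]
A = {(a, b): (1.0 if (a, b) == ("B-PER", "I-PER") else 0.0)
     for a in labels for b in labels}
print(best_sequence(P, A, labels))  # ('B-PER', 'I-PER')
```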
Step S33: obtain the definition domain and value domain of every relation in the knowledge base through the relation-type-constraint conversion function $f$:

$$f(\{t_1,t_2,\ldots,t_i,\ldots,t_n\})=(p_d,\,p,\,p_r)$$

where $t_i=(s_i,p_i,o_i)$ and $t_j=(s_j,p_j,o_j)$ are fact triples of the relation $p$. The entity classes on the two sides of the relation are then converted to yield the relation's definition domain and value domain, where $El_{s_i}$, $El_{o_i}$, $El_{s_j}$, $El_{o_j}$ denote the sub-classes to which the entities $s_i$, $o_i$, $s_j$, $o_j$ currently belong, and the corresponding broad-class labels denote the major classes to which these entities belong.
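One plausible realization of the conversion, an assumption, since the aggregation rule is not spelled out in the text, is to take the majority entity label on each side of a relation's triples:

```python
from collections import Counter

def domain_range(labeled_triples):
    """Given (subject_label, relation, object_label) tuples for one relation,
    take the most frequent label on each side as definition domain / value domain."""
    subj = Counter(s for s, _, _ in labeled_triples)
    obj = Counter(o for _, _, o in labeled_triples)
    return subj.most_common(1)[0][0], obj.most_common(1)[0][0]

# Illustrative labelled triples for a single relation "nationality".
triples = [("PER", "nationality", "LOC"),
           ("PER", "nationality", "LOC"),
           ("ORG", "nationality", "LOC")]
print(domain_range(triples))  # ('PER', 'LOC')
```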
In this embodiment, the optimized data-balanced grouping automatically discovers highly related relations through a clustering algorithm and then assigns the relation paths related to each relation to the same partition, achieving balanced and independent grouping of the data.
In this embodiment, a balanced grouping strategy is obtained by jointly considering time complexity and space complexity: highly related relations are discovered automatically by a clustering algorithm, and the relation paths (transactions) related to each relation are divided into the same partition, so that the data are grouped truly uniformly and independently. Relations that share more common similar paths should be more tightly coupled. Specifically, we start from $|R|$ clusters, each cluster representing a relation $r\in R$ and each point within a cluster representing a relation path associated with that relation; we then iteratively compute the distance $d$ between each cluster and the remaining clusters. This distance essentially measures the degree of overlap between two clusters: the greater the overlap, the higher the similarity. Relations that share a large number of common similar paths are assigned to the same partition. The similarity of two relations is measured by the centre distance $d$ between their clusters; the smaller $d$, the higher the similarity, as shown in Fig. 5.
The centre distance $d$ between two clusters in the two-dimensional space is computed as

$$d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$

where $(x_1,y_1)$ and $(x_2,y_2)$ are the two cluster centres. As shown in Fig. 5, if $d$ satisfies case (a) or case (b) of Fig. 5, the two relations are considered similar to some degree, and the smaller $d$, the higher the similarity; otherwise they are discrete and independent. The specific steps of the clustering algorithm are as follows.
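A minimal sketch of the centre-distance test, assuming Euclidean centres and treating overlapping circles (d below the sum of the radii) as the similar cases of Fig. 5; the centres and radii below are illustrative.

```python
import math

def center_distance(c1, c2):
    """Euclidean distance between two cluster centres in 2-D
    (assumed form of the centre distance d described above)."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def similar(c1, c2, r1, r2):
    """Treat two relation clusters as similar when their circles overlap,
    i.e. the centre distance is below the sum of their radii."""
    return center_distance(c1, c2) < r1 + r2

print(similar((0.0, 0.0), (1.0, 0.0), 0.8, 0.8))  # True: the circles overlap
```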
The basic process of excavating a frequent item set by the FP-growth algorithm is divided into two parts: constructing the FP tree and mining a frequent item set from the FP tree.
(1) Building FP Tree
When the existing FP-Growth algorithm constructs the frequent pattern tree, after any transaction is inserted the transaction data set must be updated with a sorting step, the sort key being the position of each item of the transaction in the chain head table. To reduce the time complexity, an optimized algorithm for constructing the frequent pattern tree is proposed. The storage structure used by the algorithm is defined as follows:

linkList = {<root_i, block_i, itemSet_i>}
block_i = {<item_ij, {(frequencyItem_ijk, ancestorNode_ijk)}>}
root_i = item_i1
itemSet_i = {item_i1, ..., item_ij}
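The linkList/block structure can be sketched with plain dictionaries; the layout below (item mapped to {ancestor path: count} inside the block of its root item) is one assumed realization of the definition above, not the patent's exact implementation.

```python
# linkList maps each root item to its block; a block maps each item to its
# (ancestor_path -> count) entries, mirroring
# block_i = {<item_ij, {(frequencyItem_ijk, ancestorNode_ijk)}>}.
link_list = {}

def insert_transaction(items):
    """Insert one support-ordered transaction; items[0] is the root item."""
    root = items[0]
    block = link_list.setdefault(root, {})   # find or create the block
    for i, item in enumerate(items):
        ancestors = tuple(items[:i])         # ancestor nodes of this item
        entries = block.setdefault(item, {})
        entries[ancestors] = entries.get(ancestors, 0) + 1  # count += 1

# Two illustrative support-ordered transactions sharing the root "f".
insert_transaction(["f", "c", "a", "m", "p"])
insert_transaction(["f", "c", "a", "b", "m"])
```

With this layout, the ancestor paths stored for an item double as its conditional pattern base, so no tree traversal is needed to collect them.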
the present embodiment is described by taking a transaction data set D as an example, and the detailed information of the data set is shown in table 4.
Table 4 transaction data D
Setting the minimum support of the data set to 3 and sorting the supports from large to small yields: f:4, c:3, a:3, b:3, m:3, p:3. The original data set is then re-sorted by decreasing support; the result is shown in the rightmost column of Table 4. The SFP algorithm uses two data structures: a chain head table and a frequent pattern tree. The principle of the optimized chain-head-table structure is given below, as shown in Fig. 6.
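The support-counting and reordering preprocessing can be sketched as follows; the transactions below are illustrative, not the patent's Table 4, which is not reproduced in the text.

```python
from collections import Counter

def order_transactions(db, min_support):
    """Count item supports, drop infrequent items, and reorder every
    transaction by descending support (FP-Growth preprocessing)."""
    support = Counter(item for txn in db for item in txn)
    frequent = {i for i, c in support.items() if c >= min_support}
    rank = {i: (-support[i], i) for i in frequent}  # ties broken alphabetically
    return [sorted((i for i in txn if i in frequent), key=rank.__getitem__)
            for txn in db]

db = [["f", "a", "c", "d", "m"],
      ["a", "c", "f", "m"],
      ["f", "b"],
      ["c", "a", "m", "f"]]
print(order_transactions(db, 3))
```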
The pseudo code of the specific steps for constructing the frequent pattern tree is as follows:
Starting at line 3 of the code, the transactions $T_i$ are traversed one by one, each $T_i$ being scanned from front to back. Lines 4-7 make a judgment: according to the first item $a_1$, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it (corresponding to ① in Fig. 5). Line 8, according to the block number and item $a_i$, first searches whether an item exists that is identical to it and has the same ancestor nodes; if so, the item count is increased by 1, otherwise the item is added to the specified block (corresponding to ② in Fig. 5).
In terms of time complexity, assume each transaction in the transaction database contains $k$ items, the frequent item set has $m$ elements, and there are $n$ transactions in total. With the original item-header-table structure, inserting one transaction into the frequent pattern tree costs $O(m^2)$ and constructing the whole tree costs $O(m^2\cdot n)$; with the improved linked-list structure, inserting one transaction costs $O(k)$ and constructing the whole tree costs $O(k\cdot n)$. As shown in Fig. 5, the left graph is the frequent pattern tree before the modification and the right graph the tree after it.

Before the improvement, searching a child node in the frequent pattern tree costs $O(m)$; with the improved chain-head-table structure this is reduced to $O(1)$.
Although recursion makes the code easier to understand and simpler, the time and space overhead it introduces makes the algorithm inefficient to execute, so the mining efficiency of frequent item sets can be improved by reducing recursive operations.
(2) Mining frequent item sets
Taking the frequent pattern tree constructed from the transaction database D in Table 4 as an example, suppose the conditional pattern base of item m is to be found. First find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these ancestor nodes are the conditional pattern base of item m. As shown in Fig. 4, the conditional pattern bases of m are <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>, and the conditional pattern bases of p are <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. The conditional pattern base of each item is then taken as the input of that item's mapper stage, a conditional FP tree is created, and the item's frequent item set is mined.
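Reading a conditional pattern base straight off a block can be sketched as below. The dict layout (item mapped to {ancestor path: count}) is an assumed realization; the contents mirror the bases quoted above for the block rooted at f, so p's second base <(c:1), (b:1)>, which lives in a block rooted at c, is omitted here.

```python
# Hypothetical block for root item "f": each item maps its ancestor paths
# to their counts, so a conditional pattern base needs no tree traversal.
block_f = {
    "m": {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1},
    "p": {("f", "c", "a", "m"): 2},
}

def conditional_pattern_base(block, item):
    """Return the conditional pattern base of `item` as (path, count) pairs."""
    return [(list(path), cnt) for path, cnt in block[item].items()]

print(conditional_pattern_base(block_f, "m"))
# [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
```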
Further, the step S6 is specifically:
Step S61: take the strong association rules mined via steps S3 and S4 together with the definition domains and value domains of the relations involved: $El_{id}$ and $El_{ir}$ for $r_i$, $El_{jd}$ and $El_{jr}$ for $r_j$, and $El_{zd}$ and $El_{zr}$ for $r_z$.

Step S62: convert each strong association rule into a Horn rule according to the conversion formula, wherein $El_{id}$ and $El_{ir}$ respectively denote the definition domain and value domain of relation $r_i$, $El_{jd}$ and $El_{jr}$ those of relation $r_j$, and $El_{zd}$ and $El_{zr}$ those of relation $r_z$.
In this embodiment, two strong association rules mined by the SFP algorithm are taken as an example, and the advantage of converting strong association rules into Horn logic rules through relation-type constraints is illustrated by comparison.
It is easy to see that although formula (15) uses the relation-type constraints to generate a reasonable Horn logic rule, the directions of the relations are not always consistent, and cases similar to formula (16) are common. Using relation-type constraints can fix a direction for each relation, because the connected entities that share a variable should belong to the same label type; this makes the converted Horn logic rules more complete.
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (8)
1. A knowledge base completion method based on a PRMATC algorithm is characterized by comprising the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing data balance grouping and FP tree construction and excavation on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, digging out implicit strong association rules among the transactions according to the improved FP-Growth algorithm;
step S6, converting the obtained relation definition domains and the strong association rules into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
2. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
3. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM computation is realized by forgetting and memorizing information in the cell state; what is forgotten, memorized and output is determined by the previous hidden-layer state $h_{t-1}$ and the current input $x_t$, as given in formula (4).
In formula (4), $x_t$, $C_t$, $h_t$, $f_t$, $i_t$ and $O_t$ are respectively the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time $t$; the word vectors are input to the BILSTM layer, whose outputs are the predicted scores of every label for each word in a sentence, and these scores are input to the CRF layer.
4. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the CRF layer employs a linear-chain conditional random field $P(y|x)$ as shown in formula (5):

$$P(y|x)=\frac{1}{Z(x)}\exp\Big(\sum_{i,k}\lambda_k t_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_l s_l(y_i,x,i)\Big)$$

In formula (5), $\lambda_k$ and $\mu_l$ are weight coefficients, $t_k$ and $s_l$ are feature functions, and $Z(x)$ is the normalization factor; the output of the BILSTM layer serves as the input of the CRF layer, and after the CRF layer's feature-function and normalization operations a legal predicted label is output for each word.
5. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S3 specifically comprises:
step S31, each input sequence X = (x_1, x_2, ..., x_i, ..., x_n) is passed through the BiLSTM layer and the CRF layer to obtain all possible predicted label sequences y = (y_1, y_2, ..., y_i, ..., y_n);
the score S(X|y) of each predicted sequence y is given by the following formula:
step S32, selecting the sequence y* with the maximum score, as shown in the following formula:
y* = argmax_{y∈Y} S(X|y)
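Steps S31–S32 can be illustrated with a brute-force version of the sequence score S(X|y) and its argmax. The emission/transition decomposition below is the usual BiLSTM-CRF formulation, assumed here for illustration; real decoders use the Viterbi algorithm rather than enumeration.

```python
import itertools
import numpy as np

def score(emissions, transitions, y):
    """S(X|y): per-position label scores from the BiLSTM plus
    label-to-label transition scores from the CRF."""
    s = sum(emissions[i, y[i]] for i in range(len(y)))
    s += sum(transitions[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def best_sequence(emissions, transitions):
    """y* = argmax_{y in Y} S(X|y), exhaustive over all label sequences."""
    n, k = emissions.shape
    return max(itertools.product(range(k), repeat=n),
               key=lambda y: score(emissions, transitions, y))

emissions = np.array([[3., 0.], [0., 3.], [3., 0.]])  # 3 words, 2 labels
transitions = np.zeros((2, 2))
print(best_sequence(emissions, transitions))  # → (0, 1, 0)
```

Normalizing exp(S(X|y)) over all sequences yields the CRF probability P(y|x), with Z(x) as the partition function of formula (5).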
step S33, converting through the relation-type-constraint conversion function f, defined as:
f({t_1, t_2, ..., t_i, ..., t_n}) = (p_d, p, p_r)
where t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j) are fact triples of the relation p;
the entity classes on both sides of the relation are converted according to the following formula to obtain the domain and range of the relation.
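One way to realize the domain/range derivation of step S33 is to take the majority entity class on each side of a relation's fact triples. This is a sketch; the `entity_type` mapping and the majority vote are illustrative assumptions, not the patent's exact conversion formula.

```python
from collections import Counter

def domain_and_range(triples, relation, entity_type):
    """Derive (domain, range) of `relation` as the most frequent class
    of its subject entities and of its object entities."""
    subj = Counter(entity_type[s] for s, p, o in triples if p == relation)
    obj = Counter(entity_type[o] for s, p, o in triples if p == relation)
    return subj.most_common(1)[0][0], obj.most_common(1)[0][0]

facts = [("Marie", "bornIn", "Warsaw"),
         ("Pierre", "bornIn", "Paris"),
         ("Warsaw", "locatedIn", "Poland")]
types = {"Marie": "Person", "Pierre": "Person",
         "Warsaw": "City", "Paris": "City", "Poland": "Country"}
print(domain_and_range(facts, "bornIn", types))  # → ('Person', 'City')
```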
6. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the optimized data-balanced grouping automatically discovers highly related relations through a clustering algorithm, and then assigns the relation paths related to each relation to the same partition, thereby achieving data-balanced and mutually independent grouping.
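The balancing idea can be sketched as a greedy assignment: each relation goes to the currently smallest group, with ties broken by similarity to the group's members. The `sim(a, b)` function stands in for the clustering-based relatedness measure and is purely an assumption here.

```python
def balanced_groups(relations, sim, k):
    """Greedy sketch of data-balanced grouping into k partitions.
    Smallest group wins; ties go to the most similar group."""
    groups = [[] for _ in range(k)]
    for r in relations:
        target = min(groups,
                     key=lambda grp: (len(grp), -sum(sim(r, x) for x in grp)))
        target.append(r)
    return groups
```

Because size is the primary key, the partitions never differ by more than one relation, which is the data-balance property the claim describes.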
7. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the step S4 specifically includes:
step S41: traverse each transaction T_i item by item, processing T_i from front to back;
step S42: according to item a_1, determine whether a partition with this item as the root node exists; if so, return the partition number, otherwise add partition information with this item as the root node and return it;
step S43: according to the partition number and item a_i, first search whether an identical item with the same ancestor nodes exists; if so, increase that item's count by 1, otherwise add the item to the specified partition;
step S44: find the partition number of item m in the owned item set, then search all ancestor nodes of item m in the corresponding partition, i.e., the conditional pattern base of item m;
step S45: the conditional pattern base of m is <(f:2),(c:2),(a:2)> and <(f:1),(c:1),(a:1),(b:1)>; similarly, the conditional pattern base of p is <(f:2),(c:2),(a:2),(m:2)> and <(c:1),(b:1)>; the conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP-tree is created, and the item's frequent itemsets are mined.
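The conditional-pattern-base example in step S45 corresponds to the classic f/c/a/b/m/p transaction set. The single-machine sketch below reproduces it; the patent's parallel mapper partitioning is omitted, and the support-descending item order is fixed explicitly so the output matches the claim's example.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, order):
    """Insert each transaction, items sorted by the support-descending
    `order`, into a prefix tree; `header` links all nodes of an item."""
    root, header = Node(None, None), {}
    for t in transactions:
        node = root
        for it in sorted((i for i in t if i in order), key=order.get):
            if it not in node.children:
                node.children[it] = Node(it, node)
                header.setdefault(it, []).append(node.children[it])
            node = node.children[it]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """For each occurrence of `item`, collect its prefix path up to the
    root together with that occurrence's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

txns = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
order = {it: i for i, it in enumerate("fcabmp")}  # support-descending order
print(conditional_pattern_base(build_fp_tree(txns, order)[1], "m"))
# → [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
```

Mining each item's frequent itemsets then proceeds by building a conditional FP-tree from its pattern base, exactly as step S45 describes.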
8. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S6 specifically comprises:
step S61, obtaining the strong association rules mined through steps S3 and S4 together with the domains and ranges of the relations: El_id and El_ir for relation r_i, El_jd and El_jr for relation r_j, and El_zd and El_zr for relation r_z;
step S62, converting the strong association rules into Horn rules according to the following formula,
where El_id and El_ir respectively denote the domain and range of relation r_i, El_jd and El_jr the domain and range of relation r_j, and El_zd and El_zr the domain and range of relation r_z.
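The conversion of step S62 can be sketched by rendering a mined path rule r_i ∧ r_j ⇒ r_z as a Horn clause whose variables are typed with each relation's domain and range. The textual syntax and the chained-variable pattern below are assumptions for illustration, not the patent's formula.

```python
def to_horn_rule(body_relations, head_relation, domain, range_):
    """Render a path rule r1(x0,x1) AND r2(x1,x2) ... => head(x0,xn),
    typing each variable with the relation's domain/range class."""
    atoms, var = [], 0
    for r in body_relations:
        atoms.append(f"{r}(?x{var}:{domain[r]}, ?x{var + 1}:{range_[r]})")
        var += 1
    head = f"{head_relation}(?x0:{domain[head_relation]}, ?x{var}:{range_[head_relation]})"
    return " AND ".join(atoms) + " => " + head

dom = {"bornIn": "Person", "cityOf": "City", "nationality": "Person"}
rng = {"bornIn": "City", "cityOf": "Country", "nationality": "Country"}
print(to_horn_rule(["bornIn", "cityOf"], "nationality", dom, rng))
```

Instantiating such a clause against the knowledge base yields the new facts that step S7 adds to KB.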
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911308709.7A CN111078896B (en) | 2019-12-18 | 2019-12-18 | Knowledge base completion method based on parallel rule mining algorithm PRMATC |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111078896A true CN111078896A (en) | 2020-04-28 |
CN111078896B CN111078896B (en) | 2022-06-21 |
Family
ID=70315444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911308709.7A Active CN111078896B (en) | 2019-12-18 | 2019-12-18 | Knowledge base completion method based on parallel rule mining algorithm PRMATC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111078896B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190122111A1 (en) * | 2017-10-24 | 2019-04-25 | Nec Laboratories America, Inc. | Adaptive Convolutional Neural Knowledge Graph Learning System Leveraging Entity Descriptions |
CN110347847A (en) * | 2019-07-22 | 2019-10-18 | 西南交通大学 | Knowledge mapping complementing method neural network based |
Non-Patent Citations (1)
Title |
---|
LIU Liping, ZHANG Xinyou, NIU Xiaolu, GUO Yongkun, DING Liang: "A Survey of Parallel Association Rule Mining Algorithms Based on Spark", Computer Engineering and Applications * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115048447A (en) * | 2022-06-27 | 2022-09-13 | 华中科技大学 | Database natural language interface system based on intelligent semantic completion |
CN115048447B (en) * | 2022-06-27 | 2023-06-16 | 华中科技大学 | Database natural language interface system based on intelligent semantic completion |
CN115952361A (en) * | 2023-03-15 | 2023-04-11 | 中国科学院大学 | Dynamic recommendation system and method based on LSTM network and PPR algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN111078896B (en) | 2022-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | A learned query rewrite system using monte carlo tree search | |
WO2022205833A1 (en) | Method and system for constructing and analyzing knowledge graph of wireless network protocol, and device and medium | |
CN110347847A (en) | Knowledge mapping complementing method neural network based | |
CN111611274A (en) | Database query optimization method and system | |
Halim et al. | On the efficient representation of datasets as graphs to mine maximal frequent itemsets | |
Wu et al. | Generalized association rule mining using an efficient data structure | |
CN111078896B (en) | Knowledge base completion method based on parallel rule mining algorithm PRMATC | |
CN104137095A (en) | System for evolutionary analytics | |
CN107656978B (en) | Function dependence-based diverse data restoration method | |
Gan et al. | Explainable fuzzy utility mining on sequences | |
CN116127084A (en) | Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method | |
CN109885694B (en) | Document selection and learning sequence determination method | |
Yang et al. | A novel evolutionary method to search interesting association rules by keywords | |
CN113361279A (en) | Medical entity alignment method and system based on double neighborhood map neural network | |
CN111444316B (en) | Knowledge graph question-answering-oriented compound question analysis method | |
CN116226404A (en) | Knowledge graph construction method and knowledge graph system for intestinal-brain axis | |
CN113836174B (en) | Asynchronous SQL (structured query language) connection query optimization method based on reinforcement learning DQN (direct-to-inverse) algorithm | |
CN114662012A (en) | Community query analysis method oriented to gene regulation network | |
Lin et al. | Efficient mining of high average-utility sequential patterns from uncertain databases | |
CN110991186A (en) | Entity analysis method based on probability soft logic model | |
CN112487015B (en) | Distributed RDF system based on incremental repartitioning and query optimization method thereof | |
CN113139657B (en) | Machine thinking realization method and device | |
Liu et al. | Clumppling: cluster matching and permutation program with integer linear programming | |
Cai et al. | An improved knowledge graph model based on fuzzy theory and TransR | |
Xu et al. | Joint Entity Relation Extraction based on Graph Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||