CN111078896A - Knowledge base completion method based on PRMATC algorithm - Google Patents

Knowledge base completion method based on PRMATC algorithm

Info

Publication number
CN111078896A
CN111078896A
Authority
CN
China
Prior art keywords
item
knowledge base
algorithm
domain
relation
Prior art date
Legal status
Granted
Application number
CN201911308709.7A
Other languages
Chinese (zh)
Other versions
CN111078896B (en)
Inventor
汪璟玢
张梨贤
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911308709.7A priority Critical patent/CN111078896B/en
Publication of CN111078896A publication Critical patent/CN111078896A/en
Application granted granted Critical
Publication of CN111078896B publication Critical patent/CN111078896B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools, e.g. ontology or thesauri; ontology
    • G06F16/2465 — Query processing support for facilitating data mining operations in structured databases
    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/284 — Relational databases
    • G06F16/285 — Clustering or classification
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a knowledge base completion method based on a PRMATC algorithm, which comprises the following steps: step S1, importing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed-cluster Neo4j database; step S2, constructing and training a BILSTM-CRF model; step S3, identifying and classifying the entities on both sides of each relation with the trained BILSTM-CRF model, and converting the results to obtain the definition domain and value domain of the relation; step S4, improving the FP-Growth algorithm; step S5, mining the implicit strong association rules among the transactions; step S6, converting the obtained definition domains and value domains of the relations, together with the strong association rules, into Horn logic rules; and step S7, deriving new knowledge from the obtained Horn logic rules and adding the new knowledge to the knowledge base KB. The method can efficiently find Horn rules that represent the knowledge base, outperforms other rule mining systems in both the number and the accuracy of mined rules, and completes the knowledge base more effectively.

Description

Knowledge base completion method based on PRMATC algorithm
Technical Field
The invention relates to the field of mass data storage and reasoning under a knowledge graph, in particular to a knowledge base completion method based on a PRMATC algorithm.
Background
Mining Horn rules from a large-scale semantic network knowledge base and then using those rules to help infer and add the knowledge the base lacks is one of the most effective means of realizing dynamic growth of a knowledge base. Association rule mining is one of the important algorithms in the field of data mining; its aim is to mine the implicit relationships that exist among transactions. Conventional algorithms include the Apriori algorithm [1] and the FP-Growth algorithm [2]. Traditional association rule mining algorithms work well on small-scale data sets, but with the rapid development of internet technology in recent years, network data has grown explosively, and traditional algorithms suffer from problems such as a single node being unable to complete the computation and insufficient memory, so they cannot meet the requirements of large-scale network data.
Disclosure of Invention
In view of this, the present invention provides a knowledge base completion method based on the PRMATC algorithm, which can efficiently mine a set of Horn logic rules that represent the semantic information of a knowledge base and thus complete the knowledge base more effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
a knowledge base completion method based on a PRMATC algorithm comprises the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing the data balance grouping and the FP tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the definition domain and value domain of each obtained relation, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
Further, the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
Further, the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, wherein the forgetting, memorizing and outputting are from a previous hidden layer state ht-1And current input XtThe specific calculation formula 4 is determined.
Figure BDA0002323913680000021
In the formula (4), Xt、Ct、ht、ft、it、OtRespectively corresponding to the input, cell state, hidden layer state, forgetting gate and output gate of the model at the moment t; the word vector is input to the BILSTM layer, and the output values are the predicted scores for each label corresponding to each word in a sentence, which are input to the CRF layer.
Further, the CRF layer employs a linear conditional random field P (y | x) as shown in the following formula:
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )    (5)

In formula (5), λ_k and μ_l are weight coefficients, t_k and s_l are feature functions, and Z(x) is the normalization factor:

Z(x) = Σ_y exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )
The output of the BILSTM layer is used as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, the legal predicted label of each word is output.
Further, the step S3 is specifically:
step S31, each input triple is set to X = (x_1, x_2, ..., x_i, ..., x_n), and all possible prediction sequences y = (y_1, y_2, ..., y_i, ..., y_n) are obtained through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
S(X|y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}    (7)

In formula (7), P_{i, y_i} is the predicted score of output y_i at the i-th position and A is the transition probability matrix.
and step S32, computing the sequence y* with the maximum score, as shown in the following formula:

y* = argmax_{y∈Y} S(X|y)
step S33, converting through a relation type constraint conversion function to obtain a definition domain and a value domain of each relation in the knowledge base, wherein the relation type constraint conversion function f is as follows:
f({t_1, t_2, ..., t_i, ..., t_n}) = (p_d, p, p_r)
in the formula, t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j) denote fact triples of the relation p;

the entity classes on the two sides of the relation are converted according to the following to obtain the definition domain and value domain of the relation:

s_i SubClassOf El_si, o_i SubClassOf El_oi, s_j SubClassOf El_sj, o_j SubClassOf El_oj,
El_si SubClassOf Cf_si, El_oi SubClassOf Cf_oi, El_sj SubClassOf Cf_sj, El_oj SubClassOf Cf_oj,
p_d = Cf_si ∪ Cf_sj, p_r = Cf_oi ∪ Cf_oj

where El_si, El_oi, El_sj and El_oj respectively denote the sub-classes to which the entities s_i, o_i, s_j and o_j currently belong, and Cf_si, Cf_oi, Cf_sj and Cf_oj respectively denote the broad classes to which they belong.
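As a toy illustration of the sequence scoring in steps S31–S32, the following sketch scores every candidate label sequence with S(X|y) and picks the argmax. All emission and transition values below are made-up, and a real CRF decodes with Viterbi rather than brute-force enumeration:

```python
from itertools import product

def score(emissions, transitions, y):
    """S(X|y): sum of emission scores P[i][y_i] plus transition scores
    A[y_i][y_{i+1}], as in formula (7); START/STOP terms are omitted
    for simplicity (an assumption)."""
    s = sum(emissions[i][tag] for i, tag in enumerate(y))
    s += sum(transitions[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s

def best_sequence(emissions, transitions, labels):
    """y* = argmax over all label sequences (brute force for clarity)."""
    n = len(emissions)
    return max(product(labels, repeat=n),
               key=lambda y: score(emissions, transitions, y))

labels = ["B", "I", "O"]
emissions = [{"B": 2.0, "I": 0.1, "O": 0.5},   # per-position label scores
             {"B": 0.1, "I": 1.5, "O": 0.4}]
transitions = {"B": {"B": -1.0, "I": 1.0, "O": 0.0},
               "I": {"B": -1.0, "I": 0.5, "O": 0.0},
               "O": {"B": 0.2, "I": -5.0, "O": 0.1}}
y_star = best_sequence(emissions, transitions, labels)
```

With these values the highest-scoring sequence is ("B", "I"): its score 2.0 + 1.5 + 1.0 = 4.5 beats every alternative.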
Further, the optimized data balance grouping automatically discovers highly relevant relations through a clustering algorithm and then divides the relation paths related to each relation into the same partition, thereby achieving balanced and independent grouping of the data.
The step S4 specifically includes:
step S41: traverse the transactions T_i one by one, scanning each T_i from front to back;
step S42: according to the first item a_1, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it;
step S43: according to the block number and item a_i, first check whether an item exists that is identical to a_i and has the same ancestor nodes; if so, increase that item's count by 1, otherwise add the item to the specified block;
step S44: find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these together form the conditional pattern base of item m;
step S45: the conditional pattern base of m is <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>; similarly, the conditional pattern base of p is <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. The conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP tree is created, and the frequent item set of the item is mined.
Further, the step S6 is specifically:
step S61, taking the strong association rules mined through steps S3 and S4,

r_i ∧ r_j ⇒ r_z

together with the definition domains and value domains of the relations: r_i domain El_id, r_i range El_ir, r_j domain El_jd, r_j range El_jr, r_z domain El_zd, r_z range El_zr;

step S62, converting each strong association rule into a Horn rule according to the following formula:

r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z), where El_ir = El_jd, El_id = El_zd and El_jr = El_zr,

where El_id and El_ir respectively denote the definition domain and value domain of relation r_i, El_jd and El_jr those of relation r_j, and El_zd and El_zr those of relation r_z.
Compared with the prior art, the invention has the following beneficial effects:
the method can efficiently find the horns rule of the representative knowledge base, is better than other rule mining systems in terms of the quantity and accuracy of mining rules, and can better complement the knowledge base.
Drawings
FIG. 1 is a flow chart of a method in one embodiment of the present invention;
FIG. 2 is an exemplary diagram illustrating knowledge base completion using Horn logic rules in an embodiment of the present invention;
FIG. 3 is a block diagram of the PRMATC algorithm in accordance with one embodiment of the present invention;
FIG. 4 is a schematic diagram of the BILSTM-CRF model in accordance with an embodiment of the present invention;
FIG. 5 is a graph of inter-cluster overlap in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the optimized header table structure in an embodiment of the present invention;
FIG. 7 is a modified frequent pattern tree in an embodiment of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a knowledge base completion method based on the PRMATC algorithm, which includes the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing the data balance grouping and the FP tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the definition domain and value domain of each obtained relation, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
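Step S1 stores every fact triple in Neo4j. The patent does not show the import code; as a minimal hedged sketch, each triple <s, p, o> could be rendered as a Cypher MERGE statement so that repeated imports stay idempotent (the `Entity` node label and the statement shape are assumptions, not the patent's actual schema):

```python
def triple_to_cypher(s, p, o):
    """Build one idempotent Cypher MERGE statement that upserts a fact
    triple as two entity nodes joined by a relationship edge.
    Illustrative only: the relationship type is derived from the
    predicate by upper-casing and replacing non-alphanumerics."""
    rel = "".join(ch if ch.isalnum() else "_" for ch in p).upper()
    return (
        f"MERGE (s:Entity {{name: '{s}'}}) "
        f"MERGE (o:Entity {{name: '{o}'}}) "
        f"MERGE (s)-[:{rel}]->(o)"
    )

stmt = triple_to_cypher("YaoMing", "nationality", "China")
```

In practice each generated statement would be sent to the Neo4j cluster through a session, and string interpolation would be replaced by query parameters to avoid injection issues.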
In the present embodiment, let t = <s, p, o> denote an instance triple, where s denotes the Subject, p denotes the Predicate and o denotes the Object. An RDF data graph is composed of a number of instance triples.
A directed graph formed by a series of interconnected RDF instance triples is called an RDF data graph rg, rg = {t_1, t_2, ..., t_i, ..., t_n}, t_i = <s_i, p_i, o_i>. In each t_i, the nodes s_i and o_i are vertices of the graph and p_i is a directed edge whose start node is s_i and whose end node is o_i.
Given triples t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j), if (s_i = s_j && o_i ≠ o_j) or (s_i = o_j && o_i ≠ s_j) or (o_i = s_j && s_i ≠ o_j) or (o_i = o_j && s_i ≠ s_j), then t_i and t_j are said to be adjacent and the triples can be connected.
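The adjacency condition above can be checked directly; a small sketch:

```python
def adjacent(ti, tj):
    """Return True when two triples (s, p, o) are adjacent per the
    definition above: they share exactly one endpoint entity."""
    si, _, oi = ti
    sj, _, oj = tj
    return ((si == sj and oi != oj) or (si == oj and oi != sj)
            or (oi == sj and si != oj) or (oi == oj and si != sj))
```

For example, ("a", "p", "b") and ("b", "q", "c") are adjacent (they share entity b), while two triples with identical endpoints are not.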
Knowledge base KB = <E, R, F, P, V>, where E denotes the set of entities, R denotes the set of relations, F denotes the set of facts in the knowledge base, P denotes the set of properties, and V denotes the set of values.
Entity set E = {e_1, e_2, ..., e_n} = Subject(KB) ∪ Object(KB), which describes all entities in the semantic network knowledge base data layer and corresponds to the set of instances in RDF.
Relation set R = {r_1, r_2, ..., r_n} = Relation(KB), which represents the relations between entities.
Fact set F ⊆ E × R × E, which represents the set of all instance triples in the knowledge base.
The attribute set P = {p_1, p_2, ..., p_n} represents the set of global attributes, which associates E with the attribute values V.
The attribute value set V = {v_1, v_2, ..., v_n} represents the set of all attribute values; its elements are nodes such as text literals.
Let the entity tag set EL = {El_1, El_2, ..., El_n} denote a set of labels that can represent all entity classes in the knowledge base. For commonly used data sets such as YAGO and DBpedia, this embodiment expands PER, LOC and ORG respectively and defines 39 types as the entity tag set, denoted EL, where Cf = {PER | ORG | LOC} denotes the set of the three broad classes, as shown in Table 1.
TABLE 1 entity tag set
In this embodiment, the common BIO sequence labeling scheme is adopted, where B denotes the beginning (Begin) of an entity, I denotes its middle (Intermediate), and O (Other) labels unrelated characters.
In this embodiment, a Redis distributed in-memory database cluster stores the definition domain and value domain of each relation in the knowledge base and the Horn logic rules mined by the algorithm. The specific tables and stored contents are shown in Table 2.
TABLE 2 Redis table design and storage description
The BILSTM-CRF model in this embodiment is composed of a bidirectional LSTM and a CRF, where the bidirectional LSTM consists of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, where forgetting, memorizing and outputting are determined by the previous hidden-layer state h_{t-1} and the current input x_t, as given by formula (1):
ft=σ(Wf·[ht-1,xt]+bf)
it=σ(Wi·[ht-1,xt]+bi)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Ot=σ(Wo·[ht-1,xt]+bo)
ht=Ot*tanh(Ct)
In the formulas, x_t, C_t, h_t, f_t, i_t and O_t respectively correspond to the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time t. The word vectors are input to the BILSTM layer, whose output values are the predicted scores of each label for each word in a sentence; these scores are input to the CRF layer. This embodiment uses the BIO tagging mode, so each word corresponds to 79 label scores (a B and an I variant for each of the 39 types, plus O).
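A scalar toy version of the LSTM equations can make the gate interactions concrete. Hidden size and input size are 1, and all weights are arbitrarily set to 0.5 — purely illustrative, not the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step with scalar input and hidden state, following the
    formulas above: each gate reads [h_{t-1}, x_t] via its own weights."""
    f_t = sigmoid(w["wf_h"] * h_prev + w["wf_x"] * x_t + w["bf"])       # forget gate
    i_t = sigmoid(w["wi_h"] * h_prev + w["wi_x"] * x_t + w["bi"])       # input gate
    c_tilde = math.tanh(w["wc_h"] * h_prev + w["wc_x"] * x_t + w["bc"]) # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    o_t = sigmoid(w["wo_h"] * h_prev + w["wo_x"] * x_t + w["bo"])       # output gate
    h_t = o_t * math.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# arbitrary illustrative weights
w = {k: 0.5 for k in ["wf_h", "wf_x", "bf", "wi_h", "wi_x", "bi",
                      "wc_h", "wc_x", "bc", "wo_h", "wo_x", "bo"]}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

With zero initial state and positive weights, the gates sit strictly between 0 and 1 and the new hidden state is bounded by tanh.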
The bidirectional LSTM can effectively combine the context on both sides of a word and thus better identify entities and predict their type labels. For example, when encoding "Yao Ming nationality China", the forward LSTM receives "Yao Ming", "nationality" and "China" in order and produces three vectors hl0, hl1 and hl2, while the backward LSTM receives "China", "nationality" and "Yao Ming" in order and produces hr0, hr1 and hr2; each final vector is the concatenation of the forward and backward vectors, so every word vector carries richer contextual information and the entity recognition accuracy is higher.
CRF layer: conditional Random Field (CRF) [9] is a conditional probability distribution model for a given set of input sequences for another set of output sequences. It can be easily found that even if no CRF layer is provided, named entity recognition and prediction can be completed only through the BILSTM model, because the output of the BILSTM layer is the prediction score of each label corresponding to each word, and the label with the highest score of each word can be selected to be combined into the best prediction label. However, in many cases the highest scoring sequence is not legal, e.g., "B-PER I-PER" is valid, but "B-PER I-ORG" is not, the role of the CRF layer may add some constraints to the last predicted tag to guarantee the validity of the predicted tag. For named entity recognition sequence tagging problems, linear conditional random fields (linear-CRF) are typically employed.
The linear conditional random field P (y | x), is given by:
P(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )

In the formula, λ_k and μ_l are weight coefficients, t_k and s_l are feature functions, and Z(x) is the normalization factor:

Z(x) = Σ_y exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) )
The output of the BILSTM layer is used as the input of the CRF layer; after the CRF layer's feature-function and normalization operations, the legal predicted label of each word is output.
In this embodiment, prediction can be performed once model training is completed. Each RDF triple <s, p, o> in the knowledge base is taken as input, e.g. "Yao Ming nationality China". At prediction time, the scores of all possible prediction sequences for the input sentence are computed from the trained model parameters, and the maximum is taken. The step S3 specifically includes:
step S31, each input triple is set to X = (x_1, x_2, ..., x_i, ..., x_n), and all possible prediction sequences y = (y_1, y_2, ..., y_i, ..., y_n) are obtained through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
S(X|y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

In the formula, P_{i, y_i} is the predicted score of output y_i at the i-th position and A is the transition probability matrix.
and step S32, computing the sequence y* with the maximum score, as shown in the following formula:

y* = argmax_{y∈Y} S(X|y)
step S33, converting through a relation type constraint conversion function to obtain a definition domain and a value domain of each relation in the knowledge base, wherein the relation type constraint conversion function f is as follows:
f({t_1, t_2, ..., t_i, ..., t_n}) = (p_d, p, p_r)
in the formula, t_i = (s_i, p_i, o_i) and t_j = (s_j, p_j, o_j) denote fact triples of the relation p;

the entity classes on the two sides of the relation are converted according to the following to obtain the definition domain and value domain of the relation:

s_i SubClassOf El_si, o_i SubClassOf El_oi, s_j SubClassOf El_sj, o_j SubClassOf El_oj,
El_si SubClassOf Cf_si, El_oi SubClassOf Cf_oi, El_sj SubClassOf Cf_sj, El_oj SubClassOf Cf_oj,
p_d = Cf_si ∪ Cf_sj, p_r = Cf_oi ∪ Cf_oj

where El_si, El_oi, El_sj and El_oj respectively denote the sub-classes to which the entities s_i, o_i, s_j and o_j currently belong, and Cf_si, Cf_oi, Cf_sj and Cf_oj respectively denote the broad classes to which they belong.
In this embodiment, the optimized data balance grouping automatically discovers highly relevant relations through a clustering algorithm and then divides the relation paths related to each relation into the same partition, thereby achieving balanced and independent grouping of the data.
In this embodiment, a balanced grouping strategy is obtained by jointly considering time and space complexity: highly relevant relations are automatically discovered by a clustering algorithm, and the relation paths (transactions) related to each relation are divided into the same partition, so that the data are grouped truly uniformly and independently. Relations that share more common similar paths are more strongly coupled. Specifically, we start with |R| clusters, each cluster representing a relation r ∈ R and each point within a cluster representing a relation path associated with that relation; we then iteratively compute the distance d between each cluster and the remaining clusters. This distance essentially measures the degree of overlap between two clusters: the greater the overlap, the higher the similarity. Relations sharing a large number of common similar paths are partitioned into the same partition. The similarity of two relations is measured by the distance d between their cluster centers; the smaller d is, the higher the similarity, as shown in FIG. 5.
The center distance d between two clusters in the two-dimensional space is calculated according to the following formula:

d = sqrt((x_1 − x_2)² + (y_1 − y_2)²)

where (x_1, y_1) and (x_2, y_2) are the center coordinates of the two clusters.
As shown in fig. 5, if d satisfies case (a) or case (b) of fig. 5, the two relations are considered similar to some degree, and the smaller d is, the higher the similarity; otherwise they are treated as discrete and independent. The specific steps of the clustering algorithm are as follows.
[clustering algorithm pseudocode provided as an image in the source]
Line 3 converts each relation path into a 100-dimensional vector through word2vec; lines 4-5 reduce the high-dimensional data to 2 dimensions with t-SNE; line 6 returns each relation r and its corresponding relation path set p; lines 7-11 first perform outlier detection on each relation's path set and then determine the center coordinate O and radius m of the cluster represented by the relation through the distance function.
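Under the reading that two relation clusters are similar when their circles overlap — one interpretation of cases (a)/(b) in FIG. 5, stated here as an assumption — the center-distance test can be sketched as:

```python
import math

def center_distance(c1, c2):
    """Euclidean distance between two cluster centers in 2-D."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def similar(center1, r1, center2, r2):
    """Treat two relation clusters as similar when their circles
    overlap, i.e. the center distance is smaller than the sum of the
    radii (an assumed reading of FIG. 5)."""
    return center_distance(center1, center2) < r1 + r2

# e.g. two clusters of t-SNE-projected relation paths (made-up values)
a, b = (0.0, 0.0), (3.0, 4.0)   # center distance is 5.0
```

Clusters a (radius 3.0) and b (radius 2.5) would be grouped together, since 5.0 < 5.5; shrinking a's radius to 2.0 breaks the overlap.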
The basic process of mining frequent item sets with the FP-Growth algorithm is divided into two parts: constructing the FP tree and mining the frequent item sets from the FP tree.
(1) Building FP Tree
When the existing FP-Growth algorithm constructs a frequent pattern tree, after any transaction is inserted the transaction data set has to be updated with a sorting step, where the sort is based on the specific position of each item of the transaction in the header table. To reduce this time complexity, an algorithm that constructs the frequent pattern tree in an optimized way is proposed. The storage structure used by the algorithm is defined as follows:
linkList = {<root_i, block_i, itemSet_i>},
block_i = {<item_ij, {(frequencyItem_ijk, ancestorNode_ijk)}>},
root_i = item_i1,
itemSet_i = {item_i1, ..., item_ij}
the present embodiment is described by taking a transaction data set D as an example, and the detailed information of the data set is shown in table 4.
Table 4 transaction data D
Setting the minimum support of the data set to 3 and sorting items by support in descending order gives: f:4, c:3, a:3, b:3, m:3 and p:3. The original data set is then re-sorted according to this descending-support order; the result is shown in the rightmost column of Table 4. The SFP algorithm uses two data structures: a header table and a frequent pattern tree. The principle of the optimized header table structure is given below, as shown in fig. 6.
The pseudo code of the specific steps for constructing the frequent pattern tree is as follows:
[frequent-pattern-tree construction pseudocode provided as an image in the source]
Starting at line 3 of the code, the transactions T_i are traversed one by one, each T_i being scanned from front to back. Lines 4-7 make a judgment: according to item a_1, determine whether a block with this item as root node exists; if so, return the block number, otherwise add block information with this item as root node and return it (corresponding to ① in fig. 5). Line 8, according to the block number and item a_i, first checks whether an item exists that is identical to a_i and has the same ancestor nodes; if so, that item's count is increased by 1, otherwise the item is added to the specified block (corresponding to ② in fig. 5).
In terms of time complexity, assume that each transaction in the transaction database contains k items, that the frequent item set has m elements, and that there are n transactions in total. With the original header table structure, the time complexity of inserting one transaction into the frequent pattern tree is O(m²), so constructing the whole frequent pattern tree costs O(m² · n); with the improved linked-list structure, inserting one transaction costs O(k) and constructing the whole tree costs O(k · n). As shown in fig. 5, the left graph is the frequent pattern tree before the improvement and the right graph the frequent pattern tree after it.
Before the improvement of the frequent pattern tree, the time complexity of searching for a child node is O(m); with the improved header table structure it is reduced to O(1).
Although recursion makes the code easier to understand and simpler, the time and space overhead it causes makes the algorithm inefficient to execute, so the mining efficiency of frequent item sets can be improved by reducing recursive operations.
(2) Mining frequent item sets
Taking the frequent pattern tree constructed from the transaction database D in Table 4 as an example, suppose the conditional pattern base of item m is to be found. First find the block number of item m in the owned-item set, then search all ancestor nodes of item m in the corresponding block; these form the conditional pattern base of item m. As shown in fig. 4, the conditional pattern bases of m are <(f:2), (c:2), (a:2)> and <(f:1), (c:1), (a:1), (b:1)>, and the conditional pattern bases of p are <(f:2), (c:2), (a:2), (m:2)> and <(c:1), (b:1)>. Then the conditional pattern base of each item is taken as the input of that item's mapper stage, a conditional FP tree is created, and the frequent item set of the item is mined.
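The conditional-pattern-base step can be reproduced concretely. Since the contents of Table 4 are only available as an image, the classic FP-Growth example transactions are assumed here; they yield exactly the conditional pattern bases stated above:

```python
from collections import Counter

# Classic FP-Growth example transactions, already sorted in descending
# support order (f, c, a, b, m, p). This dataset is an assumption: the
# actual Table 4 is shown only as an image, and this one reproduces
# the conditional pattern bases quoted in the text.
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item, transactions):
    """For every transaction containing `item`, take the prefix path
    preceding it and merge identical prefixes with their counts —
    the same ancestor-node paths an FP tree stores for that item."""
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

cpb_m = conditional_pattern_base("m", transactions)
cpb_p = conditional_pattern_base("p", transactions)
```

Here cpb_m is {(f, c, a): 2, (f, c, a, b): 1} and cpb_p is {(f, c, a, m): 2, (c, b): 1}, matching <(f:2), (c:2), (a:2)>, <(f:1), (c:1), (a:1), (b:1)> and <(f:2), (c:2), (a:2), (m:2)>, <(c:1), (b:1)> in the text.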
Further, the step S6 is specifically:
step S61, taking the strong association rules mined through steps S3 and S4,

r_i ∧ r_j ⇒ r_z

together with the definition domains and value domains of the relations: r_i domain El_id, r_i range El_ir, r_j domain El_jd, r_j range El_jr, r_z domain El_zd, r_z range El_zr;

step S62, converting each strong association rule into a Horn rule according to the following formula:

r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z), where El_ir = El_jd, El_id = El_zd and El_jr = El_zr,

where El_id and El_ir respectively denote the definition domain and value domain of relation r_i, El_jd and El_jr those of relation r_j, and El_zd and El_zr those of relation r_z.
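The source gives the conversion formula of step S62 only as an image; as an illustrative assumption, a chain-shaped typed Horn rule r_i(x, y) ∧ r_j(y, z) ⇒ r_z(x, z) can be rendered as a string as follows (relation names and classes below are made-up):

```python
def to_horn_rule(ri, rj, rz, domains, ranges):
    """Render a typed chain Horn rule for a mined strong association
    rule r_i ∧ r_j ⇒ r_z. The chain shape and the variable typing are
    assumptions, since the conversion formula is only an image in the
    source; shared variables are typed by the relations' domains/ranges."""
    return (f"{domains[ri]}(x) ∧ {ranges[ri]}(y) ∧ {ri}(x, y) ∧ "
            f"{ranges[rj]}(z) ∧ {rj}(y, z) ⇒ {rz}(x, z)")

# hypothetical relations with their definition domains and value domains
domains = {"bornIn": "PER", "locatedIn": "LOC", "nationality": "PER"}
ranges = {"bornIn": "LOC", "locatedIn": "LOC", "nationality": "LOC"}
rule = to_horn_rule("bornIn", "locatedIn", "nationality", domains, ranges)
```

The produced rule reads "PER(x) ∧ LOC(y) ∧ bornIn(x, y) ∧ LOC(z) ∧ locatedIn(y, z) ⇒ nationality(x, z)": typing the shared variables is what fixes the direction of each relation, as discussed below.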
In this embodiment, two strong association rules mined by the SFP algorithm are taken as examples to illustrate, by comparison, the advantage of converting strong association rules into Horn logic rules through relation type constraints.
(1)
[strong association rule shown as an image in the source]
The generated Horn logic rule is as follows:
[Horn logic rule shown as an image in the source]
(2)
[strong association rule shown as an image in the source]
The generated Horn logic rule is as follows:
[Horn logic rule shown as an image in the source]
It is easy to see that although a reasonable Horn logic rule is generated by using the relation type constraint in expression (15), the directions of the relations are not always consistent, and many cases resemble expression (16). Using relation type constraints can fix the direction of each relation, because connecting entities that share a variable should belong to the same label type; this makes the converted Horn logic rules more complete.
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (8)

1. A knowledge base completion method based on a PRMATC algorithm is characterized by comprising the following steps:
step S1, importing and storing all fact triples and entities in a large-scale semantic network knowledge base KB into a distributed cluster Neo4j database;
step S2, constructing and training a BILSTM-CRF model;
step S3, identifying and classifying entities on two sides of the relation through a trained BILSTM-CRF model, and converting to obtain a definition domain and a value domain of the relation;
step S4, optimizing data-balanced grouping and FP-tree construction and mining on the basis of the FP-Growth algorithm to obtain an improved FP-Growth algorithm;
step S5, mining the implicit strong association rules among the transactions with the improved FP-Growth algorithm;
step S6, converting the obtained definition domains and value domains of the relations, together with the strong association rules, into Horn logic rules;
and step S7, acquiring new knowledge according to the acquired Horn logic rule, and adding the new knowledge to the knowledge base KB.
2. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the BILSTM-CRF model consists of two parts, namely a bidirectional LSTM and a CRF.
3. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the bidirectional LSTM is composed of a forward LSTM and a backward LSTM;
the LSTM calculation process is realized by forgetting and memorizing information in the cell state, wherein the forgetting, memorizing and outputting are from a previous hidden layer state ht-1And current input XtThe specific calculation formula 4 is determined.
Figure FDA0002323913670000021
In formula (4), Xt, Ct, ht, ft, it and Ot respectively denote the input, cell state, hidden-layer state, forget gate, input gate and output gate of the model at time t; the word vectors are input to the BILSTM layer, whose output values are the predicted scores of each label for each word in a sentence, and these scores are input to the CRF layer.
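The gate update summarized above can be sketched for a single scalar unit as follows (the standard LSTM equations with sigmoid gates and a tanh candidate; the weights are toy values, not trained parameters):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One timestep; w maps gate name -> (input weight, hidden weight, bias)."""
    f_t = sigmoid(w["f"][0] * x_t + w["f"][1] * h_prev + w["f"][2])  # forget gate
    i_t = sigmoid(w["i"][0] * x_t + w["i"][1] * h_prev + w["i"][2])  # input (memory) gate
    o_t = sigmoid(w["o"][0] * x_t + w["o"][1] * h_prev + w["o"][2])  # output gate
    c_hat = math.tanh(w["c"][0] * x_t + w["c"][1] * h_prev + w["c"][2])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat      # forget part of the old cell, memorize the new
    h_t = o_t * math.tanh(c_t)            # expose a gated view of the cell state
    return h_t, c_t

w = {g: (0.5, 0.3, 0.0) for g in "fioc"}  # toy shared weights per gate
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.2]:
    h, c = lstm_step(x, h, c, w)
```

A bidirectional LSTM runs this recurrence once left-to-right and once right-to-left and concatenates the two hidden states per position.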
4. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 2, wherein: the CRF layer employs a linear conditional random field P (y | x) as shown below:
Figure FDA0002323913670000022
in formula (5), λk and μl are weight coefficients, tk and sl are feature functions, and Z(x) is the normalization factor
Figure FDA0002323913670000023
And the output of the BILSTM layer is used as the input of a CRF layer, and a legal prediction label of each word is output after the CRF layer characteristic function operation and the normalization operation.
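A toy sketch of this linear-chain CRF: the score of a label sequence sums transition features (weights λk) and state features (weights μl), and P(y|x) normalizes exp(score) over all label sequences. Brute-force enumeration stands in for the forward algorithm, and the feature weights are illustrative values only:

```python
import itertools, math

LABELS = ["B", "I", "O"]
trans = {("B", "I"): 1.0, ("I", "I"): 0.5, ("O", "B"): 0.8}  # transition weights (lambda_k)
emit = {("B", 0): 1.2, ("I", 1): 0.9, ("O", 2): 1.1}         # state weights (mu_l)

def score(y):
    s = sum(trans.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))  # sum of lambda_k t_k
    s += sum(emit.get((y[i], i), 0.0) for i in range(len(y)))            # sum of mu_l s_l
    return s

def prob(y):
    # Z(x): sum of exp(score) over every possible label sequence of the same length
    Z = sum(math.exp(score(q)) for q in itertools.product(LABELS, repeat=len(y)))
    return math.exp(score(y)) / Z

p = prob(("B", "I", "O"))
```

Because Z(x) ranges over all sequences, the probabilities of all label sequences of a given length sum to one.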
5. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S3 specifically comprises:
step S31, for each input sequence X = (x1, x2, ...xi, ...xn), obtaining all possible prediction sequences y = (y1, y2, ..., yi, ...yn) through the BILSTM layer and the CRF layer;
The score S (X | y) for each predicted sequence y is shown as follows:
Figure FDA0002323913670000024
in formula (7),
Figure FDA0002323913670000031
is the score of the output yi at the i-th position, and A is the transition probability matrix;
and step S32, calculating the sequence y* with the maximum score, as shown in the following formula:
y* = argmaxy∈Y S(X|y);
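Formula (7) and the argmax above can be sketched as follows (exhaustive search over label sequences stands in for Viterbi decoding; the emission scores P and the transition matrix A hold toy values, not learned parameters):

```python
import itertools

P = [  # P[i][label]: BiLSTM emission score of each label at position i (toy values)
    {"B": 2.0, "I": 0.1, "O": 0.5},
    {"B": 0.2, "I": 1.8, "O": 0.4},
    {"B": 0.3, "I": 0.2, "O": 1.5},
]
A = {  # A[prev][next]: transition score from one label to the next (toy values)
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": 0.0, "I": 0.5, "O": 0.5},
    "O": {"B": 0.5, "I": -1.0, "O": 0.2},
}

def S(y):
    # emission part plus transition part, mirroring the score of a prediction sequence
    return sum(P[i][y[i]] for i in range(len(y))) + \
           sum(A[y[i - 1]][y[i]] for i in range(1, len(y)))

# y* = argmax over all candidate label sequences
y_star = max(itertools.product("BIO", repeat=3), key=S)
```

A real tagger replaces the exhaustive `max` with Viterbi dynamic programming, which finds the same y* in linear time per position.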
step S33, converting through a relation type constraint conversion function, wherein the relation type constraint conversion function f is as follows:
f({t1,t2,...ti,...,tn})=(pd,p,pr)
in the formula, ti = (si, pi, oi) and tj = (sj, pj, oj) denote fact triples of the relation p;
the entity classes on the two sides of the relation are converted according to the following formulas to obtain the definition domain and value domain of the relation:
si SubClassOf Elsi, oi SubClassOf Eloi, sj SubClassOf Elsj, oj SubClassOf Eloj, Elsi, Eloi, Elsj,
Figure FDA0002323913670000032
Figure FDA0002323913670000033
Figure FDA0002323913670000034
wherein Elsi, Eloi, Elsj and Eloj respectively denote the subclasses to which the entities si, oi, sj and oj belong, and
Figure FDA0002323913670000035
respectively denote the broad classes to which the entities si, oi, sj and oj belong.
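A simplified sketch of the conversion function f (the entity and class names are hypothetical): collect the classes of all subjects and all objects of a relation p and generalize each side to its most frequent class, yielding the signature (pd, p, pr):

```python
from collections import Counter

def relation_signature(triples, entity_class):
    """triples: (s, p, o) facts of a single relation; entity_class: entity -> class label."""
    p = triples[0][1]
    dom = Counter(entity_class[s] for s, _, o in triples)  # classes on the subject side
    rng = Counter(entity_class[o] for s, _, o in triples)  # classes on the object side
    # generalize each side to its most frequent class: (p_d, p, p_r)
    return dom.most_common(1)[0][0], p, rng.most_common(1)[0][0]

facts = [("paris", "cityOf", "france"), ("lyon", "cityOf", "france")]
classes = {"paris": "City", "lyon": "City", "france": "Country"}
signature = relation_signature(facts, classes)
```

In the patent the entity classes come from the trained BILSTM-CRF labels; here a plain lookup table stands in for that model.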
6. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the optimized data-balanced grouping automatically discovers highly related relations through a clustering algorithm and then assigns the relation paths related to each relation to the same partition, thereby realizing balanced and independent grouping of the data.
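The balanced-assignment half of this idea can be sketched as follows. The clustering step is abstracted away: clusters of related relation paths are assumed given, and a greedy least-loaded heuristic (an assumption, not the patent's exact procedure) assigns each cluster to one partition so partitions stay balanced and independent:

```python
def group_relations(cluster_sizes, n_partitions):
    """cluster_sizes: cluster id -> number of relation paths in the cluster.
    Returns (cluster -> partition assignment, per-partition loads)."""
    loads = [0] * n_partitions
    assignment = {}
    # place larger clusters first, always into the currently least-loaded partition
    for cid, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[cid] = target
        loads[target] += size
    return assignment, loads

assignment, loads = group_relations({"c1": 8, "c2": 5, "c3": 4, "c4": 3}, 2)
```

Keeping a whole cluster in one partition is what makes the groups independent: all paths that mention a relation end up on the same worker.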
7. The method of complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein: the step S4 specifically includes:
step S41: traversing the transactions Ti one by one, each Ti being traversed from front to back;
step S42: according to the first item a1, judging whether a partition with this item as the root node exists; if so, returning the partition number, otherwise adding partition information with this item as the root node and returning;
step S43: according to the block number and the item ai, first searching whether an identical item with the same ancestor nodes already exists; if so, adding 1 to the count of that item, otherwise adding the item to the specified block;
step S44: finding the block number of the item m in the owned-item set, then searching all ancestor nodes of the item m in the corresponding block, which form the conditional pattern base of the item m;
step S45: the conditional pattern bases of m are < (f:2), (c:2), (a:2) > and < (f:1), (c:1), (a:1), (b:1) >, and similarly the conditional pattern bases of p are < (f:2), (c:2), (a:2), (m:2) > and < (c:1), (b:1) >; taking the conditional pattern base of each item as the input of that item's mapper stage, creating a conditional FP tree, and mining the frequent item set of the item.
8. The method for complementing a knowledge base based on the PRMATC algorithm according to claim 1, wherein the step S6 specifically comprises:
step S61, mining the strong association rules obtained through steps S3 and S4
Figure FDA0002323913670000041
together with the definition domains and value domains of the relations: ri domain Elid, ri range Elir; rj domain Eljd, rj range Eljr; rz domain Elzd, rz range Elzr;
Step S62, converting the strong association rules into Horn rules according to the following formula
Figure FDA0002323913670000042
wherein Elid and Elir respectively denote the definition domain and value domain of the relation ri, Eljd and Eljr denote those of the relation rj, and Elzd and Elzr denote those of the relation rz.
CN201911308709.7A 2019-12-18 2019-12-18 Knowledge base completion method based on parallel rule mining algorithm PRMATC Active CN111078896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911308709.7A CN111078896B (en) 2019-12-18 2019-12-18 Knowledge base completion method based on parallel rule mining algorithm PRMATC


Publications (2)

Publication Number Publication Date
CN111078896A true CN111078896A (en) 2020-04-28
CN111078896B CN111078896B (en) 2022-06-21

Family

ID=70315444


Country Status (1)

Country Link
CN (1) CN111078896B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190122111A1 (en) * 2017-10-24 2019-04-25 Nec Laboratories America, Inc. Adaptive Convolutional Neural Knowledge Graph Learning System Leveraging Entity Descriptions
CN110347847A (en) * 2019-07-22 2019-10-18 西南交通大学 Knowledge mapping complementing method neural network based


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Liping, ZHANG Xinyou, NIU Xiaolu, GUO Yongkun, DING Liang: "A Survey of Parallel Association Rule Mining Algorithms Based on Spark", Computer Engineering and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048447A (en) * 2022-06-27 2022-09-13 华中科技大学 Database natural language interface system based on intelligent semantic completion
CN115048447B (en) * 2022-06-27 2023-06-16 华中科技大学 Database natural language interface system based on intelligent semantic completion
CN115952361A (en) * 2023-03-15 2023-04-11 中国科学院大学 Dynamic recommendation system and method based on LSTM network and PPR algorithm


Similar Documents

Publication Publication Date Title
Zhou et al. A learned query rewrite system using monte carlo tree search
WO2022205833A1 (en) Method and system for constructing and analyzing knowledge graph of wireless network protocol, and device and medium
CN110347847A (en) Knowledge mapping complementing method neural network based
CN111611274A (en) Database query optimization method and system
Halim et al. On the efficient representation of datasets as graphs to mine maximal frequent itemsets
Wu et al. Generalized association rule mining using an efficient data structure
CN111078896B (en) Knowledge base completion method based on parallel rule mining algorithm PRMATC
CN104137095A (en) System for evolutionary analytics
CN107656978B (en) Function dependence-based diverse data restoration method
Gan et al. Explainable fuzzy utility mining on sequences
CN116127084A (en) Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method
CN109885694B (en) Document selection and learning sequence determination method
Yang et al. A novel evolutionary method to search interesting association rules by keywords
CN113361279A (en) Medical entity alignment method and system based on double neighborhood map neural network
CN111444316B (en) Knowledge graph question-answering-oriented compound question analysis method
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
CN113836174B (en) Asynchronous SQL (structured query language) connection query optimization method based on reinforcement learning DQN (direct-to-inverse) algorithm
CN114662012A (en) Community query analysis method oriented to gene regulation network
Lin et al. Efficient mining of high average-utility sequential patterns from uncertain databases
CN110991186A (en) Entity analysis method based on probability soft logic model
CN112487015B (en) Distributed RDF system based on incremental repartitioning and query optimization method thereof
CN113139657B (en) Machine thinking realization method and device
Liu et al. Clumppling: cluster matching and permutation program with integer linear programming
Cai et al. An improved knowledge graph model based on fuzzy theory and TransR
Xu et al. Joint Entity Relation Extraction based on Graph Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant