CN106874380B - Method and device for checking triple of knowledge base - Google Patents


Info

Publication number
CN106874380B
Authority
CN
China
Prior art keywords
probability distribution
triple
factor function
distribution
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710011368.1A
Other languages
Chinese (zh)
Other versions
CN106874380A (en
Inventor
赵伟华
张日崇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Aeronautics and Astronautics
Original Assignee
Beijing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Aeronautics and Astronautics
Priority to CN201710011368.1A
Publication of CN106874380A
Application granted
Publication of CN106874380B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for checking knowledge base triples. A rule corresponding to an extended triple is acquired, a factor function corresponding to the rule is determined according to an initial factor function and an EM algorithm, and whether the extended triple is credible is determined according to the factor function. It can thereby be decided whether to put the extended triple into the knowledge base, so that the knowledge base is expanded and the accuracy of the expansion is improved.

Description

Method and device for checking triple of knowledge base
Technical Field
The invention relates to a knowledge base expansion technology, in particular to a method and a device for checking knowledge base triples.
Background
The knowledge base is a database that stores knowledge in structured form as triples, and is used to store massive amounts of knowledge in a certain field or industry. For example, a historical knowledge base may store a vast amount of knowledge in the historical domain, including individual historical figures, historical events, and the like. The knowledge base takes instances as its main objects of description and expresses knowledge with an object-oriented method, where an instance refers to a concrete or abstract thing in reality. For example, an instance may represent a person, a city, a thing, etc.
A knowledge base typically includes multiple instances; the attributes of the instances and the relationships between instances are stored as triples. The triple is the basic structure used to represent knowledge in the knowledge base. Its structure can be expressed as <first statement, relational statement, second statement>, where the relational statement expresses the relation between the first statement and the second statement.
Knowledge base expansion means that, when an original knowledge base is incomplete, unknown triples are predicted from the known triples by data mining methods, so that new triples are added to the original knowledge base and the knowledge base becomes more complete. Checking whether a new triple is credible is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention provides a method and a device for inspecting a triple of a knowledge base, which aim to overcome the defects of unreliable expanded triples and the like in the prior art.
The invention provides a method for checking knowledge base triples, which comprises the following steps:
acquiring a rule corresponding to an extended triple, wherein the extended triple is a triple obtained by performing extension operation based on an original triple in an existing knowledge base and the rule, the extended triple comprises an ordered set at least comprising a first statement, a relational statement and a second statement, and the relational statement is used for representing the relationship between the first statement and the second statement;
determining a factor function corresponding to the rule, wherein the factor function is used for representing the probability of whether the rule is correct or not, and the factor function is obtained according to an initial factor function and an EM algorithm;
and determining whether the extension triple is credible according to the factor function.
According to the method as described above, optionally, the determining whether the extension triplet is trusted according to the factor function includes:
determining, according to belief propagation and the factor function, a first probability distribution and a second probability distribution corresponding to the extended triple, the first probability distribution representing the probability that the extended triple is credible, the second probability distribution representing the probability that the extended triple is not credible, and the second probability distribution being 1 minus the first probability distribution;
and determining whether the extension triple is credible according to a target probability distribution and a preset threshold, wherein the target probability distribution is the first probability distribution or the second probability distribution.
According to the method as described above, optionally, the determining whether the extension triplet is trusted according to the target probability distribution and the preset threshold includes:
if the preset threshold is a credible threshold, the target probability distribution is the first probability distribution; if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be credible; if the target probability distribution is smaller than the preset threshold, the extension triple is determined to be not credible;
if the preset threshold is an untrusted threshold, the target probability distribution is a second probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be untrusted; and if the target probability distribution is smaller than the preset threshold, determining that the extension triple is credible.
According to the method as described above, optionally, the determining a factor function corresponding to the rule includes:
determining the factor function f(t+1) after iterative operation by the EM algorithm according to the formula:
f(t+1) = f(t) * [f'(t) / p(t)];
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are obtained in the iterative operation process of the EM algorithm.
According to the method described above, optionally, the iterative operation is stopped when the value of f (t) no longer changes.
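As an illustration of the update formula and the stopping condition above, the following minimal Python sketch applies f(t+1) = f(t) * f'(t) / p(t) repeatedly until the value stops changing. It treats the factor value as a single number. The function names, the tolerance `tol`, the round limit `max_rounds`, and the toy `toy_estimate` function (which fixes the empirical distribution at 0.8 and sets the sampling distribution equal to the current value) are assumptions made for this example only; in the described method both distributions are produced by the EM iteration itself.

```python
def update_factor(f, empirical, sampled):
    """One multiplicative update: f(t+1) = f(t) * f'(t) / p(t)."""
    return f * empirical / sampled


def iterate_factor(f0, estimate, tol=1e-9, max_rounds=1000):
    """Repeat the update until f changes by at most tol (hypothetical tolerance)."""
    f = f0
    for rounds in range(1, max_rounds + 1):
        # estimate(f) returns (empirical distribution, sampling distribution)
        empirical, sampled = estimate(f)
        f_next = update_factor(f, empirical, sampled)
        if abs(f_next - f) <= tol:
            return f_next, rounds
        f = f_next
    return f, max_rounds


def toy_estimate(f):
    # Hypothetical stand-in: empirical distribution fixed at 0.8,
    # sampling distribution equal to the current factor value.
    return 0.8, f


f_final, rounds = iterate_factor(0.3, toy_estimate)
```

With this toy estimator the update reaches a fixed point as soon as the sampling distribution matches the empirical one, which is the situation in which f(t) "no longer changes".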
Another aspect of the present invention provides an apparatus for triple verification of a knowledge base, comprising:
an acquisition module, configured to acquire a rule corresponding to an extended triple, wherein the extended triple is a triple obtained by performing an extension operation based on an original triple in an existing knowledge base and the rule, the extended triple comprises an ordered set at least comprising a first statement, a relational statement and a second statement, and the relational statement is used for representing the relationship between the first statement and the second statement;
a determining module, configured to determine a factor function corresponding to the rule, wherein the factor function is used for representing the probability of whether the rule is correct, and the factor function is obtained according to an initial factor function and an EM algorithm;
and a processing module, configured to determine whether the extension triple is credible according to the factor function.
According to the apparatus as described above, optionally, the processing module includes:
a first submodule, configured to determine, according to belief propagation and the factor function, a first probability distribution and a second probability distribution corresponding to the extended triple, where the first probability distribution is used to represent the probability that the extended triple is credible, the second probability distribution is used to represent the probability that the extended triple is not credible, and the second probability distribution is 1 minus the first probability distribution;
and the second submodule is used for determining whether the extension triple is credible according to a target probability distribution and a preset threshold, wherein the target probability distribution is the first probability distribution or the second probability distribution.
According to the apparatus as described above, optionally the second sub-module is specifically configured to:
if the preset threshold is a credible threshold, the target probability distribution is the first probability distribution; if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be credible; if the target probability distribution is smaller than the preset threshold, the extension triple is determined to be not credible;
if the preset threshold is an untrusted threshold, the target probability distribution is a second probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be untrusted; and if the target probability distribution is smaller than the preset threshold, determining that the extension triple is credible.
According to the apparatus as described above, optionally, the determining module is specifically configured to:
determining the factor function f(t+1) after iterative operation by the EM algorithm according to the formula:
f(t+1) = f(t) * [f'(t) / p(t)];
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are obtained in the iterative operation process of the EM algorithm.
According to the apparatus as described above, optionally, the determining module is further configured to:
stop the iterative operation when the value of f(t) no longer changes.
According to the method and the device for inspecting the triple of the knowledge base, the rule corresponding to the expansion triple is obtained, the factor function corresponding to the rule is determined according to the initial factor function and the EM algorithm, whether the expansion triple is credible or not is determined according to the factor function, whether the expansion triple is put into the knowledge base or not can be further determined, the knowledge base is expanded, and the accuracy of the expansion of the knowledge base is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for triple inspection of a knowledge base according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for triple verification of a knowledge base according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for triple verification of a knowledge base according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for triple verification of a knowledge base according to another embodiment of the present invention;
FIG. 5 is a factor graph constructed in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
This embodiment provides a method for checking knowledge base triples, which is used to check whether an extended triple of the knowledge base is credible. The method of this embodiment is executed by a device for knowledge base triple verification.
FIG. 1 is a schematic flow chart of the method for checking knowledge base triples. The method includes:
step 101, obtaining a rule corresponding to an extended triple, where the extended triple is a triple obtained by performing an extension operation based on an original triple and the rule in an existing knowledge base, and the extended triple includes an ordered set composed of at least a first statement, a relational statement, and a second statement, and the relational statement is used to represent a relationship between the first statement and the second statement.
The knowledge base, such as the Freebase knowledge base, is composed of a plurality of triples representing knowledge. A triple may be represented as (first statement, relational statement, second statement), where the relational statement represents the relationship between the first statement and the second statement. For example, one triple in the knowledge base is (Li Ming, nationality, China), which states that Li Ming's nationality is China, and another is (Li Ming, residence, Beijing), which states that Li Ming lives in Beijing. Expanding the knowledge base means discovering rules from the original triples in the knowledge base with a rule discovery method, and then performing an extension operation according to the original triples and the rules to obtain extended triples.
The rule discovery method may be one commonly used in the prior art, such as the association rule discovery method AMIE, or another discovery method. There are many ways to obtain extended triples according to a rule. For example, suppose the existing knowledge base contains the following original triples: (A, daughter, B), (C, husband, B), (A, daughter, C), where (A, daughter, B) denotes that A is the daughter of B, (C, husband, B) denotes that C is the husband of B, and (A, daughter, C) denotes that A is the daughter of C. A rule can then be found: (H, daughter, Y) and (Z, husband, Y) together imply (H, daughter, Z), written as (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z). If the knowledge base contains (Xiao Hong, daughter, Wang Ying) and (Zhang San, husband, Wang Ying) but no triple relating Xiao Hong and Zhang San, then according to the rule (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z) and the two existing original triples (Xiao Hong, daughter, Wang Ying) and (Zhang San, husband, Wang Ying), the extended triple (Xiao Hong, daughter, Zhang San) can be derived. In this case, the rule corresponding to the extended triple is (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z).
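The derivation in this example can be sketched in a few lines of Python. This is only an illustration of applying the single rule (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z); the `Triple` type, the function name `apply_daughter_rule`, and the entity names are hypothetical stand-ins chosen for the example and are not part of the described method.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    first: str      # first statement
    relation: str   # relational statement
    second: str     # second statement


def apply_daughter_rule(kb):
    """Apply (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z) once."""
    derived = set()
    for t1 in kb:
        if t1.relation != "daughter":
            continue
        for t2 in kb:
            # Both body triples must share the same second statement Y.
            if t2.relation == "husband" and t2.second == t1.second:
                candidate = Triple(t1.first, "daughter", t2.first)
                if candidate not in kb:  # keep only triples that are new
                    derived.add(candidate)
    return derived


# Illustrative knowledge base with two original triples.
kb = {
    Triple("Xiao Hong", "daughter", "Wang Ying"),
    Triple("Zhang San", "husband", "Wang Ying"),
}
extended = apply_daughter_rule(kb)
```

Here `extended` holds the single extended triple derived from the two original triples; whether it should actually enter the knowledge base is what the checking method decides.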
It is understood that the first statement and the second statement of each triple in a rule are not fixed; they are unknowns, like variables in an equation. Each triple that makes up a rule is called an atomic rule, and an atomic rule is an unknown triple. Original triples or extended triples in the knowledge base can be substituted into the rule to obtain an instance of the rule, that is, a set of concrete triples that satisfies the rule.
According to the process, a plurality of rules are obtained on the basis of the original triples in the knowledge base, and a plurality of extended triples are obtained by performing extension operation according to the plurality of rules and the original triples.
It can be understood that after the plurality of extended triples are obtained by performing the extension operation according to the plurality of rules and the original triples, the extension operation may be continued according to the plurality of rules, the original triples and the plurality of extended triples until no new extended triples are obtained.
All original triples and extended triples in the knowledge base that satisfy the rules are substituted into the rules to obtain the instances of the rules. It can be understood that one rule can have multiple instances.
If an extended triple appears in an instance of a rule, that rule is the rule corresponding to the extended triple.
It can be understood that all rules are used in actual operation. Since the purpose here is to check whether an extension triple is credible, this step is described only in terms of obtaining the rule corresponding to that extension triple.
Step 102, determining a factor function corresponding to the rule, wherein the factor function is used for indicating the probability of whether the rule is correct or not, and the factor function is obtained according to the initial factor function and the EM algorithm.
For each rule, it is understood that the rule may be composed of at least two atomic rules, where we refer to the number of atomic rules composing a rule as the length of the rule, and a factor function is used to indicate the probability of whether a rule is correct.
The factor function corresponding to a rule is obtained by applying the EM algorithm to the initial factor function, based on the instances of the rule and the triples involved in those instances (which may include original triples and extended triples).
It should be noted that, in actual operation, each instance of a rule is represented by its own factor function. Since a rule may correspond to multiple instances, a rule may correspond to multiple factor functions during the calculation. The factor functions corresponding to one rule are said to belong to the same family; factor functions of the same family have the same value in the same calculation step, and that value indicates the probability that the rule is correct. For convenience of description, determining the factor function corresponding to a rule therefore means determining the factor function of the instances of that rule.
The EM algorithm (expectation-maximization algorithm) is an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of the parameters of a probabilistic model containing hidden (latent) variables, and it alternates between two steps. In this embodiment, the factor function corresponding to a rule is determined through multiple iterations.
It can be understood that, in the process of calculating the factor function corresponding to the rule, since the rule instance includes not only the extended triplet but also the original triplet, and the original triplet may also correspond to other rules, information of other rules may also be involved in the process of calculating the factor function corresponding to one rule.
The initial factor function is a value initialized randomly, and can be selected according to actual needs.
Step 103, determining whether the extension triple is credible according to the factor function.
After the factor function of the instance corresponding to the extension triple is determined, the probability that the extension triple is credible or not credible is calculated according to the factor function, in order to determine whether the extension triple is credible.
Optionally, the EM algorithm may be used to calculate the credible or untrusted probability of the extended triple according to the factor function. Step 102 and step 103 are interdependent. In actual operation, the credible or untrusted probabilities of all triples (including the original triples of the knowledge base and the extended triples) are first obtained from the initialized factor functions of all instances. These probabilities are then used to update the factor functions of the instances through a series of calculations, and the updated factor functions are in turn used to update the probabilities of all triples. This is repeated until neither the probabilities of all triples nor the factor functions of all instances change any more; the final probabilities of all triples are obtained with the factor functions updated in the last round. The credible or untrusted probability of an extended triple is then read from the probabilities of all triples to determine whether the extended triple is credible.
According to the method for inspecting the triple of the knowledge base, provided by the embodiment, the rule corresponding to the expansion triple is obtained, the factor function corresponding to the rule is determined according to the initial factor function and the EM algorithm, and whether the expansion triple is credible or not is determined according to the factor function, so that whether the expansion triple is put into the knowledge base or not can be determined, the knowledge base is expanded, and the accuracy of expanding the knowledge base is improved.
Example two
The embodiment further provides a supplementary description of the method for checking the triple of the knowledge base provided in the first embodiment.
Fig. 2 is a schematic flow chart of the method for checking the triple of the knowledge base according to this embodiment. The method comprises the following steps:
step 201, obtaining a rule corresponding to an extended triple, where the extended triple is a triple obtained by performing an extension operation based on an original triple and the rule in an existing knowledge base, and the extended triple includes an ordered set composed of at least a first statement, a relational statement, and a second statement, and the relational statement is used to represent a relationship between the first statement and the second statement.
The specific operation of this step is consistent with step 101, and is not described herein again.
Step 202, determining the factor function f(t+1) after iterative operation by the EM algorithm according to the following formula:
f(t+1) = f(t) * (f'(t) / p(t))
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are calculated in the iterative operation process of the EM algorithm. The factor function is used for representing the probability of whether the rule is correct or not, and the factor function is obtained according to the initial factor function and the EM algorithm.
Step 203, determining a first probability distribution and a second probability distribution corresponding to the extended triple according to belief propagation and the factor function, wherein the first probability distribution represents the probability that the extended triple is credible, the second probability distribution represents the probability that the extended triple is not credible, and the second probability distribution is 1 minus the first probability distribution.
The EM algorithm includes two steps. The first step calculates the expectation (E) and is referred to as the E-step; in this embodiment, the first probability distribution and the second probability distribution corresponding to the extended triple are determined in this step according to belief propagation and the factor function.
The second step is the maximization (M), referred to as the M-step; in this embodiment, the factor function is updated according to the first probability distribution and the second probability distribution obtained in the previous step. The updated factor function is then used to derive a new first probability distribution and a new second probability distribution based on belief propagation. The two steps are iterated alternately, and the first probability distribution and the second probability distribution corresponding to the extension triple are finally determined after the iteration stops.
It will be appreciated that steps 202 and 203 are interdependent. In actual operation, the probability distributions (including the first probability distribution and the second probability distribution) of all triples (including the original triples of the knowledge base and the extended triples) are first obtained through belief propagation in the E-step of the EM algorithm, according to the initialized factor functions of all instances. These probability distributions are used in the M-step of the EM algorithm to update the factor functions, and the updated factor functions are used again in the E-step to update the probability distributions of all triples. The M-step is then repeated, and the iteration continues until neither the probability distributions of all triples nor the factor functions of all instances change any more. The final probability distributions of all triples are obtained by applying the E-step with the factor functions updated in the last round, which completes the iteration of the EM algorithm. The probability distribution of the extended triple can then be obtained.
Step 204, determining whether the extension triplet is credible according to a target probability distribution and a preset threshold, wherein the target probability distribution is the first probability distribution or the second probability distribution.
After determining the first probability distribution and the second probability distribution corresponding to the extended triples, determining whether the extended triples are credible according to the comparison between the first probability distribution or the second probability distribution and a preset threshold.
Optionally, the preset threshold may be set as a credible threshold, in which case the target probability distribution is the first probability distribution: when the target probability distribution is greater than or equal to the preset threshold, the extended triple is credible, and when it is smaller than the preset threshold, the extended triple is not credible.
Optionally, the preset threshold may be set as an untrusted threshold, in which case the target probability distribution is the second probability distribution: when the target probability distribution is smaller than the preset threshold, the extension triple is credible, and when it is greater than or equal to the preset threshold, the extension triple is not credible.
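The two threshold options can be sketched as one small Python function. This is a minimal illustration, assuming the "greater than or equal to" convention used in the method statement; the function name `is_credible` and the `threshold_kind` parameter are hypothetical names introduced for the example.

```python
def is_credible(first_probability, threshold, threshold_kind="credible"):
    """Decide credibility of an extended triple from its first probability distribution.

    threshold_kind (hypothetical parameter) selects how the threshold is read:
    - "credible":  compare the first probability distribution with the threshold
    - "untrusted": compare the second probability distribution (1 - first) with it
    """
    if threshold_kind == "credible":
        return first_probability >= threshold
    if threshold_kind == "untrusted":
        second_probability = 1.0 - first_probability
        # At or above the untrusted threshold means not credible.
        return second_probability < threshold
    raise ValueError("threshold_kind must be 'credible' or 'untrusted'")
```

For example, `is_credible(0.8, 0.7)` treats 0.7 as a credible threshold, while `is_credible(0.8, 0.3, "untrusted")` compares the second probability distribution 1 - 0.8 against an untrusted threshold of 0.3.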
In the method for checking knowledge base triples provided by this embodiment, the target probability distribution of the extended triple and the factor function of the rule corresponding to the extended triple are updated continuously within the EM algorithm, so that the probability that the rule is correct, that is, the reliability of the rule, is learned further on the basis of the extended knowledge base. The credibility of the extended triple is calculated from the factor function of the corresponding rule, and the calculation of that factor function involves the credibility of the original triples in the knowledge base. The credibility of the extended triple therefore also takes the global correlation between pieces of knowledge into account, so that the knowledge base can be extended efficiently, accurately and with high quality.
EXAMPLE III
The embodiment specifically exemplifies the method for checking the triple of the knowledge base provided in the above embodiment.
For example, 5 original triples are selected from the knowledge base and numbered e1 to e5 in sequence; 4 rules are found according to the knowledge base and numbered r1 to r4 in sequence; and 3 extended triples are obtained by performing the extension operation according to the rules and numbered e6, e7 and e8 in sequence.
For each original triple, the probability that it is credible is represented by a third probability distribution and the probability that it is not credible by a fourth probability distribution; for each extended triple, the probability that it is credible is represented by a first probability distribution and the probability that it is not credible by a second probability distribution. The fourth probability distribution is 1 minus the third probability distribution, and the second probability distribution is 1 minus the first probability distribution. For example, the third probability distributions of the 5 original triples are denoted b1 to b5 in sequence and their fourth probability distributions (1-b1) to (1-b5); the first probability distributions of the 3 extended triples are denoted a1 to a3 and their second probability distributions (1-a1) to (1-a3). The credible or not credible states of the 5 original triples and the 3 extended triples are denoted x1 to x8 in sequence, and x = [x1, x2, x3, x4, x5, x6, x7, x8] denotes the states of all triples, where x1 to x8 are binary variables taking the value 0 or 1 that correspond to triples e1 to e8 respectively; 0 indicates that the triple is not credible and 1 indicates that it is credible.
For example, 5 triples are selected from the knowledge base, and 4 rules have been discovered, as shown in tables 1 and 2.
TABLE 1 original triplet
Figure BDA0001204802820000101
TABLE 2 rules
Number Rule
r1 (H, residence, Y) + (Y, country, Z) => (H, nationality, Z)
r2 (H, block, Y) + (Y, commercial, Z) => (H, lower city, Z)
r3 (H, nationality, Y) + (H, birthplace, Z) => (Z, country, Y)
r4 (H, lower city, Y) => (H, country, Y)
Within one rule, H denotes the same unknown statement, but H in different rules does not necessarily denote the same statement; the same holds for Y and Z. An instance of a rule, or an extended triple, is therefore obtained by substituting the first and second statements of existing triples that satisfy the relational statements of that rule.
An extension operation is performed according to the original triples and the rules to obtain extended triples, as shown in Table 3.
TABLE 3 Extended triples
The extension operation is then continued according to the original triples, the extended triples, and the rules to obtain new extended triples, as shown in Table 4.
TABLE 4 New extended triples
Figure BDA0001204802820000103
Finally, the data comprising the original triples and all the extended triples are obtained, as shown in Table 5.
TABLE 5 All triples and their variables x
Triple Number Variable
(B,R1,A) e1 x1
(A,R2,C) e2 x2
(B,R3,D) e3 x3
(E,R4,C) e4 x4
(D,R5,E) e5 x5
(B,R6,C) e6 x6
(D,R7,C) e7 x7
(D,R2,C) e8 x8
The instances of all rules obtained from the original triples and the extended triples are:
Instance 1: (B, R1, A) + (A, R2, C) => (B, R6, C)
Instance 2: (B, R3, D) + (B, R6, C) => (D, R2, C)
Instance 3: (D, R5, E) + (E, R4, C) => (D, R7, C)
Instance 4: (D, R7, C) => (D, R2, C)
Instance 1 is an instance of rule r1, instance 2 of rule r3, instance 3 of rule r2, and instance 4 of rule r4. When many triples are selected from the knowledge base, several instances may correspond to one rule; the present case is only an example and not a limitation.
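The extension operation and the collection of rule instances can be sketched as a small forward-chaining loop. This is a minimal sketch, not the patented implementation: the function names `bind`, `rule_instances`, and `extend` are hypothetical, and the pairing of the relations R1 to R7 with the rules r1 to r4 is an assumption read off instances 1 to 4.

```python
from itertools import product

# Original triples e1..e5 from Table 5, as (first statement, relation, second statement).
ORIGINAL = [("B", "R1", "A"), ("A", "R2", "C"), ("B", "R3", "D"),
            ("E", "R4", "C"), ("D", "R5", "E")]

# Each rule is (body atoms, head atom); H, Y and Z are placeholders.
# Assumption: this pairing of R1..R7 with r1..r4 is inferred from instances 1 to 4.
RULES = {
    "r1": ([("H", "R1", "Y"), ("Y", "R2", "Z")], ("H", "R6", "Z")),
    "r2": ([("H", "R5", "Y"), ("Y", "R4", "Z")], ("H", "R7", "Z")),
    "r3": ([("H", "R6", "Y"), ("H", "R3", "Z")], ("Z", "R2", "Y")),
    "r4": ([("H", "R7", "Y")], ("H", "R2", "Y")),
}

def bind(atom, triple, binding):
    """Extend `binding` so that `atom` matches `triple`, or return None."""
    if atom[1] != triple[1]:
        return None
    new = dict(binding)
    for placeholder, value in ((atom[0], triple[0]), (atom[2], triple[2])):
        if new.setdefault(placeholder, value) != value:
            return None
    return new

def rule_instances(rule, triples):
    """All (body triples, derived triple) pairs obtained by substituting into `rule`."""
    body, head = rule
    found = []
    for combo in product(triples, repeat=len(body)):
        binding = {}
        for atom, triple in zip(body, combo):
            binding = bind(atom, triple, binding)
            if binding is None:
                break
        else:
            found.append((combo, (binding[head[0]], head[1], binding[head[2]])))
    return found

def extend(triples, rules):
    """Repeat the extension operation until no new instance appears."""
    known, instances = list(triples), []
    changed = True
    while changed:
        changed = False
        for name, rule in rules.items():
            for body, head in rule_instances(rule, known):
                if (name, body, head) not in instances:
                    instances.append((name, body, head))
                    changed = True
                if head not in known:
                    known.append(head)
    return known, instances

known, instances = extend(ORIGINAL, RULES)
```

Under these assumptions, `known[5:]` holds the three extended triples e6, e7, and e8, and `instances` holds one entry per instance listed above.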
A factor graph is constructed from all the triples and instances, as shown in Fig. 5, where q = [q1, q2, q3, q4, q5, q6, q7, q8] denotes the probability distributions of whether each triple is credible. For an original triple such as e1, q1 = b1 when x1 = 1 and q1 = 1-b1 when x1 = 0; for an extended triple such as e7, q7 = a2 when x7 = 1 and q7 = 1-a2 when x7 = 0. f1 to f4 denote the factor functions of the four instances, as shown in Table 6.
TABLE 6 Correspondence between the triples, the variables x, and the functions q
Figure BDA0001204802820000111
Figure BDA0001204802820000121
For example, if a rule consists of 3 atomic rules (e.g., (H, daughter, Y) + (Z, husband, Y) => (H, daughter, Z)), each instance of the rule is composed of three triples. Instance 1 above is composed of the triples e1, e2, and e6, whose variables are x1, x2, and x6, and its factor function is written f1 = [f11, f12, f13, f14, f15, f16, f17, f18], where f11 to f18 are the probabilities that the rule corresponding to the instance is correct for each of the 8 combinations of 0 and 1 taken by the three variables, as shown in Table 7. If a rule consists of 2 atomic rules, as in instance 4, the factor function is f4 = [f41, f42, f43, f44].
For convenience of the later description, the circular nodes of the factor graph are called variable nodes and the square nodes are called factor nodes; the factor nodes comprise the q nodes and the f nodes.
TABLE 7
x1 x2 x6 f1
0 0 0 f11
0 0 1 f12
0 1 0 f13
0 1 1 f14
1 0 0 f15
1 0 1 f16
1 1 0 f17
1 1 1 f18
This embodiment has 4 instances and accordingly 4 factor functions, f = [f1, f2, f3, f4], where f denotes the set of all factor functions.
The factor functions of the 4 instances are shown in Table 8:
TABLE 8
f2=[f21,f22,f23,f24,f25,f26,f27,f28]
f3=[f31,f32,f33,f34,f35,f36,f37,f38]
f4=[f41,f42,f43,f44]
After the factor graph is constructed, the factor functions and the probability distributions q of the variables corresponding to all the triples are obtained iteratively by the EM algorithm.
The EM algorithm (Expectation Maximization algorithm) is an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of the parameters of a probabilistic model that contains hidden (latent) variables. The calculation alternates between two steps:
the first step computes the expectation (E) and is called the E-step: the current estimate of the parameters is used to compute the distribution of the hidden variables. In this embodiment, this step computes the probability distribution q of the variable x (including the first probability distribution and the second probability distribution);
the second step is the maximization (M) and is called the M-step: the parameters are re-estimated by maximizing the likelihood obtained in the E-step. In this embodiment, the factor function is updated based on the probability distribution of the variables obtained in the preceding E-step.
The factor function obtained in the M-step is used in the next E-step, and the two steps alternate until neither the distribution q of the variable x nor the factor function changes any more. The first probability distribution and the second probability distribution then no longer change either and can be used to determine whether the extended triples are credible.
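The alternation between the two steps can be sketched as a generic loop. This is only an illustration of the stopping criterion, assuming that q maps each variable to a pair {P(x=0), P(x=1)} and f maps each factor to a table; `run_em`, `flatten`, and the two toy step functions are hypothetical stand-ins, not the E-step and M-step of this embodiment.

```python
def flatten(q, f):
    """List every number in q and f in a fixed order so two states can be compared."""
    values = [p for name in sorted(q) for p in q[name]]
    values += [f[name][key] for name in sorted(f) for key in sorted(f[name])]
    return values

def run_em(q, f, e_step, m_step, tol=1e-6, max_rounds=100):
    """Alternate E-step and M-step until q and f stop changing (within tol)."""
    for rounds in range(1, max_rounds + 1):
        new_q = e_step(q, f)        # distributions of the variables
        new_f = m_step(new_q, f)    # updated factor functions
        change = max(abs(a - b)
                     for a, b in zip(flatten(q, f), flatten(new_q, new_f)))
        q, f = new_q, new_f
        if change < tol:
            break
    return q, f, rounds

# Toy stand-ins (hypothetical): they merely converge towards {0.2, 0.8}.
def toy_e_step(q, f):
    p0, p1 = q["x"]
    return {"x": ((p0 + 0.2) / 2, (p1 + 0.8) / 2)}

def toy_m_step(q, f):
    return {"f": {(0,): q["x"][0], (1,): q["x"][1]}}

q, f, rounds = run_em({"x": (0.5, 0.5)}, {"f": {(0,): 0.5, (1,): 0.5}},
                      toy_e_step, toy_m_step)
```

The tolerance `tol` and the cap `max_rounds` are assumptions of this sketch; the text only requires that q and f no longer change.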
The specific process is as follows:
(1) initialization
After the factor graph is constructed, all q and f need to be initialized.
1) Initialization of q
As described above, for each triple, e.g. e1, q1 is a binary function giving the probability that the variable x1 takes 0 or 1, written {1-b1, b1}, where the first entry is the probability that the variable takes 0 and the second the probability that it takes 1. An original triple already exists in the knowledge base, so its q may be initialized to {0.01, 0.99}. For an extended triple, q is initialized to be consistent with the correctness of the rule from which it was derived.
All q's in this example are initialized as follows:
the initialization values of q1, q2, q3, q4 and q5 are all {0.01,0.99 };
q6 initialization value is {0.2,0.8 };
q7 initialization value is {0.3,0.7 };
the q8 initialization value is {0.1,0.9}.
2) The initialization of the factor function f is random initialization as shown in table 9.
TABLE 9
(x1,x2,x6) f1 (x3,x6,x8) f2 (x4,x5,x8) f3 (x7,x8) f4
(0,0,0) 0.125 (0,0,0) 0.125 (0,0,0) 0.125 (0,0) 0.25
(0,0,1) 0.125 (0,0,1) 0.125 (0,0,1) 0.125 (0,1) 0.25
(0,1,0) 0.125 (0,1,0) 0.125 (0,1,0) 0.125 (1,0) 0.25
(0,1,1) 0.125 (0,1,1) 0.125 (0,1,1) 0.125 (1,1) 0.25
(1,0,0) 0.125 (1,0,0) 0.125 (1,0,0) 0.125 - -
(1,0,1) 0.125 (1,0,1) 0.125 (1,0,1) 0.125 - -
(1,1,0) 0.125 (1,1,0) 0.125 (1,1,0) 0.125 - -
(1,1,1) 0.125 (1,1,1) 0.125 (1,1,1) 0.125 - -
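The initialization of q and f above can be written down directly. The sketch below assumes a dictionary layout (variable name to {P(x=0), P(x=1)}, factor name to a table keyed by 0/1 tuples); the names `SCOPES`, `init_q`, and `init_factors` are hypothetical, and the factor scopes are copied from the column headers of Table 9.

```python
from itertools import product

# Variables attached to each factor node, as in the headers of Table 9.
SCOPES = {
    "f1": ("x1", "x2", "x6"),
    "f2": ("x3", "x6", "x8"),
    "f3": ("x4", "x5", "x8"),
    "f4": ("x7", "x8"),
}

def init_q():
    """Initial (P(x=0), P(x=1)) for every variable, using the values listed above."""
    q = {f"x{i}": (0.01, 0.99) for i in range(1, 6)}   # original triples e1..e5
    q.update({"x6": (0.2, 0.8), "x7": (0.3, 0.7), "x8": (0.1, 0.9)})
    return q

def init_factors(scopes):
    """One table per factor with an equal entry for every 0/1 assignment (Table 9)."""
    factors = {}
    for name, scope in scopes.items():
        size = 2 ** len(scope)
        factors[name] = {bits: 1.0 / size
                         for bits in product((0, 1), repeat=len(scope))}
    return factors

q = init_q()
f = init_factors(SCOPES)
```

Here every entry of f1 to f3 is 0.125 and every entry of f4 is 0.25, matching Table 9.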
(2)E-step
Belief propagation is an algorithm that computes the marginal distributions of the variable nodes in a factor graph. The variable x8 and the factors connected to it are taken as an example to explain the belief propagation process.
Factor nodes pass information to variable nodes (first round): s(f->x) and s(q->x) denote the information passed to the variable node x by the factor node f and the factor node q respectively. They are initialized to the marginal distributions of x obtained from f and q:
s1(q8->x8)={0.1,0.9};
s1(f2->x8)={0.5,0.5};
s1(f3->x8)={0.5,0.5};
s1(f4->x8)={0.5,0.5};
where s1 denotes the first round of information passed from factor nodes to variable nodes. s1(q8->x8) is the information passed in the first round from the factor node q8 to the variable node x8 connected to it, i.e. the initialization value {0.1,0.9} of q8. s1(f2->x8)={0.5,0.5} is obtained by adding the entries of the initialized f2 for which x8=0 and, separately, the entries for which x8=1, and passing the resulting {0.5,0.5} to the variable node x8. The messages from f3 and f4 are obtained in the same way as that from f2 and are not described again.
The factor nodes pass information to the other variable nodes x1 to x7 in the same way as to the x8 node, which is not described again here.
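The first-round message from a factor node is a marginalization of its table. A minimal sketch, assuming the factor table is a dictionary keyed by 0/1 tuples in the order of its scope; the function name `marginal_message` is hypothetical.

```python
from itertools import product

def marginal_message(table, scope, target):
    """Sum a factor table over every variable except `target`.

    Returns [mass with target=0, mass with target=1].
    """
    position = scope.index(target)
    message = [0.0, 0.0]
    for assignment, value in table.items():
        message[assignment[position]] += value
    return message

# Initialized f2 from Table 9: eight equal entries over (x3, x6, x8).
f2_scope = ("x3", "x6", "x8")
f2 = {bits: 0.125 for bits in product((0, 1), repeat=3)}

s1_f2_x8 = marginal_message(f2, f2_scope, "x8")
```

With the Table 9 values, four entries of 0.125 fall on each side, which gives the {0.5,0.5} stated for s1(f2->x8).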
Variable nodes pass information to factor nodes (first round): we denote by h (x- > f) the information passed by the variable node x to the factor node f, and there are:
h1(x8->f2)=s1(q8->x8)*s1(f3->x8)*s1(f4->x8)={0.1*0.5*0.5,0.9*0.5*0.5};
h1(x8->f3)=s1(q8->x8)*s1(f2->x8)*s1(f4->x8)={0.1*0.5*0.5,0.9*0.5*0.5};
h1(x8->f4)=s1(q8->x8)*s1(f2->x8)*s1(f3->x8)={0.1*0.5*0.5,0.9*0.5*0.5};
where h1 denotes the first round of information passed from variable nodes to factor nodes. h1(x8->f2) is the information passed in the first round from the variable node x8 to the factor node f2 connected to it: the messages passed to x8 in the previous step by the factor node q8 and by the nodes f3 and f4, that is, by every factor node connected to x8 other than f2, are multiplied together and passed to f2. The messages from x8 to f3 and f4 are obtained in the same way. Here * denotes element-wise multiplication of binary functions: multiplying {c1, d1} by {c2, d2} gives the new binary function {c1*c2, d1*d2}.
The other variable nodes pass information to the factor nodes connected to them in the same way as x8, and the details are not repeated here.
In this embodiment, information is passed only between two connected nodes, either from a factor node to a variable node or from a variable node to a factor node.
The result of the multiplication is normalized:
h1(x8->f2)={0.1*0.5*0.5,0.9*0.5*0.5}={0.025,0.225}={0.1,0.9};
h1(x8->f3)={0.1*0.5*0.5,0.9*0.5*0.5}={0.025,0.225}={0.1,0.9};
h1(x8->f4)={0.1*0.5*0.5,0.9*0.5*0.5}={0.025,0.225}={0.1,0.9};
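The variable-to-factor message and its normalization can be checked with a few lines. This is a sketch under the assumption that the incoming messages are stored per sender; the name `variable_to_factor` is hypothetical.

```python
def variable_to_factor(incoming, exclude):
    """Multiply every message into a variable except the one from `exclude`,
    then normalize so the two entries sum to 1."""
    p0 = p1 = 1.0
    for sender, (m0, m1) in incoming.items():
        if sender != exclude:
            p0 *= m0
            p1 *= m1
    total = p0 + p1
    return (p0 / total, p1 / total)

# First-round messages into x8, as given above.
incoming_x8 = {"q8": (0.1, 0.9), "f2": (0.5, 0.5),
               "f3": (0.5, 0.5), "f4": (0.5, 0.5)}

h1_x8_f2 = variable_to_factor(incoming_x8, exclude="f2")
h1_x8_f3 = variable_to_factor(incoming_x8, exclude="f3")
```

Both messages reduce to {0.025,0.225} before normalization and to approximately {0.1,0.9} after it, as in the lines above.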
Factor nodes pass information to variable nodes (second round):
Taking the variable node x8 as an example, the factor nodes connected to it are the q8, f2, f3, and f4 nodes, so that
s2(q8->x8)=s1(q8->x8);
s2(f2->x8)=s1(f2->x8)&h1(x3->f2)&h1(x6->f2);
s2(f3->x8)=s1(f3->x8)&h1(x4->f3)&h1(x5->f3);
s2(f4->x8)=s1(f4->x8)&h1(x7->f4);
where s2 denotes the second round of information passed from factor nodes to variable nodes. s2(q->x) is the information passed from a q node to an x node; it does not change during one E-step and is replaced by the new probability distribution q only when the E-step is run again after the M-step. s2(f->x) is the information passed from the factor node f to the variable node x in the second round: the information s1(f->x) passed from f to x in the first round is combined with the information h1(x'->f) passed to f in the first round by the other variable nodes x' connected to f. The operator & denotes combination multiplication of binary functions of the form {c, d}: one entry is taken from each function in every possible way and the entries are multiplied, which for three binary functions gives an 8-valued function. The entries with x8=0 are then added, and likewise the entries with x8=1, to give a new binary function, as shown in Table 10.
For example, s1(f2->x8)={0.5,0.5} as above. The normalized values of h1(x3->f2) and h1(x6->f2) are not given above (both are computed and stored in actual operation), so to illustrate the process it is assumed that h1(x3->f2)={0.2,0.8} and h1(x6->f2)={0.3,0.7}. Then:
s2(f2->x8)={0.5,0.5}&{0.2,0.8}&{0.3,0.7}={0.03,0.07,0.12,0.28,0.03,0.07,0.12,0.28}={0.5,0.5}
where the 8 values correspond to the 8 combinations of 0 and 1 taken by the variables x8, x3, and x6:
TABLE 10
x8 x3 x6 s2(f2->x8)
0 0 0 0.5*0.2*0.3=0.03
0 0 1 0.5*0.2*0.7=0.07
0 1 0 0.5*0.8*0.3=0.12
0 1 1 0.5*0.8*0.7=0.28
1 0 0 0.5*0.2*0.3=0.03
1 0 1 0.5*0.2*0.7=0.07
1 1 0 0.5*0.8*0.3=0.12
1 1 1 0.5*0.8*0.7=0.28
Likewise, the information that f3 and f4 convey to x8 can be obtained: s2(f3- > x8) and s2(f4- > x8)
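The second-round computation of Table 10 can be reproduced as follows. The sketch mirrors the worked numbers as they are described above rather than a general belief-propagation implementation; `factor_to_variable` is a hypothetical name, and the two h1 messages are the values assumed in the text.

```python
from itertools import product

def factor_to_variable(previous, others):
    """Combination-multiply the previous message with the other variables'
    messages, then add up the entries for target=0 and for target=1."""
    table = {}
    for bits in product((0, 1), repeat=1 + len(others)):
        value = previous[bits[0]]            # bits[0] is the target variable
        for message, bit in zip(others, bits[1:]):
            value *= message[bit]
        table[bits] = value
    summed = [0.0, 0.0]
    for bits, value in table.items():
        summed[bits[0]] += value
    return table, summed

s1_f2_x8 = (0.5, 0.5)
h1_x3_f2 = (0.2, 0.8)   # assumed in the text
h1_x6_f2 = (0.3, 0.7)   # assumed in the text

# Keys of table10 are (x8, x3, x6), the column order of Table 10.
table10, s2_f2_x8 = factor_to_variable(s1_f2_x8, [h1_x3_f2, h1_x6_f2])
```

The eight entries of `table10` are the rows of Table 10, and adding them by the value of x8 gives approximately {0.5,0.5}.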
Variable nodes pass information to factor nodes (second round): the calculation method is the same as the first round, and is not described herein again.
Factor nodes pass information to variable nodes (third round): the calculation method is the same as the second round, and is not described herein again.
The above process is iterated until the information delivered changes little (i.e., converges).
After the iteration is complete (algorithm convergence), we compute the probability distribution of the variables based on the information passed by the factors to the variable nodes. Also taking variable x8 as an example, assume convergence after 15 iterations:
q8=s15(q8->x8)*s15(f2->x8)*s15(f3->x8)*s15(f4->x8)
wherein s15(q8- > x8) is s1(q8- > x 8).
For convenience of explanation, assume:
q8={0.5*0.5*0.4*0.35,0.5*0.5*0.6*0.65}={0.035,0.0975}
After normalization, q8={0.264,0.736}; that is, the probability distribution of the variable x8 obtained by the first E-step is {0.264,0.736}, so the first probability distribution of the corresponding triple e8 (the probability that it is credible) is 0.736 and the second probability distribution is 0.264.
Likewise, the probability distributions of other variables may be found and will not be described further herein.
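The final step of the E-step for x8 is a product of all incoming messages followed by normalization. A minimal sketch using the illustrative message values assumed above; `belief` is a hypothetical name, and reading the entry for x8=1 as the probability that e8 is credible follows the {P(x=0), P(x=1)} ordering used for q.

```python
def belief(messages):
    """Multiply all incoming messages entry by entry and normalize."""
    p0 = p1 = 1.0
    for m0, m1 in messages:
        p0 *= m0
        p1 *= m1
    total = p0 + p1
    return (p0 / total, p1 / total)

# Illustrative converged messages into x8, as assumed in the text.
q8 = belief([(0.5, 0.5), (0.5, 0.5), (0.4, 0.6), (0.35, 0.65)])

p_not_credible, p_credible = q8   # entry 0 is x8=0, entry 1 is x8=1
```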
This result is obtained from the first E-step only. It is used in the M-step to update the factor function f; the E-step is then run again with the updated factor function to obtain a new set of probability distributions of the variables, which is again used in the M-step. The iteration is repeated until the probability distributions of the variables and the updated factor functions no longer change. The probability distributions of the variables obtained at that point are the target probability distributions and serve as the standard for judging whether the corresponding triples are credible.
(3)M-step
Updating of the factor function:
For convenience of explanation, the empirical distribution of the factor functions in this step is denoted f’(0), i.e., f’(0)=[f1’(0),f2’(0),f3’(0),f4’(0)], and is obtained according to the following formulas:
f1’(0)=q1^q2^q6;
f2’(0)=q3^q6^q8;
f3’(0)=q4^q5^q8;
f4’(0)=q7^q8;
where q1 to q8 are the probability distributions of the variables x1 to x8 obtained in the E-step; f1’(0) to f4’(0) are the empirical distributions of the factor functions used in the outer cycle t=0 and are 8-valued or 4-valued functions of the same form as f1 described above; and ^ denotes combination multiplication.
The empirical distribution of a factor function remains constant throughout one complete M-step, because it is determined by the probability distributions of the variables obtained in the preceding E-step, i.e.
f1’(t)=f1’(3)=f1’(2)=f1’(1)=f1’(0);
Only after the M-step has finished and a new E-step has produced new probability distributions of the variables is a new empirical distribution of the factor functions obtained.
Taking f1’(0) as an example, if q1, q2, and q6 obtained in the E-step are {0.01,0.99}, {0.01,0.99}, and {0.1,0.9} respectively, the empirical distribution f1’(0) is calculated as shown in Table 11.
TABLE 11
(x1,x2,x6) f1’(0)
(0,0,0) 0.01*0.01*0.1=0.00001
(0,0,1) 0.01*0.01*0.9=0.00009
(0,1,0) 0.01*0.99*0.1=0.00099
(0,1,1) 0.01*0.99*0.9=0.00891
(1,0,0) 0.99*0.01*0.1=0.00099
(1,0,1) 0.99*0.01*0.9=0.00891
(1,1,0) 0.99*0.99*0.1=0.09801
(1,1,1) 0.99*0.99*0.9=0.88209
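The combination multiplication that produces Table 11 can be sketched as follows, assuming each q is stored as {P(x=0), P(x=1)}; the function name `empirical_distribution` is hypothetical.

```python
from itertools import product

def empirical_distribution(marginals):
    """Combination multiplication: one entry per 0/1 assignment, equal to the
    product of the corresponding entries of the variables' distributions."""
    table = {}
    for bits in product((0, 1), repeat=len(marginals)):
        value = 1.0
        for q, bit in zip(marginals, bits):
            value *= q[bit]
        table[bits] = value
    return table

q1 = (0.01, 0.99)
q2 = (0.01, 0.99)
q6 = (0.1, 0.9)

# Keys are (x1, x2, x6), the column order of Table 11.
f1_empirical = empirical_distribution([q1, q2, q6])
```

Because each q sums to 1, the eight entries of the result also sum to 1.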
Next, the sampling distribution of the factor function is obtained by sampling. The specific process is as follows:
First, initialize t=0, where t is the number of M-step outer cycles completed;
next, randomly initialize each variable x1 to x8 to a value (0 or 1), for example m=[x1,x2,x3,x4,x5,x6,x7,x8]=[0,1,0,0,1,1,0,1], where m is an array, referred to as the sample data, that stores the sampled values of the variables;
then, initialize the factor function f randomly, as in the E-step, which is not described again here.
Finally, the cycle is as follows:
while t<T (T=50, where T is the maximum number of M-step outer cycles; other values are possible and this is not a limitation):
1) if t=0, L takes a larger value (e.g., L=100, where L is the maximum number of M-step inner cycles), and if t>0, L takes a smaller value (e.g., L=30);
initialize l=0, where l is the number of M-step inner cycles completed;
while l<L:
1. For each variable x1 to x8, the probability distribution of the variable is computed with the values of the other variables fixed, and the value of the variable is sampled from that distribution.
Taking the variable x8 as an example, assume that x1 to x7 take the values [0,1,0,0,1,1,0]. The current factor functions have been initialized and can be regarded as known. Since f2, f3, and f4 involve x8, the probability distribution of x8 with respect to each of f2, f3, and f4 can be obtained. Taking f3 as an example, let f3(0) denote the initialized f3, and assume that it takes the values shown in Table 12. Since x4 and x5 take the values [0,1], the distribution of x8 with respect to f3 is {0.12,0.18}, which is normalized so that its entries sum to 1, giving {0.4,0.6}. The distributions of x8 with respect to f2 and f4 are obtained similarly and are assumed to be {0.3,0.7} and {0.2,0.8}. The probability distribution of x8 is the normalized product of these 3 distributions, i.e. {0.4*0.3*0.2,0.6*0.7*0.8}, which normalizes to {0.067,0.933}. x8 is then sampled according to this distribution and the sample array m is updated.
TABLE 12
(x4,x5,x8) f3(0)
(0,0,0) 0.025
(0,0,1) 0.025
(0,1,0) 0.12
(0,1,1) 0.18
(1,0,0) 0.05
(1,0,1) 0.05
(1,1,0) 0.10
(1,1,1) 0.45
The sampled values of the other variables x1 to x7 are obtained in the same way. Once every variable in m has been updated, one group of sampled values is obtained and stored.
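One sampling step for x8 can be sketched as follows. This is a minimal illustration: the helper names `conditional`, `combine`, and `sample_bit` are hypothetical, the table is f3(0) from Table 12, and the distributions with respect to f2 and f4 are the values assumed in the text.

```python
import random

def conditional(table, scope, target, state):
    """Distribution of `target` read from a factor table with the other
    variables of the scope fixed to their values in `state`, normalized."""
    position = scope.index(target)
    weights = []
    for bit in (0, 1):
        key = tuple(bit if i == position else state[name]
                    for i, name in enumerate(scope))
        weights.append(table[key])
    total = sum(weights)
    return [w / total for w in weights]

def combine(distributions):
    """Normalized entry-wise product of several two-entry distributions."""
    p0 = p1 = 1.0
    for d0, d1 in distributions:
        p0 *= d0
        p1 *= d1
    return [p0 / (p0 + p1), p1 / (p0 + p1)]

def sample_bit(distribution, rng):
    """Draw 1 with probability distribution[1], otherwise 0."""
    return 1 if rng.random() < distribution[1] else 0

# f3(0) from Table 12, keyed by (x4, x5, x8).
f3 = {(0, 0, 0): 0.025, (0, 0, 1): 0.025, (0, 1, 0): 0.12, (0, 1, 1): 0.18,
      (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.45}

state = {"x4": 0, "x5": 1}                       # current values in m
from_f3 = conditional(f3, ("x4", "x5", "x8"), "x8", state)
p_x8 = combine([from_f3, (0.3, 0.7), (0.2, 0.8)])

rng = random.Random(0)                           # fixed seed, for repeatability only
new_x8 = sample_bit(p_x8, rng)
```

Sweeping this step over x1 to x8 and storing the resulting array m gives one group of sampled values.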
2. Update l=l+1 and enter the next inner cycle.
Each inner cycle starts from the group of sampled values produced by the previous cycle. After L cycles the inner loop ends and L groups of sampled values have been obtained. The sampling distribution of each factor function is calculated from the M groups of sampled values produced by the last M cycles (e.g., M=15), as follows:
Taking f3 as an example, assume that the sampled values of (x4,x5,x8) in the last 15 cycles are as summarized in Table 13. For convenience of description, p3(0) denotes the sampling distribution of f3 in the outer cycle t=0.
TABLE 13
(x4,x5,x8) Occurrences in the 15 groups of sampled values p3(0)
(0,0,0) 1 1/15
(0,0,1) 2 2/15
(0,1,0) 1 1/15
(0,1,1) 1 1/15
(1,0,0) 0 0/15
(1,0,1) 1 1/15
(1,1,0) 3 3/15
(1,1,1) 6 6/15
The sampling distribution of several other factor functions can be obtained as well, and will not be described herein again.
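The sampling distribution is a table of relative frequencies. A minimal sketch that rebuilds 15 sample groups with the counts of Table 13 and converts them to p3(0); the name `sampling_distribution` is hypothetical.

```python
from collections import Counter
from itertools import product

def sampling_distribution(samples, arity):
    """Relative frequency of every 0/1 assignment among the stored samples."""
    counts = Counter(samples)
    return {bits: counts[bits] / len(samples)
            for bits in product((0, 1), repeat=arity)}

# Occurrence counts of (x4, x5, x8) from Table 13.
table13_counts = {(0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 1, (0, 1, 1): 1,
                  (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 3, (1, 1, 1): 6}

# Expand the counts back into 15 individual sample groups.
samples = [bits for bits, n in table13_counts.items() for _ in range(n)]

p3 = sampling_distribution(samples, arity=3)
```

Note that a combination that never occurs, such as (1,0,0) here, receives the value 0 in p3(0).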
The factor function is updated as follows, taking f3 as an example:
f3(1)=f3(0)*(f3’(0)/p3(0))
where f3(1) is the value of the factor function f3 after the update in the outer cycle t=0, f3(0) is the initialized value of f3, f3’(0) is the empirical distribution of f3 obtained from the probability distributions of the variables in the E-step, and p3(0) is the sampling distribution of f3 for t=0.
Likewise, for t=1, f3(2)=f3(1)*(f3’(1)/p3(1));
where f3(2) is the value of the factor function f3 after the update in the outer cycle t=1, f3’(1) is the empirical distribution of f3 for t=1, and p3(1) is the sampling distribution of f3 for t=1, and so on.
Updated values of other factor functions can be obtained as well, and are not described herein again.
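The update f(t+1)=f(t)*(f’(t)/p(t)) is applied entry by entry. The sketch below uses a 4-valued factor with hypothetical numbers, since the text gives no numeric empirical distribution for this step. The text also does not say what to do when an entry of the sampling distribution is 0; leaving such an entry unchanged is an assumption of this sketch, and `update_factor` is a hypothetical name.

```python
def update_factor(current, empirical, sampled):
    """Entry-wise update f(t+1) = f(t) * f'(t) / p(t).

    Assumption: an entry whose sampled frequency is 0 is left unchanged,
    because the ratio is undefined there.
    """
    updated = {}
    for key, value in current.items():
        if sampled[key] > 0:
            updated[key] = value * empirical[key] / sampled[key]
        else:
            updated[key] = value
    return updated

# Hypothetical numbers for a 4-valued factor over (x7, x8).
f4_t = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
f4_empirical = {(0, 0): 0.03, (0, 1): 0.07, (1, 0): 0.27, (1, 1): 0.63}
p4 = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.6}

f4_next = update_factor(f4_t, f4_empirical, p4)
```

An entry grows when the empirical distribution exceeds the sampled frequency and shrinks when it falls below it.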
If the same rule r3 has several instances, f3’(0) is the sum of the empirical distributions obtained from those instances and, similarly, p3(0) is the sum of the sampling distributions obtained from them.
For example, if three triples with variables x9, x10, and x11 also form an instance of rule r3, an empirical distribution and a sampling distribution are obtained for them by the same process as for x4, x5, and x8; f3’(0) in the formula is then the sum of the two empirical distributions and p3(0) the sum of the two sampling distributions.
Update t=t+1 and enter the next outer cycle.
When T outer cycles have been completed, a group of updated factor functions f1(T), f2(T), f3(T), and f4(T) is obtained.
The factor functions obtained at the end of the M-step loop are used to initialize the factor functions of the next E-step, and the E-step and M-step continue to alternate until the factor functions and the probability distributions of the variables no longer change (i.e. converge). The factor functions obtained at that point are the final factor functions and can be used to judge the correctness of the rules.
After the EM algorithm finishes, the probability distributions of all variables are obtained. The probability distribution of an extended triple is its target probability distribution, which is either the first probability distribution or the second probability distribution. A preset threshold is set, and whether the extended triple is credible is determined from the target probability distribution and the preset threshold.
For example, assume that the probability distribution of the extended triple e8 is {0.1,0.9}, i.e. the first probability distribution is 0.9 and the second probability distribution is 0.1. If the preset threshold is a credibility threshold i (e.g. i=0.98), the target probability distribution is the first probability distribution, 0.9; the extended triple is considered credible and can be placed in the knowledge base when the first probability distribution is greater than i, and is considered not credible and is not placed in the knowledge base when it is less than i, as is the case here since 0.9<0.98. If the preset threshold is a non-credibility threshold j (e.g. j=0.2), the target probability distribution is the second probability distribution, 0.1; the extended triple is credible when the second probability distribution is less than j and not credible when it is greater than j, so here e8 is credible since 0.1<0.2.
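The two threshold tests can be sketched as a single helper. The function name `is_credible` and its keyword arguments are hypothetical; the distribution is assumed to be ordered {P(x=0), P(x=1)}, as for q above.

```python
def is_credible(distribution, credible_threshold=None, not_credible_threshold=None):
    """Decide whether an extended triple is credible.

    distribution is (second probability distribution, first probability
    distribution), i.e. (P(not credible), P(credible)).
    Exactly one of the two thresholds is expected.
    """
    p_not_credible, p_credible = distribution
    if credible_threshold is not None:
        return p_credible > credible_threshold
    if not_credible_threshold is not None:
        return p_not_credible < not_credible_threshold
    raise ValueError("a threshold is required")

e8 = (0.1, 0.9)
with_i = is_credible(e8, credible_threshold=0.98)        # 0.9 > 0.98 is false
with_j = is_credible(e8, not_credible_threshold=0.2)     # 0.1 < 0.2 is true
```

The two thresholds of the example lead to different decisions for e8, so the choice of threshold type and value determines how strictly the knowledge base is extended.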
It should be noted that, the formula f (t +1) ═ f (t) × (f' (t)/p (t)) for determining the factor function can be derived by the following procedure:
after a factor graph is constructed, the learning problem of the credibility of the rule is converted into the learning of the factor:
suppose that
Figure BDA0001204802820000211
Representing a factor graph in which the sets of variable nodes and factor nodes are respectivelyAnd
Figure BDA0001204802820000213
for any one
Figure BDA0001204802820000214
In (2), we denote by N (u) the set of nodes connected to node u, where u denotes
Figure BDA0001204802820000215
One node in the set, for example, the node connected to the factor node f2 in the embodiment of the present invention, has variable nodes x3, x6, and x 8. By using
Figure BDA0001204802820000216
Representing any one set, for any one setWe use
Figure BDA0001204802820000218
Represents from
Figure BDA0001204802820000219
To
Figure BDA00012048028200002110
Of a set of mapping functions, wherein
Figure BDA00012048028200002111
To represent
Figure BDA00012048028200002112
Is that
Figure BDA00012048028200002113
Is selected from the group consisting of (a) a subset of,comprises
Figure BDA00012048028200002115
Some or all of the variable nodes. By usingTo representIn a set of unknown functions belonging to the same family, i.e.The method includes representing a set of a plurality of functions, that is, representing a set of factor functions belonging to the same family, where each function in the set corresponds to the same rule, and values of the functions are consistent, for example, each rule in the embodiment of the present invention is an unknown function summarized from a plurality of instances belonging to the same family. For the
Figure BDA00012048028200002119
A function of f:
Figure BDA00012048028200002120
there are m (f) variables, and each variable is derived from
Figure BDA00012048028200002121
Taking the value in the step (1). Let factor graph
Figure BDA00012048028200002122
Each section ofDot
Figure BDA00012048028200002123
And a function
Figure BDA00012048028200002124
In the context of a correlation, the correlation,
Figure BDA00012048028200002125
denotes guIs that
Figure BDA00012048028200002126
Wherein m (g) isu) | n (u) |. Where u represents a factor node in the factor graph, guRepresenting a factor function corresponding to the factor node u; n (u) represents a set of variable nodes directly connected to the factor node u; m (g)u) Where | n (u) | represents the size of the set of variable nodes directly connected to the factor node u, i.e. the number of variable nodes, it should be noted that, for different nodes u,guand gu′May be the same, i.e. if guAnd gu′Corresponding to the same rule, then guAnd gu′The same is true. For each function
Figure BDA00012048028200002128
We denote the aggregate by ^ (f)
Figure BDA00012048028200002129
I.e., the expression of ^ (f)
Figure BDA00012048028200002130
The set of all nodes in relation to the function f, i.e. the ^ (f) representation
Figure BDA00012048028200002131
The set of all functions belonging to the same family as the function f, i.e. all factor functions in the set correspond to the same rule. Then
Figure BDA00012048028200002132
Product form of representative function
Figure BDA00012048028200002133
Assuming that the product represents a set of random variables
Figure BDA00012048028200002134
Joint distribution of
Figure BDA00012048028200002135
Wherein the content of the first and second substances,representing a set of variable nodes in the factor graph;representation and variable node
Figure BDA00012048028200002138
Related variables, where each variable is consistent with each variable in embodiments of the present invention;
Figure BDA00012048028200002139
representing a set of such variables, e.g. a set of three variables corresponding to a factor function in the embodiment, i.e. here
Figure BDA0001204802820000221
I.e., variable x in the embodiment of the present invention, since
Figure BDA0001204802820000222
Is unknown, meaning that this joint distribution is also unknown.
Assuming an empirical distribution is observed
Figure BDA0001204802820000223
Representing a joint distribution
Figure BDA0001204802820000224
The empirical distribution of (3) is an empirical distribution of a factor function calculated from an empirical distribution of variables obtained from an E-step of the EM algorithm in an embodiment of the present invention. The objective is to estimate each factor function based on this empirical distribution
Figure BDA0001204802820000225
Specifically, by the pair
Figure BDA0001204802820000226
Sampling M times, each value
Figure BDA0001204802820000227
Observe that
Figure BDA0001204802820000228
times; the observed set {a^(i): i=1,2,…,M} is denoted D.
By usingRepresenting logp (D | F), where logp (D | F) represents the likelihood function given the observation set D and p (D | F) represents the posterior distribution of F, then we need to maximize
Figure BDA00012048028200002210
For each one
Figure BDA00012048028200002211
By D:SDenotes the mapping of the combination observed in D on S, e.g. a set of variables S ═ x3, x6, x8]May be [0, 0]],[0,0,1],…,[1,1,1]. For each one
Figure BDA00012048028200002212
And any one vector
Figure BDA00012048028200002213
By mS(a) The number of times the observed combination is a is indicated. For any purpose
Figure BDA00012048028200002214
By a:SThe subvector mapped on S is represented, that is, a is represented for the specific value of the variable combination S.
We define:
Figure BDA00012048028200002215
wherein the content of the first and second substances,
Figure BDA00012048028200002216
representing a set of variables, e.g. (x3, x6, x8), a representing a specific value of the set of variables, e.g., [0,0],…,[1,1,1]Then, there is,
Figure BDA00012048028200002217
is represented by A
Figure BDA00012048028200002218
The following can be obtained:
:=A-M log z(10)
now for each
Figure BDA00012048028200002219
We derive
Figure BDA00012048028200002220
So as to obtain the compound with the characteristics of,
Figure BDA00012048028200002221
and
Figure BDA00012048028200002222
is that
Figure BDA00012048028200002223
The function of (a):
Figure BDA00012048028200002224
wherein, Λ*(f) The combination of all the variables corresponding to the set of functions Λ (f) is represented.
For each a ∈ ^ a*(f) The method comprises the following steps:
Figure BDA0001204802820000231
introduction 1: for each one
Figure BDA0001204802820000232
The above-mentioned partial derivative formula is set to 0 for each
Figure BDA0001204802820000234
The following can be obtained:
Figure BDA0001204802820000235
moving f (b) to one side of the equation as ft+1(b) An iterative formula of the algorithm can be obtained:
Figure BDA0001204802820000236
wherein the content of the first and second substances,a sampling distribution obtained by sampling in the t-th round, for example, a sampling distribution p3(0) of a factor function calculated by an example in the embodiment of the present invention, is shown.
It should be noted that, for convenience of description, the iterative formula (16) is expressed as:
f(t+1)=f(t)*[f’(t)/p(t)],
where f(t) is the value of the factor function in round t, i.e. f_t(b) in formula (16); t is an integer greater than or equal to 0 whose initial value is 0; f(0) is the initialized value of the factor function; and f’(t) is the empirical distribution of the factor function in round t, i.e. in formula (16)
Figure BDA0001204802820000238
p (t) represents the sampling distribution of the factor function in the t-th round, i.e. in equation (16)
Figure BDA0001204802820000239
According to the method for checking triples of a knowledge base, the target probability distribution of each extended triple and the factor function of the rule corresponding to that triple are updated alternately by the EM algorithm, so that the probability that the rule is correct, that is, the credibility of the rule, is learned on the basis of the extended knowledge base. The credibility of an extended triple is calculated from the factor function of its rule, and that factor function in turn depends on the credibility of the original triples in the knowledge base. The credibility of an extended triple therefore takes the global correlation among the items of knowledge into account, and the knowledge base can be extended efficiently, accurately, and with high quality.
Example four
The embodiment provides a device for checking triple of a knowledge base, which is used for executing the method for checking triple of a knowledge base in the first embodiment.
Fig. 3 is a schematic structural diagram of the apparatus for checking a triple of a knowledge base provided in this embodiment. The apparatus 40 for checking triple of knowledge base of the present embodiment includes an obtaining module 41, a determining module 42 and a processing module 43.
The obtaining module 41 is configured to obtain a rule corresponding to an extended triple, where the extended triple is a triple obtained by performing an extension operation based on an original triple in an existing knowledge base and the rule, the extended triple includes an ordered set comprising at least a first statement, a relational statement, and a second statement, and the relational statement represents the relationship between the first statement and the second statement; the determining module 42 is configured to determine a factor function corresponding to the rule obtained by the obtaining module 41, where the factor function indicates the probability that the rule is correct and is obtained according to an initial factor function and an EM algorithm; the processing module 43 is configured to determine whether the extended triple is credible according to the factor function determined by the determining module 42.
With regard to the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
According to the device for checking triples of a knowledge base, the rule corresponding to the extended triple is obtained, the factor function corresponding to the rule is determined according to the initial factor function and the EM algorithm, and whether the extended triple is credible is determined according to the factor function. It can then be decided whether to add the extended triple to the knowledge base, so that the knowledge base is expanded and the accuracy of the expansion is improved.
Example five
This embodiment further supplements the apparatus for checking triples of a knowledge base in the fourth embodiment, so as to perform the method for checking triples of a knowledge base in the second embodiment.
As shown in fig. 4, the processing module 43 in the apparatus 40 for knowledge base triple verification of the present embodiment includes a first sub-module 51 and a second sub-module 52.
The first sub-module 51 is configured to determine, according to belief propagation and the factor function, a first probability distribution and a second probability distribution corresponding to the extended triple, where the first probability distribution indicates the probability that the extended triple is credible, the second probability distribution indicates the probability that the extended triple is not credible, and the second probability distribution equals 1 minus the first probability distribution; the second sub-module 52 is configured to determine whether the extended triple is credible according to a target probability distribution and a preset threshold, where the target probability distribution is the first probability distribution or the second probability distribution.
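The decision made by the second sub-module can be sketched in Python as follows, using the comparison rule stated in claims 3 and 7: with a credible threshold the first probability is compared, and with an untrusted threshold the second probability is compared. The function name `is_credible`, the `threshold_kind` parameter, and the example values are hypothetical and serve only as an illustration.

```python
def is_credible(p_credible: float, threshold: float,
                threshold_kind: str = "credible") -> bool:
    """Decide whether an extended triple is credible.

    p_credible is the first probability (the triple is credible);
    the second probability (the triple is not credible) is 1 - p_credible.
    """
    if not 0.0 <= p_credible <= 1.0:
        raise ValueError("p_credible must lie in [0, 1]")
    if threshold_kind == "credible":
        # Target distribution is the first probability:
        # credible when it reaches the threshold
        return p_credible >= threshold
    if threshold_kind == "untrusted":
        # Target distribution is the second probability:
        # credible only when it stays below the threshold
        return (1.0 - p_credible) < threshold
    raise ValueError("threshold_kind must be 'credible' or 'untrusted'")


# Hypothetical values
accept_a = is_credible(0.9, 0.8)                 # credible threshold
accept_b = is_credible(0.4, 0.5, "untrusted")    # untrusted threshold
```

Because the two probabilities sum to 1, either threshold type can express the same cut-off; which one is used is a matter of how the preset threshold is specified.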
Optionally, the determining module 42 is specifically configured to determine the factor function f(t+1) after the iterative operation of the EM algorithm according to the following formula:
f(t+1)=f(t)*[f’(t)/p(t)];
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are obtained during the iterative operation of the EM algorithm.
Optionally, the determining module 42 is further configured to stop the iterative operation when the value of f (t) no longer changes.
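The stopping condition can be sketched as a loop that repeats the update until f(t) no longer changes. In the Python sketch below, the `estimate` callback stands in for the part of the EM iteration that produces the empirical and sampling distributions. The callback, the tolerance `tol` used to decide that the value "no longer changes", the `max_rounds` cap, and the toy distributions are all assumptions made for illustration and are not specified in the patent.

```python
from typing import Callable, List, Sequence, Tuple

Distributions = Tuple[Sequence[float], Sequence[float]]


def iterate_factor(f0: Sequence[float],
                   estimate: Callable[[List[float]], Distributions],
                   tol: float = 1e-9,
                   max_rounds: int = 1000) -> Tuple[List[float], int]:
    """Repeat f(t+1) = f(t) * [f'(t) / p(t)] until f(t) stops changing."""
    f = list(f0)
    for t in range(max_rounds):
        # Hypothetical hook: returns (empirical, sampling) for the current round;
        # sampling probabilities are assumed to be strictly positive
        empirical, sampling = estimate(f)
        f_next = [v * (e / p) for v, e, p in zip(f, empirical, sampling)]
        # "No longer changes" is approximated by a small assumed tolerance
        if max(abs(a - b) for a, b in zip(f_next, f)) <= tol:
            return f_next, t + 1
        f = f_next
    return f, max_rounds


def toy_estimate(f: List[float]) -> Distributions:
    """Toy stand-in for an EM round: a fixed empirical distribution and a
    sampling distribution equal to the normalized current factor values."""
    total = sum(f)
    return [0.2, 0.8], [v / total for v in f]


f_final, rounds = iterate_factor([0.5, 0.5], toy_estimate)
```

In this toy setting the factor values move to the empirical distribution in the first round and the second round confirms that nothing changes, so the loop stops.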
With regard to the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
According to the device for checking triples of a knowledge base, the target probability distribution of the extended triple and the factor function of the rule corresponding to the extended triple are updated continuously within the EM algorithm, so that the probability that the rule is correct, i.e. the credibility of the rule, is further learned on the basis of the extended knowledge base. The credibility of the extended triple is calculated from the factor function of its corresponding rule, and the calculation of that factor function involves the credibility of the original triples in the knowledge base. The credibility of the extended triple therefore also takes the global correlation among knowledge into account, so that the knowledge base can be expanded efficiently, with high quality and high accuracy.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of triple verification of a knowledge base, comprising:
acquiring a rule corresponding to an extended triple, wherein the extended triple is a triple obtained by performing extension operation based on an original triple in an existing knowledge base and the rule, the extended triple comprises an ordered set at least comprising a first statement, a relational statement and a second statement, and the relational statement is used for representing the relationship between the first statement and the second statement;
determining a factor function corresponding to the rule, wherein the factor function is used for representing the probability of whether the rule is correct or not, and the factor function is obtained according to an initial factor function and an EM algorithm;
determining whether the extension triple is credible according to the factor function;
the determining a factor function corresponding to the rule includes:
determining the factor function f (t +1) after iterative operation by the EM algorithm according to the formula:
f(t+1)=f(t)*[f’(t)/p(t)];
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are obtained in the iterative operation process of the EM algorithm.
2. The method of claim 1, wherein the determining whether the extended triplet is trustworthy according to the factor function comprises:
determining, according to belief propagation and the factor function, a first probability distribution and a second probability distribution corresponding to the extended triple, the first probability distribution representing a probability that the extended triple is trustworthy, the second probability distribution representing a probability that the extended triple is untrustworthy, and the second probability distribution being equal to 1 minus the first probability distribution;
and determining whether the extension triple is credible according to a target probability distribution and a preset threshold, wherein the target probability distribution is the first probability distribution or the second probability distribution.
3. The method of claim 2, wherein the determining whether the extension triplet is trustworthy according to the target probability distribution and the preset threshold comprises:
if the preset threshold is a credible threshold, the target probability distribution is the first probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be credible; if the target probability distribution is smaller than the preset threshold, the extension triple is determined to be not credible;
if the preset threshold is an untrusted threshold, the target probability distribution is a second probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be untrusted; and if the target probability distribution is smaller than the preset threshold, determining that the extension triple is credible.
4. A method according to any of claims 1 to 3, wherein the iterative operation is stopped when the value of f (t) no longer changes.
5. An apparatus for triple verification of a knowledge base, comprising:
an acquisition module, configured to acquire a rule corresponding to an extended triple, wherein the extended triple is a triple obtained by performing an extension operation based on an original triple in an existing knowledge base and the rule, the extended triple comprises an ordered set at least comprising a first statement, a relational statement and a second statement, and the relational statement is used for representing the relationship between the first statement and the second statement;
the determining module is used for determining a factor function corresponding to the rule, the factor function is used for representing the probability of whether the rule is correct, and the factor function is obtained according to an initial factor function and an EM algorithm;
the processing module is used for determining whether the extension triple is credible according to the factor function;
the determining module is specifically configured to:
determining the factor function f (t +1) after iterative operation by the EM algorithm according to the formula:
f(t+1)=f(t)*[f’(t)/p(t)];
wherein f(t) represents the value of the factor function in the t-th round, t is an integer greater than or equal to 0, the initial value of t is 0, f(0) is the value of the initialized factor function, f'(t) represents the empirical distribution of the factor function in the t-th round, p(t) represents the sampling distribution of the factor function in the t-th round, and the empirical distribution and the sampling distribution are obtained in the iterative operation process of the EM algorithm.
6. The apparatus of claim 5, wherein the processing module comprises:
a first submodule, configured to determine, according to belief propagation and the factor function, a first probability distribution and a second probability distribution corresponding to the extended triple, where the first probability distribution is used to represent a probability that the extended triple is trustworthy, the second probability distribution is used to represent a probability that the extended triple is untrustworthy, and the second probability distribution is equal to 1 minus the first probability distribution;
and the second submodule is used for determining whether the extension triple is credible according to a target probability distribution and a preset threshold, wherein the target probability distribution is the first probability distribution or the second probability distribution.
7. The apparatus of claim 6, wherein the second submodule is specifically configured to:
if the preset threshold is a credible threshold, the target probability distribution is the first probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be credible; if the target probability distribution is smaller than the preset threshold, the extension triple is determined to be not credible;
if the preset threshold is an untrusted threshold, the target probability distribution is a second probability distribution, and if the target probability distribution is greater than or equal to the preset threshold, the extension triple is determined to be untrusted; and if the target probability distribution is smaller than the preset threshold, determining that the extension triple is credible.
8. The apparatus of any of claims 5-7, wherein the determining module is further configured to:
stop the iterative operation when the value of f(t) no longer changes.
CN201710011368.1A 2017-01-06 2017-01-06 Method and device for checking triple of knowledge base Active CN106874380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710011368.1A CN106874380B (en) 2017-01-06 2017-01-06 Method and device for checking triple of knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710011368.1A CN106874380B (en) 2017-01-06 2017-01-06 Method and device for checking triple of knowledge base

Publications (2)

Publication Number Publication Date
CN106874380A CN106874380A (en) 2017-06-20
CN106874380B true CN106874380B (en) 2020-01-14

Family

ID=59164781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710011368.1A Active CN106874380B (en) 2017-01-06 2017-01-06 Method and device for checking triple of knowledge base

Country Status (1)

Country Link
CN (1) CN106874380B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569335B (en) 2018-03-23 2022-05-27 百度在线网络技术(北京)有限公司 Triple verification method and device based on artificial intelligence and storage medium
CN111506623B (en) * 2020-04-08 2024-03-22 北京百度网讯科技有限公司 Data expansion method, device, equipment and storage medium
CN113204650B (en) * 2021-05-14 2022-03-11 深圳市曙光信息技术有限公司 Evaluation method and system based on domain knowledge graph
CN113901151B (en) * 2021-09-30 2023-07-04 北京有竹居网络技术有限公司 Method, apparatus, device and medium for relation extraction

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100815563B1 (en) * 2006-08-28 2008-03-20 한국과학기술정보연구원 System and method for knowledge extension and inference service based on DBMS
CN103500208B (en) * 2013-09-30 2016-08-17 中国科学院自动化研究所 Deep layer data processing method and system in conjunction with knowledge base
CN104866498A (en) * 2014-02-24 2015-08-26 华为技术有限公司 Information processing method and device

Also Published As

Publication number Publication date
CN106874380A (en) 2017-06-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant