CN106951963B

CN106951963B - Knowledge refining method and device

Info

Publication number: CN106951963B
Application number: CN201710197975.1A
Authority: CN
Inventors: 赵朋朋; 李春华; 许佳捷; 崔志明
Original assignee: Suzhou University
Current assignee: Suzhou Chuhui Intelligent Technology Co.,Ltd.
Priority date: 2017-03-29
Filing date: 2017-03-29
Publication date: 2020-05-22
Anticipated expiration: 2037-03-29
Also published as: CN106951963A

Abstract

The invention discloses a knowledge refining method and a knowledge refining device, which are characterized in that a candidate knowledge subset in an automatically extracted knowledge base is obtained; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on semantic constraint rules, and the first preset number is less than or equal to the preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback result. Namely, the knowledge in the knowledge base which is automatically extracted is refined based on the crowdsourcing platform, namely, the noise of the knowledge base which is automatically extracted is removed by manual marking, so that the knowledge quality in the knowledge base is higher. And a preset number of candidate knowledge subsets are selected to implement crowdsourcing tasks, so that the improvement of knowledge quality is maximized under the condition of limited resources. Therefore, the method and the device are beneficial to improving the knowledge quality in the knowledge base with automatic extraction.

Description

Knowledge refining method and device

Technical Field

The invention relates to the field of machine learning, in particular to a knowledge refining method and a knowledge refining device.

Background

In recent years, machine learning techniques and natural language processing techniques have been applied to many information extraction systems. The information extraction system can automatically extract knowledge from massive Web data to construct a knowledge base.

The knowledge base formed by automatic extraction contains a large number of entities and entity relations, but due to the limitations of data sources and extraction algorithms used by an extraction system, the knowledge base is often noisy and unreliable. To improve the knowledge quality of the knowledge base, i.e. to remove noise in the knowledge base, knowledge algorithms may be used to reduce noise.

However, because the knowledge base is so large, the information extraction system generally uses only simple heuristic rules to reason about the uncertainty and contradiction of the knowledge in order to reduce noise. Furthermore, the knowledge base contains facts whose correctness is difficult for such an algorithm to judge, and the processing capacity and precision of the algorithm are very limited. As a result, considerable noise remains in the knowledge base, its reliability and dependability are low, and its knowledge quality is low. In summary, how to improve the knowledge quality of an automatically extracted knowledge base is an urgent problem to be solved in the art.

Disclosure of Invention

The invention aims to provide a knowledge refining method and a knowledge refining device, and aims to solve the problem that the knowledge quality in an automatically extracted knowledge base is low in the prior art.

In order to solve the technical problem, the invention provides a knowledge refining method, which comprises the following steps:

acquiring a candidate knowledge subset in an automatically extracted knowledge base;

selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;

issuing a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;

and carrying out denoising operation on the knowledge base according to the task feedback result.

Optionally, the selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm includes:

calculating to obtain a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence of the used knowledge extraction algorithm;

according to the contradictory relation semantic constraint rules in the semantic constraint rules, calculating to obtain a second numerical value representing the degree of contradiction of the candidate knowledge subsets;

calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;

selecting the knowledge subsets with the first preset number from the candidate knowledge subsets according to the evaluation score, and taking the knowledge subsets as the optimal knowledge subsets;

wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.

Optionally, after the selecting of the first preset number of knowledge subsets from the candidate knowledge subsets according to the evaluation score, the method further includes:

generating a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;

taking each knowledge subset as a vertex, and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;

selecting a second preset number of first optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertexes as optimal knowledge subsets;

the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.

Optionally, the performing, according to the task feedback result, a denoising operation on the knowledge base includes:

when the task feedback result is correct, the first optimal vertex corresponding to the task feedback result is colored to be a first color;

when the task feedback result is wrong, the first optimal vertex corresponding to the task feedback result is colored to be a second color;

according to the consistent relation semantic constraint rule of the semantic constraint rule and the contradictory relation semantic constraint rule, other vertexes are colored into the first color or the second color;

and removing the knowledge subsets corresponding to the vertices of the first directed graph with the second color.

Optionally, the selecting, according to a crowdsourcing task selection algorithm, a first preset number of optimal knowledge subsets from the knowledge subsets includes:

generating a second closed semantic constraint rule according to the semantic constraint rule and the candidate knowledge subset;

taking each candidate knowledge subset as a vertex, and connecting the vertices according to the second closed semantic constraint rule to obtain a second directed graph;

selecting a first preset number of second optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the second optimal vertexes as optimal knowledge subsets;

the second optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.

Optionally, the selecting, according to a preset vertex selection algorithm, the first preset number of second optimal vertices from the vertices includes:

dividing the second directed graph into a first sub-graph containing all consistent relations and a second sub-graph containing all contradictory relations according to the consistent relations and the contradictory relations among all vertexes;

decomposing the first subgraph into a set of disjoint paths, wherein any two of the disjoint paths have no common vertex;

analyzing the disjoint paths based on a binary search method to obtain a second optimal vertex;

and taking the vertex with the maximum confidence coefficient and the vertex with zero in-degree in the second subgraph as the second optimal vertex.
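The binary search over a path of consistent relations can be illustrated with a minimal sketch. It assumes that an edge u → v means "if u is correct, then v is correct", so the answers along a path form a run of incorrect vertices followed by a run of correct ones. The function name `color_path_by_binary_search`, the vertex names `v1`..`v7`, and the `ask` callback that stands in for a crowd worker are all hypothetical; the source does not reproduce the pseudo-code of FIG. 5, so this is an interpretation rather than the patented procedure.

```python
def color_path_by_binary_search(path, ask):
    """Locate the first correct vertex on a path of consistent relations.

    path: vertices ordered so that each vertex, if correct, implies the next.
    ask:  callable returning True/False for one vertex (stands in for the crowd).
    Returns (colors, asked): correctness per vertex and the vertices queried.
    """
    lo, hi = 0, len(path)          # index of the first correct vertex is in [lo, hi]
    asked = []
    while lo < hi:
        mid = (lo + hi) // 2
        asked.append(path[mid])
        if ask(path[mid]):
            hi = mid               # path[mid] is correct, so every later vertex is too
        else:
            lo = mid + 1           # path[mid] is wrong, so every earlier vertex is too
    colors = {v: (i >= lo) for i, v in enumerate(path)}
    return colors, asked


# Hypothetical path of seven facts; the first correct one is v5.
path = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]
truth = {"v1": False, "v2": False, "v3": False, "v4": False,
         "v5": True, "v6": True, "v7": True}
colors, asked = color_path_by_binary_search(path, lambda v: truth[v])
print(asked)   # vertices that would be sent to the crowd
```

In this sketch three questions are enough to color all seven vertices, which is the budget saving that motivates selecting only vertices whose color cannot be inferred from others.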

Further, the present invention provides an apparatus for knowledge refining, the apparatus comprising:

an acquisition module for acquiring a candidate knowledge subset in an automatically extracted knowledge base;

the optimal selection module is used for selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;

the task implementation module is used for issuing crowdsourcing tasks based on the optimal knowledge subset to obtain task feedback results;

and the denoising module is used for carrying out denoising operation on the knowledge base according to the task feedback result.

Optionally, the optimal selection module includes:

the uncertainty calculation unit is used for calculating a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence coefficient of the used knowledge extraction algorithm;

the contradictory calculation unit is used for calculating a second numerical value representing the degree of the contradiction of the candidate knowledge subset according to the contradictory relation semantic constraint rules in the semantic constraint rules;

the evaluation unit is used for calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;

a selecting unit, configured to select the knowledge subsets of the first preset number from the candidate knowledge subsets according to the evaluation scores, and use the knowledge subsets as the optimal knowledge subsets;

wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.

Optionally, the apparatus further comprises:

a closed rule generating module, configured to generate a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;

the directed graph establishing module is used for taking each knowledge subset as a vertex and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;

the optimal vertex selection module is used for selecting a second preset number of first optimal vertices from the vertices according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertices as the optimal knowledge subsets;

Optionally, the denoising module includes:

the first coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a first color when the task feedback result is correct;

the second coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a second color when the task feedback result is wrong;

a third coloring unit, configured to color other vertices into the first color or the second color according to the consistent relationship semantic constraint rule of the semantic constraint rules and the contradictory relationship semantic constraint rule;

and the removing unit is used for removing the knowledge subset corresponding to the vertex of which the color on the first directed graph is the second color.

The invention provides a knowledge refining method and a device, which are characterized in that a candidate knowledge subset in an automatically extracted knowledge base is obtained; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback result. And selecting a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate a crowdsourcing task, and denoising the knowledge base according to the crowdsourcing task, namely refining the knowledge in the automatically extracted knowledge base based on a crowdsourcing platform, namely removing the noise of the knowledge base by utilizing manual marking. The correctness of the knowledge subsets which are difficult to identify by the traditional automatic refining algorithm can be judged by utilizing manual marking, so that the noise in the knowledge base is less, and the knowledge quality is higher. And a preset number of candidate knowledge subsets are selected to implement crowdsourcing tasks, so that the improvement of knowledge quality can be maximized under the condition of limited resources. Therefore, the method and the device are beneficial to improving the knowledge quality in the knowledge base with automatic extraction.

Drawings

In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of one embodiment of a knowledge refining method provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of a pseudo-code implementation of a task selection algorithm based on ranking;

FIG. 3 is a directed graph model corresponding to the candidate knowledge subsets of Table 1;

FIG. 4 is a pseudo-code diagram of a graph-based coloring algorithm;

FIG. 5 is a pseudo-code diagram of a single-path implementation algorithm of a path-based vertex selection algorithm;

FIG. 6 is a pseudo-code diagram of a multi-path implementation algorithm of a path-based vertex selection algorithm;

FIG. 7 is a pseudo-code diagram of a topology-based ordering vertex selection algorithm;

FIG. 8 is a pseudo-code diagram of a fault-tolerant coloring algorithm;

FIG. 9 is a block diagram showing a configuration of a knowledge refining apparatus according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic flow chart of a specific implementation of a knowledge refining method according to an embodiment of the present invention, the method including the following steps:

step 101: acquiring a candidate knowledge subset in an automatically extracted knowledge base;

It should be noted that a candidate knowledge subset may refer to a fact in the knowledge base, and a fact may be stored in the knowledge base as a triple, for example (China, capital, Beijing). The candidate knowledge subsets may be all of the facts in the knowledge base or only part of them, which is not limited herein.

Step 102: selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;

It is understood that a crowdsourcing task is a task distributed through a crowdsourcing platform and completed by human workers. That is, the crowdsourcing platform issues the crowdsourcing task and generates a corresponding question, for example "Is Italy a country?", and workers answer the question to obtain a corresponding reward.

If cost were not a concern, a crowdsourcing task could be generated for every candidate knowledge subset to refine the knowledge base. However, refining knowledge through crowdsourcing tasks has a cost, and an automatically extracted knowledge base is large. To maximize knowledge quality with limited human resources while keeping the cost as low as possible, a certain number of candidate knowledge subsets may be selected to generate crowdsourcing tasks, instead of generating tasks from all of them. Refining the knowledge base based on a crowdsourcing platform falls within the scope of the embodiments of the present invention.

The preset crowdsourcing task number may be a crowdsourcing budget, that is, a preset maximum number of crowdsourcing tasks that may be issued when refining the knowledge base. It can be set according to actual conditions. Requiring the first preset number to be less than or equal to the preset crowdsourcing task number means that the number of crowdsourcing tasks actually issued stays within the crowdsourcing budget.

It should be noted that the optimal knowledge subset may refer to a knowledge subset that is beneficial for improving the quality of the knowledge base.

The above crowdsourcing task selection algorithm is an algorithm for determining which of the candidate knowledge subsets are issued as crowdsourcing tasks. It is based on semantic constraint rules; that is, it uses the consistency and contradiction relationships expressed by the semantic constraint rules to reason about how effective it would be to issue a crowdsourcing task for each candidate knowledge subset.

The semantic constraint rules can be learned from training data or derived from ontology constraints. Ontology constraints can be viewed as axioms or rules of first-order predicate logic. For example, the entity category constraint "an athlete is a person" may be expressed by the rule $\mathrm{Athlete}(x) \Rightarrow \mathrm{Person}(x)$. Similarly, $\mathrm{City}(x) \Rightarrow \neg\mathrm{Person}(x)$ indicates that a city is not a person.

Ontology constraints are mainly of seven types: inclusion relationships between categories and between relations (e.g., birds are animals), mutual exclusion relationships between categories and between relations (e.g., humans are not places), inverse relationships (e.g., Team hasflag and Play For Team), the domain and range of a relation (e.g., the domain of the "mayor" relation is a city and its range is a person), the mapping cardinality of a relation (e.g., a country can have only one capital), antisymmetry (e.g., if place A is located in place B, then place B cannot be located in place A), and anti-reflexivity (e.g., place A cannot be located in place A itself).

To better represent the seven ontology constraints described above, corresponding symbols can be used. For example, Sub represents the inclusion relationship between categories and RSub the inclusion relationship between relations; Mut represents the mutual exclusion relationship between categories and RMut the mutual exclusion relationship between relations; Inv represents an inverse relationship; Dom represents the domain of a relation and Ran its range; Fun represents that the mapping cardinality of a relation is one-to-one; AntiSym represents the antisymmetry of a relation; AntiRef represents the anti-reflexivity of a relation.

Further, Cat(x, c) and Rel(x, y, r) may be used to represent category facts and relation facts, respectively: Cat(x, c) indicates that x is an entity of category c, and Rel(x, y, r) indicates that x and y have relation r. For example, facts may be expressed as Cat(Tiger Woods, Athlete) and Rel(Lakers, Basketball, TeamplaysSports).
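The two fact forms can be modeled directly as small record types. The following is a minimal sketch; the class layout, the field names, and the question wording produced by the hypothetical helper `to_question` are illustrative assumptions and are not taken from the source.

```python
from typing import NamedTuple


class Cat(NamedTuple):
    """Cat(x, c): entity x belongs to category c."""
    entity: str
    category: str


class Rel(NamedTuple):
    """Rel(x, y, r): entities x and y are linked by relation r."""
    subject: str
    obj: str
    relation: str


def to_question(fact) -> str:
    """Turn a fact into a yes/no question for a crowd worker (wording is illustrative)."""
    if isinstance(fact, Cat):
        return f"Is {fact.entity} an instance of the category '{fact.category}'?"
    return (f"Does the relation '{fact.relation}' hold between "
            f"{fact.subject} and {fact.obj}?")


facts = [
    Cat("Tiger Woods", "Athlete"),
    Rel("Lakers", "Basketball", "TeamplaysSports"),
]
questions = [to_question(f) for f in facts]
for q in questions:
    print(q)
```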

According to truth relationships of facts, we divide semantic constraints into two types: contradictory relationships and consistent relationships. Contradictory relationships may be used as a basis for inference, i.e., if a subset of candidate knowledge is true, it may be inferred that another subset of candidate knowledge must not be true. According to the ontology constraint, the following 9 semantic constraint rules of contradiction relationship can be obtained, which are respectively:

For the first contradictory relationship semantic constraint rule, Mut(c1, c2) means that categories c1 and c2 are mutually exclusive and Cat(x, c1) means that x belongs to c1, from which it can be concluded that x does not belong to c2. The meanings of the other contradictory relationship semantic constraint rules can be inferred in the same way and are not described further here.
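As an illustration of the first contradictory rule, the sketch below checks whether two category facts about the same entity contradict each other under a set of Mut constraints. It assumes that Mut is symmetric and that a category fact is stored as an `(entity, category)` pair; the function name `contradicts_by_mut` is hypothetical.

```python
def contradicts_by_mut(fact_a, fact_b, mut_pairs):
    """Return True if two Cat facts (entity, category) cannot both be true.

    Implements Mut(c1, c2) and Cat(x, c1) implies not Cat(x, c2),
    treating Mut as symmetric (an assumption of this sketch).
    """
    (x1, c1), (x2, c2) = fact_a, fact_b
    if x1 != x2:
        return False  # the rule only relates facts about the same entity
    return (c1, c2) in mut_pairs or (c2, c1) in mut_pairs


mut_pairs = {("country", "bird"), ("city", "bird")}

print(contradicts_by_mut(("Italy", "country"), ("Italy", "bird"), mut_pairs))
print(contradicts_by_mut(("Italy", "country"), ("Italy", "city"), mut_pairs))
```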

A consistent relationship means that if one candidate knowledge subset is true, it can be inferred that another candidate knowledge subset is necessarily true as well. According to the ontology constraints, 5 consistent relationship semantic constraint rules can be obtained, which are respectively as follows:

For semantic constraint rule 10, Sub(c1, c2) indicates that c1 belongs to c2 and Cat(x, c1) indicates that x belongs to c1, from which it can be deduced that x belongs to c2, namely Cat(x, c2). The meanings of the other consistent relationship semantic constraint rules can be inferred in the same way and are not described further here.
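Rule 10 can be applied repeatedly to derive every category fact implied by a known one. The sketch below is a simple forward-chaining loop under the assumption that category facts are `(entity, category)` pairs; the function name `infer_by_sub` and the category `living_thing` are hypothetical and used only for illustration.

```python
def infer_by_sub(cat_facts, sub_pairs):
    """Apply Sub(c1, c2) and Cat(x, c1) implies Cat(x, c2) until nothing new appears."""
    known = set(cat_facts)
    changed = True
    while changed:
        changed = False
        for (x, c) in list(known):
            for (c1, c2) in sub_pairs:
                if c == c1 and (x, c2) not in known:
                    known.add((x, c2))
                    changed = True
    return known


# Hypothetical inclusion constraints between categories.
sub_pairs = {("athlete", "person"), ("person", "living_thing")}
derived = infer_by_sub({("Tiger Woods", "athlete")}, sub_pairs)
print(sorted(derived))
```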

On the basis of the semantic constraint rules, a ranking-based crowdsourcing task selection algorithm can be generated: each candidate knowledge subset is scored using its uncertainty and contradictoriness to obtain an evaluation score, and a certain number of candidate knowledge subsets are selected according to the evaluation scores to implement crowdsourcing tasks. A graph-based crowdsourcing task selection algorithm can also be generated: the candidate knowledge subsets are modeled as a directed graph in which each candidate knowledge subset is a vertex and the vertices are connected according to the consistent and contradictory relationships among the candidate knowledge subsets; a certain number of optimal vertices are then selected from the directed graph, and the candidate knowledge subsets corresponding to the optimal vertices are used as the knowledge subsets for implementing crowdsourcing tasks.

It is to be understood that the above crowdsourcing task selection algorithm may be the ranking-based crowdsourcing task selection algorithm alone, the graph-based crowdsourcing task selection algorithm alone, or a combination of the two. In the combined case, after the ranking-based algorithm has selected a certain number of knowledge subsets, the consistent relationship semantic constraint rules are further used to infer the correctness of other knowledge subsets: the knowledge subsets selected by the ranking-based algorithm are modeled as a graph, and are then screened further with the graph-based crowdsourcing task selection algorithm.

Step 103: issuing a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;

specifically, the crowdsourcing platform may generate crowdsourcing tasks according to each selected optimal knowledge subset, and issue the crowdsourcing tasks, that is, generate corresponding questions according to the optimal knowledge subsets, and then receive answers of the tasks, that is, task feedback results.

Step 104: and carrying out denoising operation on the knowledge base according to the task feedback result.

It should be noted that the above-mentioned denoising operation may refer to removing some incorrect knowledge subsets, and in this case, the incorrect knowledge subsets or unreliable knowledge subsets are used as the noise of the knowledge base.

It will be appreciated that when the crowd-sourced task selection algorithm is a graph-based task selection algorithm, the de-noising process described above may be converted to a graph vertex coloring process, i.e., each vertex in the directed graph is colored with a corresponding color, e.g., vertices where the subset of candidate knowledge is incorrect may be colored red, while vertices where the subset of candidate knowledge is correct may be colored green. After all vertices are colored, vertices that are red in color can be removed, vertices that are green in color are retained, i.e., the wrong knowledge subset is removed, and the correct knowledge subset is retained.
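The coloring process can be sketched as a simple propagation over the graph. This is a minimal illustration and not the algorithm of FIG. 4: it assumes that a consistent edge (u, v) means "u correct implies v correct" (so a red v makes u red), that a contradictory edge means the two vertices cannot both be correct, and that the first color assigned to a vertex is kept. The function name `propagate_colors` and the facts `Cat(Italy, place)` and `Cat(Italy, animal)` are hypothetical.

```python
from collections import deque

GREEN, RED = "green", "red"


def propagate_colors(seeds, consistent, contradictory):
    """Spread crowd answers (seeds: vertex -> color) along the graph edges."""
    color = dict(seeds)
    queue = deque(seeds)

    def paint(vertex, c):
        if vertex not in color:        # keep the first color a vertex receives
            color[vertex] = c
            queue.append(vertex)

    while queue:
        u = queue.popleft()
        if color[u] == GREEN:
            for a, b in consistent:
                if a == u:
                    paint(b, GREEN)    # u correct, so what it implies is correct
            for a, b in contradictory:
                if a == u:
                    paint(b, RED)      # cannot both be correct
                elif b == u:
                    paint(a, RED)
        else:
            for a, b in consistent:
                if b == u:
                    paint(a, RED)      # contrapositive: v wrong, so u is wrong
    return color


vertices = ["Cat(Italy, country)", "Cat(Italy, place)",
            "Cat(Italy, bird)", "Cat(Italy, animal)"]
consistent = [("Cat(Italy, country)", "Cat(Italy, place)"),
              ("Cat(Italy, bird)", "Cat(Italy, animal)")]
contradictory = [("Cat(Italy, country)", "Cat(Italy, bird)")]

# One crowd answer: "Italy is a country" was judged correct.
colors = propagate_colors({"Cat(Italy, country)": GREEN}, consistent, contradictory)
kept = [v for v in vertices if colors.get(v) != RED]
print(colors)
print(kept)
```

In this sketch a vertex that never receives a color, such as `Cat(Italy, animal)`, is simply kept; how uncolored vertices are treated in practice is a design choice the sketch does not settle.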

The knowledge refining method provided by the embodiment of the invention obtains the candidate knowledge subsets in the automatically extracted knowledge base; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback result. And selecting a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate a crowdsourcing task, and denoising the knowledge base according to the crowdsourcing task, namely refining the knowledge in the automatically extracted knowledge base based on a crowdsourcing platform, namely removing the noise of the knowledge base by utilizing manual marking. The correctness of the knowledge subsets which are difficult to identify by the traditional automatic refining algorithm can be judged by utilizing manual marking, so that the noise in the knowledge base is less, and the knowledge quality is higher. And a preset number of candidate knowledge subsets are selected to implement crowdsourcing tasks, so that the improvement of knowledge quality can be maximized under the condition of limited resources. It can be seen that the method is beneficial to improving the knowledge quality in the automatically extracted knowledge base.

Since the crowdsourcing task selection algorithm may be any one of a ranking-based task selection algorithm and a graph-based task selection algorithm, in order to better describe a specific process of the algorithm, the ranking-based task selection algorithm will be first described in detail below.

The specific implementation process of the task selection algorithm based on the ranking can be seen in fig. 2, and fig. 2 is a schematic diagram of a pseudo code implementation of the task selection algorithm based on the ranking.

Therefore, on the basis of the above embodiment, the above process of selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to the crowdsourcing task selection algorithm may specifically be: calculating to obtain a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence of the used knowledge extraction algorithm; according to the contradictory relation semantic constraint rules in the semantic constraint rules, calculating to obtain a second numerical value representing the degree of contradiction of the candidate knowledge subsets; calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset; selecting the knowledge subsets with the first preset number from the candidate knowledge subsets according to the evaluation score, and taking the knowledge subsets as the optimal knowledge subsets; wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.

It should be noted that the preset threshold may be a numerical value used to define whether the knowledge subset is correct, and the candidate knowledge subset with the confidence level greater than the preset threshold may be considered to be correct, and the candidate knowledge subset with the confidence level less than the preset threshold may be considered to be incorrect.

The confidence is the confidence reported by the extraction algorithm. Confidence values differ between extraction algorithms, and a confidence is not necessarily a probability. When the confidence is not a probability, a probability value can be recomputed: a logistic regression model is learned from a labeled training data set and is then used to predict the probability that a candidate knowledge subset is true. Without training data, the confidence provided by the information extraction system can be used directly as the probability value.
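The recalibration step can be illustrated with a one-feature logistic regression fitted by plain gradient descent. This is a minimal sketch under stated assumptions: the only feature is the raw confidence, the training confidences and labels are made up for illustration, and the function names `fit_logistic` and `predict_probability` are hypothetical. A real system would more likely rely on an established library.

```python
import math


def fit_logistic(confidences, labels, lr=0.5, epochs=2000):
    """Fit p(true) = sigmoid(w * conf + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(confidences)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(confidences, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_probability(conf, w, b):
    """Map a raw extractor confidence to an estimated probability of being true."""
    return 1.0 / (1.0 + math.exp(-(w * conf + b)))


# Hypothetical labeled data: raw confidence and whether the fact was actually true.
train_conf = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
train_label = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(train_conf, train_label)
print(predict_probability(0.9, w, b), predict_probability(0.1, w, b))
```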

The first value may refer to a value characterizing the uncertainty of the candidate knowledge subset, which may be calculated by formula (15):

$$\mathrm{Uncertainty}(t_i) = 1 - \left|\mathrm{conf}_m(t_i) - T\right| \tag{15}$$

where $T$ is the preset threshold, $t_i$ is a candidate knowledge subset, $\mathrm{conf}_m(t_i)$ is the confidence of the candidate knowledge subset $t_i$, and $\mathrm{Uncertainty}(t_i)$ is the uncertainty value of the candidate knowledge subset $t_i$.
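Formula (15) translates into a one-line function. The sketch below assumes confidences and the threshold lie in [0, 1]; the threshold value 0.5 and the sample confidences are illustrative only.

```python
def uncertainty(confidence: float, threshold: float) -> float:
    """Formula (15): 1 - |conf_m(t_i) - T|; largest when confidence equals the threshold."""
    return 1.0 - abs(confidence - threshold)


T = 0.5  # illustrative threshold
for conf in (0.5, 0.6, 0.9):
    print(conf, uncertainty(conf, T))
```

A fact whose confidence sits exactly at the threshold gets the maximum uncertainty of 1, while a fact the extractor is very sure about (in either direction) gets a lower value.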

The uncertainty of a candidate knowledge subset measures how easily an algorithm can determine its correctness. The higher the uncertainty of a candidate knowledge subset, the harder it is for an algorithm to determine whether the candidate knowledge subset is correct.

The second value may refer to a value characterizing the contradictoriness of the candidate knowledge subset. The contradictoriness of a candidate knowledge subset measures its risk of error and its degree of importance, and may be calculated by formula (16), where n_j(t_i) is the number of closed rules of semantic constraint rule F_j that fact t_i violates, the constant 1 ensures that the score is greater than zero, and Contradictoriness(t_i) is the contradictoriness value of candidate knowledge subset t_i.

It will be appreciated that the more closed rules of the semantic constraint rules a candidate knowledge subset violates, the more likely it is to be wrong. For example, suppose there are semantic constraint rules Mut(country, bird) and Mut(city, bird), and candidate knowledge subsets Cat(Italy, country), Cat(Italy, city) and Cat(Italy, bird). The candidate knowledge subset Cat(Italy, bird) violates the closed rules of both semantic constraint rules Mut(country, bird) and Mut(city, bird), so n_1(Cat(Italy, bird)) = 2.
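The counting in the Italy example can be sketched in Python. This is only an illustration under stated assumptions: facts are modeled as (entity, category) pairs, a Mut(a, b) rule is treated as symmetric, and a closed rule is counted as violated when both of its facts are present among the candidates. The function name `count_violations` is hypothetical and does not appear in the original.

```python
from typing import List, Tuple

Fact = Tuple[str, str]  # (entity, category), standing for Cat(entity, category)


def count_violations(fact: Fact,
                     candidates: List[Fact],
                     mut_rules: List[Tuple[str, str]]) -> int:
    """Count the closed mutual-exclusion rules that `fact` violates.

    Assumption: Mut(a, b) yields a violated closed rule for Cat(e, a)
    whenever Cat(e, b) is also a candidate, and vice versa.
    """
    entity, category = fact
    count = 0
    for a, b in mut_rules:
        if category not in (a, b):
            continue  # this rule says nothing about the fact's category
        other = b if category == a else a
        if (entity, other) in candidates:
            count += 1
    return count


candidates = [("Italy", "country"), ("Italy", "city"), ("Italy", "bird")]
mut_rules = [("country", "bird"), ("city", "bird")]
print(count_violations(("Italy", "bird"), candidates, mut_rules))
```

Under these assumptions Cat(Italy, bird) is involved in two violated closed rules, while Cat(Italy, country) is involved in only one.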

The preset evaluation function may refer to a function for evaluating how effectively each candidate knowledge subset improves the quality of the knowledge base, based on the uncertainty and contradictoriness of the candidate knowledge subsets. It may be embodied as $\mathrm{Score}(t_i) = \mathrm{Uncertainty}(t_i) \times \mathrm{Contradictoriness}(t_i)$ (17).

According to the evaluation scores of the candidate knowledge subsets, a preset number of knowledge subsets can be selected as the optimal knowledge subset to implement the crowdsourcing task. Specifically, the candidate knowledge subsets may be sorted according to the evaluation scores, and may be sorted from high to low, or sorted from low to high. In general, the top K candidate knowledge subsets with the largest evaluation scores may be selected to perform the crowdsourcing task. Of course, the value of K should be smaller than or equal to the predetermined crowdsourcing task number.
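The ranking-based selection can be sketched as follows. This is a minimal illustration, not the patented implementation: the contradictoriness values are supplied directly as inputs, and the function names, the sample confidences and the threshold of 0.5 are all assumptions made for the example.

```python
from typing import Dict, List


def uncertainty(confidence: float, threshold: float) -> float:
    """Formula (15): Uncertainty(t_i) = 1 - |conf_m(t_i) - T|."""
    return 1.0 - abs(confidence - threshold)


def select_top_k(confidences: Dict[str, float],
                 contradictoriness: Dict[str, float],
                 threshold: float,
                 k: int) -> List[str]:
    """Score each candidate with formula (17) and return the k best.

    `contradictoriness` holds precomputed values (an assumption of this
    sketch); k plays the role of the first preset number.
    """
    scores = {
        fact: uncertainty(conf, threshold) * contradictoriness[fact]
        for fact, conf in confidences.items()
    }
    ranked = sorted(scores, key=lambda fact: scores[fact], reverse=True)
    return ranked[:k]


# Hypothetical candidates with extractor confidences and contradictoriness.
confidences = {"t1": 0.95, "t2": 0.55, "t3": 0.45, "t4": 0.10}
contradictoriness = {"t1": 1.0, "t2": 1.0, "t3": 3.0, "t4": 2.0}
print(select_top_k(confidences, contradictoriness, threshold=0.5, k=2))
```

In this example t3 is both close to the threshold and highly contradictory, so it ranks first; t4 ranks second because its contradictoriness outweighs its lower uncertainty.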

It can be seen that the knowledge refining method provided by the embodiment of the invention uses the ranking-based task selection algorithm to select the optimal knowledge subsets within the crowdsourcing budget and issue them as crowdsourcing tasks, so that limited human resources improve the knowledge quality of the automatically extracted knowledge base as much as possible.

Since the crowdsourcing task selection algorithm may be any one of a ranking-based task selection algorithm and a graph-based task selection algorithm, the graph-based task selection algorithm will be described in detail below in order to better describe a specific process of the algorithm.

Therefore, on the basis of the above embodiment, the above process of selecting the first preset number of optimal knowledge subsets from the knowledge subsets according to the crowdsourcing task selection algorithm may specifically be: generating a second closed semantic constraint rule according to the semantic constraint rule and the candidate knowledge subset; taking each candidate knowledge subset as a vertex, and connecting the vertices according to the second closed semantic constraint rule to obtain a second directed graph; selecting a first preset number of second optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the second optimal vertexes as optimal knowledge subsets; the second optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.

It should be noted that the second closed semantic constraint rule may be obtained according to the semantic rule and the given candidate knowledge subset. For example, given a semantic constraint rule such as Table 1-1, and a candidate knowledge subset such as Table 1-2, a closed semantic constraint rule such as Table 1-3 can be derived.

Ontological Relations
Domain(ceoof, ceo)	Ran(ceoof, company)
Sub(ceo, person)	RSub(ceoof, topmemberoforganization)
RSub(topmemberoforganization, worksfor)	RSub(topmemberoforganization, personleadsorganization)
RSub(worksfor, personbelongstoorganization)	RMut(topmemberoforganization, organizationleadbyperson)

Table 1-1

Table 1-2

Grounding Rules
t₁ => t₂	t₁ => t₃
t₁ => t₄	t₂ => t₇
t₄ => t₆	t₄ => t₇
t₆ => t₈	t₈ => t₇
t₅ => !t₆	t₆ => !t₅

Table 1-3

The second directed graph may refer to a graph model of the candidate knowledge subsets. Each vertex of the graph model corresponds to one candidate knowledge subset, and each connecting line between vertices represents a consistent relation or a contradictory relation between the corresponding candidate knowledge subsets. For example, referring to fig. 3, fig. 3 is the directed graph model corresponding to the candidate knowledge subsets in Table 1-2.

As shown in fig. 3, there are eight candidate knowledge subsets t₁, t₂, t₃, t₄, t₅, t₆, t₇ and t₈, which correspond to the eight vertices t₁, t₂, t₃, t₄, t₅, t₆, t₇ and t₈ of the graph. According to the closed semantic constraint rules in Table 1-3, the vertices can be connected by solid arrows or dashed arrows to obtain the corresponding directed graph model. The dashed arrows in the figure represent contradictory relations between two vertices, and the solid arrows represent consistent relations between two vertices.

After the graph model is built, the graph can be divided into a consistent subgraph and a contradictory subgraph according to the relations in the graph, denoted G_p and G_c respectively. Here G_p = (V, E_p) is the subgraph containing only the edges E_p, and G_c = (V, E_c) is the subgraph containing only the edges E_c, where E_p represents all consistent relations and E_c represents all contradictory relations. For example, for two candidate knowledge subsets t_i and t_j, if t_i => t_j, then there is a directed edge e ∈ E_p from t_i to t_j representing the consistent relation; if t_i => !t_j, then there is a directed edge e ∈ E_c from t_i to t_j representing the contradictory relation.

The candidate knowledge subsets are modeled into a graph, at this time, the knowledge refining problem can be converted into a graph coloring problem, namely, each vertex in the graph is colored according to a certain coloring strategy, and after all the vertices in the graph are colored, the correctness conditions of all the candidate knowledge subsets are also known.

In general, there are two possibilities for the candidate knowledge subset, correct and incorrect, and accordingly, there are two possibilities for the vertices in the graph. To better distinguish the cases of the individual vertices, the two possibilities of the vertices can be distinguished with different colors. For example, vertices corresponding to the correct candidate knowledge subset may be colored green and vertices corresponding to the incorrect candidate knowledge subset may be colored red.

Due to the fact that the knowledge base is large in size and the number of the included candidate knowledge subsets is large, all the candidate knowledge subsets cannot be subjected to crowdsourcing tasks, and the number of the crowdsourcing tasks can be reduced by means of an effective coloring framework. The coloring process may be implemented by a coloring algorithm, please refer to fig. 4, and fig. 4 is a pseudo code diagram of a graph-based coloring algorithm.

The graph coloring algorithm first constructs the graph model using the closed constraint rules (lines 1-2 of the pseudo code in the figure), and then selects an uncolored vertex t_i (line 3 of the pseudo code in the figure). The candidate knowledge subset t_i is then issued as a crowdsourcing task, and the graph is colored according to the crowdsourcing result. If t_i is correct, then not only can the vertex t_i be colored green, but all child vertices d_p(t_i) of t_i in G_p can also be colored green, and the child vertices of t_i and d_p(t_i) in G_c, together with the parent vertices of those child vertices in G_p, can be colored red (lines 6-8 of the pseudo code in the figure). That is, for t_i => t_j, it can be inferred that t_j is also correct; for t_i => !t_j, it can be inferred that t_j is incorrect. If t_i is incorrect, then not only can t_i be colored red, but all parent vertices of t_i in G_p can also be colored red (line 10 of the pseudo code in the figure). When all vertices have been colored, the coloring process ends. Otherwise, the coloring algorithm selects the next uncolored vertex and repeats the above steps.
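A single coloring step can be sketched in Python using the closed rules of Table 1-3. This is an illustrative reading of the coloring strategy rather than the pseudo code of fig. 4: the function names are hypothetical, inference is propagated transitively along G_p, and a vertex that already has a color keeps it (an assumption of this sketch).

```python
from collections import defaultdict

GREEN, RED = "green", "red"

# Closed rules from Table 1-3: (a, b) in IMPLIES means a => b (edge in G_p),
# (a, b) in CONTRADICTS means a => !b (edge in G_c).
IMPLIES = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
           ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]
CONTRADICTS = [("t5", "t6"), ("t6", "t5")]


def build(edges):
    """Return child and parent adjacency maps for a list of directed edges."""
    children, parents = defaultdict(set), defaultdict(set)
    for a, b in edges:
        children[a].add(b)
        parents[b].add(a)
    return children, parents


def reachable(start, neighbours):
    """Vertices reachable from `start` via `neighbours`, start included."""
    seen, stack = set(start), list(start)
    while stack:
        v = stack.pop()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def propagate(vertex, is_correct, colors):
    """Color `vertex` from a crowdsourced answer and infer further colors."""
    p_children, p_parents = build(IMPLIES)
    c_children, _ = build(CONTRADICTS)
    if is_correct:
        green = reachable({vertex}, p_children)   # t_i and its G_p descendants
        contradicted = set()
        for v in green:
            contradicted |= c_children[v]         # facts ruled out via G_c
        red = reachable(contradicted, p_parents)  # plus whatever implies them
    else:
        green = set()
        red = reachable({vertex}, p_parents)      # t_i and its G_p ancestors
    for v in green:
        colors.setdefault(v, GREEN)
    for v in red:
        colors.setdefault(v, RED)
    return colors


print(propagate("t6", True, {}))
print(propagate("t4", False, {}))
```

With these rules, a single "correct" answer for t6 colors four vertices, and a single "incorrect" answer for t4 also settles t1.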

It will be appreciated that in order to make reasonable use of the crowdsourcing budget, i.e. to keep the number of crowdsourcing tasks within the crowdsourcing budget, to reduce the crowdsourcing cost, as few candidate subsets of knowledge may be selected to perform the crowdsourcing tasks as possible, i.e. as few vertices as possible are selected from the multitude of vertices in the graph, so that all vertices in the graph may be colored.

To better describe the optimal vertices, boundary vertices may be introduced; the optimal vertices may be identical to the boundary vertices. A boundary vertex may be defined as follows: a vertex is a boundary vertex if its color cannot be inferred from the colors of other vertices.

The second optimal vertex may refer to a boundary vertex in the second directed graph. The optimal vertex selection method can be a vertex selection algorithm based on a path and a vertex selection algorithm based on topological sorting.

In some embodiments of the present invention, the process of selecting the first preset number of second optimal vertices from the vertices according to the preset vertex selection algorithm may specifically be: dividing the second directed graph into a first sub-graph containing all consistent relations and a second sub-graph containing all contradictory relations according to the consistent relations and the contradictory relations among all vertexes; decomposing the first subgraph into a set of disjoint paths, wherein any two of the disjoint paths have no common vertex; analyzing the disjoint paths based on a binary search method to obtain a second optimal vertex; and taking the vertex with the maximum confidence coefficient and the vertex with zero in-degree in the second subgraph as the second optimal vertex.

It will be appreciated that the first subgraph may refer to G_p mentioned above; for a detailed description, refer to the corresponding content above, which is not repeated here.

The second subgraph may refer to G_c mentioned above; for a detailed description, refer to the corresponding content above, which is not repeated here.

For convenience of subsequent description, V_B may be used to represent the set of boundary vertices in graph G, and V_B(G_p) and V_B(G_c) to represent the boundary vertices of subgraphs G_p and G_c, respectively.

It can be understood that the color of a vertex in V_B can be inferred neither from G_p nor from G_c. Because the colors of the vertices in V_B cannot be derived by inference, they all have to be labeled manually. However, since the truth values of the vertices in the graph are unknown, these boundary vertices cannot be identified in advance. To solve this problem, a G_p boundary vertex identification algorithm with a theoretical upper bound guarantee and a greedy-based G_c boundary vertex identification algorithm are proposed. At the same time, since G_c generally has a greater number of boundary vertices, that is, a boundary vertex in G_c has a more limited influence than one in G_p, the boundary vertices of G_p can be considered first to reduce the number of candidate vertices.

The specific process of the G_p boundary vertex identification algorithm may be as follows: select the middle vertex of a path, and determine the next step according to the crowdsourcing result of that middle vertex. If the candidate knowledge subset corresponding to the middle vertex is correct, the colors of its child vertices can be inferred, but the colors of its parent vertices cannot. Thus, the next crowdsourced vertex can be selected as the middle vertex between the current vertex and the start vertex of the path. If the candidate knowledge subset corresponding to the middle vertex is wrong, the colors of its parent vertices can be inferred, but the colors of its child vertices cannot. Thus, the next crowdsourced vertex can be selected as the middle vertex between the current vertex and the end vertex of the path. By iterating in this way, all boundary vertices can be found. Assuming the number of vertices on path P is |P|, the number of crowdsourced vertices is O(log |P|).
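The binary search over a single path can be sketched as follows. The sketch assumes the path is ordered so that each vertex implies the next one, and it models the crowd as a callable `ask` that returns whether a fact is correct; both the function name `color_path` and the oracle interface are hypothetical.

```python
from typing import Callable, Dict, List


def color_path(path: List[str], ask: Callable[[str], bool]) -> Dict[str, bool]:
    """Label every vertex on a G_p path with O(log |P|) crowd questions.

    `path` is ordered from start vertex to end vertex, each vertex
    implying the next. `ask(v)` stands in for one crowdsourcing task.
    """
    labels: Dict[str, bool] = {}
    lo, hi = 0, len(path) - 1            # path[lo..hi] is still uncolored
    while lo <= hi:
        mid = (lo + hi) // 2
        if ask(path[mid]):
            # Correct: the vertex and everything it implies are correct.
            for v in path[mid:hi + 1]:
                labels[v] = True
            hi = mid - 1                 # continue toward the start vertex
        else:
            # Incorrect: the vertex and everything implying it are incorrect.
            for v in path[lo:mid + 1]:
                labels[v] = False
            lo = mid + 1                 # continue toward the end vertex
    return labels


# Hypothetical ground truth used to simulate crowd answers.
truth = {"a": False, "b": False, "c": True, "d": True, "e": True}
asked: List[str] = []


def simulated_crowd(v: str) -> bool:
    asked.append(v)
    return truth[v]


labels = color_path(["a", "b", "c", "d", "e"], simulated_crowd)
print(labels, asked)
```

Here three questions are enough to label all five vertices, because each answer fixes the color of every vertex on one side of the queried vertex.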

The specific process of the greedy-based G_c boundary vertex identification algorithm may be as follows: greedily select the vertices with an in-degree of 0 in G_c and, in each set of mutually contradictory vertices (i.e., each clique of G_c), the vertex with the highest confidence.

The path-based vertex selection algorithm first computes the disjoint paths of G_p by maximum matching, then selects an optimal vertex on the longest of the disjoint paths by binary search to execute a crowdsourcing task, and colors the vertices in the graph using the coloring strategy; the colored vertices are then removed from the graph and the above steps are repeated. When no path has a length greater than 1, an optimal vertex is selected according to G_c to execute the crowdsourcing task. For the specific implementation process, refer to the algorithm in fig. 5; fig. 5 is a pseudo code diagram of the single-path implementation of the path-based vertex selection algorithm.

It will be appreciated that the graph G_p may be decomposed into a set of disjoint paths (i.e., no two paths have a vertex in common). Since the maximum length of a path is |V|, the number of crowdsourced vertices is O(β log |V|), where β is the number of disjoint paths.

To find the disjoint paths, graph G_p may be converted into a bipartite graph with two vertex sets V₁^b and V₂^b, each a copy of V. If there is an edge (v₁, v₂) ∈ E_p, then there is also an edge between v₁ ∈ V₁^b and v₂ ∈ V₂^b.

A maximum matching is a maximum set of edges in which no two edges share a vertex in V₁^b or in V₂^b; that is, for any two edges (v, v') and (u, u') in the maximum matching, v ≠ u and v' ≠ u'.

For convenience of description, let M denote the maximum matching, Y₁ the set of first vertices of the edges in M, and Y₂ the set of second vertices of the edges in M. The vertices that do not appear in Y₂ form the set of vertices without in-degree, and each of them may be taken as the first vertex of a path. For each such vertex v, if there is an edge (v, v') in M, let v' be the second vertex of the path; then check whether v' has an edge (v', v'') in M. By iterating in this way, the path starting at v can be found. Paths computed by maximum matching are disjoint, complete and minimal.
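The decomposition into disjoint paths can be sketched with a standard augmenting-path bipartite matching. The original does not say which matching algorithm is used, so the augmenting-path search below is an assumption, as are the function and variable names; the example graph uses the consistent relations of Table 1-3.

```python
from typing import Dict, List, Set, Tuple


def disjoint_paths(vertices: List[str],
                   edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Split an acyclic G_p into vertex-disjoint paths via maximum matching."""
    succ: Dict[str, List[str]] = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)

    # match_to[w] = v means the matched edge (v, w): v on the left side,
    # w on the right side of the bipartite graph.
    match_to: Dict[str, str] = {}

    def try_augment(v: str, visited: Set[str]) -> bool:
        for w in succ[v]:
            if w in visited:
                continue
            visited.add(w)
            if w not in match_to or try_augment(match_to[w], visited):
                match_to[w] = v
                return True
        return False

    for v in vertices:
        try_augment(v, set())

    next_of = {v: w for w, v in match_to.items()}   # matched edge v -> w
    paths: List[List[str]] = []
    for v in vertices:
        if v in match_to:          # has an incoming matched edge: not a start
            continue
        path = [v]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths


vertices = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
edges = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
         ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]
paths = disjoint_paths(vertices, edges)
print(paths)
```

Each vertex lands on exactly one path, and the number of paths equals the number of vertices minus the size of the matching, which is why a maximum matching gives the fewest paths.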

However, the path-based vertex selection algorithm described above can publish only a single candidate knowledge subset to the crowdsourcing platform at a time, which may cause significant latency. To overcome this drawback, the path-based algorithm may be made to support parallel processing, so that multiple vertices are selected in each iteration and crowdsourced at the same time. The specific implementation process can be realized by the algorithm shown in fig. 6; fig. 6 is a pseudo code diagram of the multi-path implementation of the path-based vertex selection algorithm. The algorithm first computes the β disjoint paths of G_p, selects an optimal vertex from each path using the optimal vertex selection strategy, and publishes the candidate facts corresponding to the selected vertices to the crowdsourcing platform simultaneously. It then colors the vertices in the graph according to the crowdsourcing answers, removes the colored vertices, and repeats these steps until all vertices are colored.

The multi-path concurrent algorithm may produce conflicts during coloring. For example, assume t_i is colored green and t_j is colored red. If t_i => t and t => t_j, then a conflict arises when coloring t, since reasoning from t_i yields that t is green while reasoning from t_j yields that t is red. A majority voting mechanism can be used to resolve this conflict; when two colors receive the same number of votes, one of the two is chosen at random.
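The majority vote with a random tie-break can be sketched in a few lines. The function name `resolve_color` and the representation of votes as a list of color strings are assumptions of this sketch; it expects at least one vote.

```python
import random
from collections import Counter
from typing import List, Optional


def resolve_color(votes: List[str],
                  rng: Optional[random.Random] = None) -> str:
    """Return the color with the most votes; break ties at random.

    Each entry of `votes` is the color inferred for the same vertex from
    one already-colored neighbour. `votes` must not be empty.
    """
    rng = rng or random.Random()
    counts = Counter(votes)
    best = max(counts.values())
    winners = sorted(c for c, n in counts.items() if n == best)
    return winners[0] if len(winners) == 1 else rng.choice(winners)


print(resolve_color(["green", "red", "green"]))
```

Passing a seeded `random.Random` instance makes the tie-break reproducible, which can help when replaying a coloring run.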

Since the path-based vertex selection algorithm needs to compute a maximum matching, it is slow in practical applications on a large-scale knowledge base. To solve this problem, a vertex selection algorithm based on topological sorting is proposed. The algorithm first finds the set of vertices with an in-degree of 0, denoted L₁; then it deletes these vertices from the graph and finds the next set of vertices with an in-degree of 0, denoted L₂; these steps are repeated until all vertices are deleted. The in-degree of every vertex in a subset L_i is 0 at the time the subset is selected, so each L_i can be viewed as a set of independent vertices.
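The layering step can be sketched as a repeated removal of in-degree-0 vertices. The function name `topological_layers` is hypothetical, the graph is assumed to be acyclic, and the example again uses the consistent relations of Table 1-3.

```python
from typing import Dict, List, Tuple


def topological_layers(vertices: List[str],
                       edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Peel off in-degree-0 vertices layer by layer: L1, L2, ..."""
    indegree: Dict[str, int] = {v: 0 for v in vertices}
    succ: Dict[str, List[str]] = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1

    layers: List[List[str]] = []
    current = [v for v in vertices if indegree[v] == 0]
    while current:
        layers.append(current)
        following: List[str] = []
        for v in current:
            for w in succ[v]:
                indegree[w] -= 1          # "delete" v from the graph
                if indegree[w] == 0:
                    following.append(w)
        current = following
    return layers


vertices = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
edges = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
         ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]
layers = topological_layers(vertices, edges)
print(layers)
```

Each pass touches every edge at most once overall, so the layering runs in time linear in the size of the graph, in contrast to the matching-based decomposition.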

The specific implementation process of the vertex selection algorithm based on topological sorting may be as shown in fig. 7; fig. 7 is a pseudo code diagram of the vertex selection algorithm based on topological sorting. The algorithm computes the topologically sorted vertex sets L₁, L₂, …, L_|L| of G_p and crowdsources the vertices of the middle layer in parallel. When |L| is less than or equal to 1, a crowdsourced vertex is selected according to G_c. G is then colored according to the crowdsourced answers and the coloring strategy, and the colored vertices are removed. The above steps are repeated until all vertices have been colored.

It can be seen that the knowledge refining method provided by the embodiment of the invention utilizes the task selection algorithm based on the graph to reason the correctness of the relevant candidate knowledge subsets through the semantic constraint rule, thereby further reducing the number of candidate knowledge subsets for implementing the crowdsourcing task and further reducing the cost of knowledge refining. Meanwhile, by selecting the optimal vertexes as few as possible, namely selecting the candidate knowledge subsets which can be reduced as soon as possible to implement the crowdsourcing task, the minimum human resources can be utilized, and the improvement of the knowledge quality in the knowledge base is maximized.

Since the crowdsourcing task selection algorithm may be any one of a task selection algorithm based on ranking and a task selection algorithm based on a graph, or the task selection algorithm based on ranking and the task selection algorithm based on a graph may be combined, in order to better introduce a specific process of the algorithm, the crowdsourcing task selection algorithm after the task selection algorithm based on ranking and the task selection algorithm based on a graph are combined will be introduced below.

Therefore, on the basis of the above-described embodiment of the ranking-based task selection algorithm, after selecting the first preset number of knowledge subsets from the candidate knowledge subsets according to the evaluation scores, the method may further include: generating a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset; taking each knowledge subset as a vertex, and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph; selecting a second preset number of first optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertexes as optimal knowledge subsets; the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.

It should be noted that, in view of the fact that the graph-based crowdsourcing task selection algorithm can perform inference by using the consistent relation semantic constraint rule and the contradictory relation semantic constraint rule to determine the correctness of some optimal knowledge subsets, the graph-based crowdsourcing task selection algorithm can be used to perform inference judgment on the selected optimal knowledge subset after the optimal knowledge subset is selected by using the ranking-based crowdsourcing task selection algorithm, so as to further reduce crowdsourcing tasks and reduce cost.

At this time, the knowledge subsets selected by the sorting-based crowdsourcing task selection algorithm can be modeled, that is, each selected knowledge subset is used as a vertex, and then the correctness of each knowledge subset is inferred according to the semantic constraint rule.

Obviously, after the graph model is built, the subsequent processing steps are similar to the individual graph-based crowdsourcing task selection algorithm, so the related processing steps can be referred to the introduction of the graph-based crowdsourcing task selection algorithm, and are not described herein again.

The knowledge refining problem can be converted into the graph coloring problem by the graph-based crowdsourcing task selection algorithm, so in some embodiments of the present invention, the process of performing denoising operation on the knowledge base according to the task feedback result may specifically be: when the task feedback result is correct, the first optimal vertex corresponding to the task feedback result is colored to be a first color; when the task feedback result is wrong, the first optimal vertex corresponding to the task feedback result is colored to be a second color; according to the consistent relation semantic constraint rule of the semantic constraint rule and the contradictory relation semantic constraint rule, other vertexes are colored into the first color or the second color; and removing the knowledge subsets corresponding to the vertices of the first directed graph with the second color.

The first color and the second color may be arbitrarily set, and for example, the first color may be set to green, and the second color may be set to red.

The graph-based crowdsourcing task selection algorithm may produce errors. One kind of error is caused by a worker's mistake: for example, if some candidate knowledge subset t_i is incorrect but a worker marks it as correct, the error is due to the worker. Another kind of error is propagated through inference rules: for example, if another candidate knowledge subset t_j that contradicts t_i is actually correct, but the graph-based crowdsourcing task selection algorithm labels t_j as incorrect according to the inference rule, the error has been propagated through the inference rules.

To tolerate worker errors, each candidate knowledge subset may be assigned to multiple workers, and the answers of the multiple workers may then be combined to derive a confidence level for the worker answers of a candidate knowledge subset. For example, assume that each candidate knowledge subset is assigned to z workers. If y workers have given the same answer (e.g., "correct") and z-y workers have given another answer (e.g., "incorrect"), then a confidence level for the worker answers can be derived.

To overcome the errors propagated through inference rules, logistic regression models can be used to determine the uncertain vertex colors. Generally, if the confidence of worker answers for a candidate subset of knowledge is greater than a certain threshold (e.g., greater than 0.8), then the vertices in the graph may be colored using a coloring policy. Otherwise, it is colored first to a third color (e.g., blue) and the vertex is not utilized to color other vertices; the other colored vertices are then used as true values and used to color the vertices of a third color (i.e., indeterminate vertices). In particular, a logistic regression model of the confidence of the candidate knowledge subsets (the confidence provided by the information extraction system) may be trained using the colored vertices and used to predict the color of the uncertain vertices. A specific fault-tolerant shading algorithm is shown in fig. 8, and fig. 8 is a pseudo code diagram of the fault-tolerant shading algorithm. The algorithm only colors other vertices (line 5 code in the figure) using a coloring strategy when the vertices have high-confidence answers, and finally colors the uncertain vertices using a logistic regression model (lines 7-16 code in the figure).
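The fault-tolerant step can be sketched with a small one-feature logistic regression written in plain Python. This is a simplified illustration, not the algorithm of fig. 8: it trains only on vertices with high-confidence worker answers rather than on all colored vertices, the worker-answer confidences are supplied as inputs, and the function names, learning rate and sample values are assumptions. It expects at least one trusted vertex of each color.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples: List[Tuple[float, int]],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[float, float]:
    """Fit p(correct | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def fault_tolerant_colors(worker_answer: Dict[str, bool],
                          worker_conf: Dict[str, float],
                          extractor_conf: Dict[str, float],
                          threshold: float = 0.8) -> Dict[str, bool]:
    """Keep high-confidence worker answers; predict the uncertain ones."""
    trusted = {v: a for v, a in worker_answer.items()
               if worker_conf[v] > threshold}
    w, b = fit_logistic([(extractor_conf[v], int(a))
                         for v, a in trusted.items()])
    colors = dict(trusted)
    for v in worker_answer:
        if v not in trusted:              # the "third color" vertices
            colors[v] = sigmoid(w * extractor_conf[v] + b) >= 0.5
    return colors


# Hypothetical data: t5 and t6 have low-confidence worker answers.
worker_answer = {"t1": True, "t2": True, "t3": False,
                 "t4": False, "t5": False, "t6": True}
worker_conf = {"t1": 1.0, "t2": 0.9, "t3": 1.0,
               "t4": 0.9, "t5": 0.6, "t6": 0.6}
extractor_conf = {"t1": 0.9, "t2": 0.8, "t3": 0.2,
                  "t4": 0.3, "t5": 0.85, "t6": 0.15}
colors = fault_tolerant_colors(worker_answer, worker_conf, extractor_conf)
print(colors)
```

In this example the uncertain worker answers for t5 and t6 are overridden by the model, because the trusted vertices show that high extractor confidence goes with correct facts.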

It can be seen that the knowledge refining method provided by the embodiment of the invention further uses the graph-based crowdsourcing task selection algorithm to reason over the knowledge subsets selected by the ranking-based crowdsourcing task selection algorithm, so as to further reduce the crowdsourcing tasks and reduce the cost. Meanwhile, a fault-tolerant coloring technique is provided, which improves the accuracy of graph coloring.

The knowledge refining apparatus provided by the embodiment of the present invention is described below, and the knowledge refining apparatus described below and the knowledge refining method described above may be referred to in correspondence with each other.

Fig. 9 is a block diagram of a knowledge refining apparatus according to an embodiment of the present invention, and with reference to fig. 9, the knowledge refining apparatus may include:

an obtaining module 901, configured to obtain a candidate knowledge subset in an automatically extracted knowledge base;

an optimal selection module 902, configured to select, according to a crowdsourcing task selection algorithm, a first preset number of optimal knowledge subsets from the candidate knowledge subsets, where the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to a preset crowdsourcing task number;

a task implementation module 903, configured to issue a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;

and the denoising module 904 is configured to perform denoising operation on the knowledge base according to the task feedback result.

Optionally, the optimal selection module includes:

Optionally, the method further comprises:

Optionally, the denoising module includes:

The knowledge refining device provided by the embodiment of the invention selects a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate crowdsourcing tasks, and denoises the knowledge base according to the results of the crowdsourcing tasks. In other words, the knowledge in the automatically extracted knowledge base is refined on a crowdsourcing platform, with manual labeling used to remove noise from the knowledge base. Manual labeling can judge the correctness of knowledge subsets that traditional automatic refining algorithms find difficult to identify, so that the knowledge base contains less noise and its knowledge quality is higher. Because only a preset number of candidate knowledge subsets are selected for crowdsourcing tasks, the improvement of knowledge quality can be maximized under limited resources. It can be seen that the device is advantageous for improving the knowledge quality of the automatically extracted knowledge base.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The method and apparatus for refining knowledge provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method of knowledge refining, comprising:

acquiring a candidate knowledge subset in an automatically extracted knowledge base; the knowledge base is constructed by Web data in an information extraction system;

selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; the optimal knowledge subset refers to a knowledge subset capable of improving the quality of a knowledge base;

2. The method of claim 1, wherein said selecting a first preset number of optimal knowledge subsets from said candidate knowledge subsets according to a crowdsourcing task selection algorithm comprises:

3. The method of claim 2, wherein after said selecting said first predetermined number of subsets of knowledge from said candidate subsets of knowledge based on said evaluation score further comprises:

4. The method of claim 3, wherein said denoising the knowledge base according to the task feedback result comprises:

5. The method of claim 1, wherein said picking a first preset number of optimal knowledge subsets from said knowledge subsets according to a crowdsourcing task selection algorithm comprises:

6. The method of claim 5, wherein said selecting said first predetermined number of second optimal vertices from said vertices according to a predetermined vertex selection algorithm comprises:

7. An apparatus for knowledge refining, comprising:

an acquisition module for acquiring a candidate knowledge subset in an automatically extracted knowledge base; the knowledge base is constructed by Web data in an information extraction system;

the optimal selection module is used for selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; the optimal knowledge subset refers to a knowledge subset capable of improving the quality of a knowledge base;

8. The apparatus of claim 7, wherein the optimal selection module comprises:

9. The apparatus of claim 8, further comprising:

10. The apparatus of claim 9, wherein the denoising module comprises: