CN106951963B - Knowledge refining method and device - Google Patents

Info

Publication number
CN106951963B
Authority
CN
China
Prior art keywords
knowledge
vertex
optimal
subsets
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710197975.1A
Other languages
Chinese (zh)
Other versions
CN106951963A (en
Inventor
赵朋朋
李春华
许佳捷
崔志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Chuhui Intelligent Technology Co.,Ltd.
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201710197975.1A
Publication of CN106951963A
Application granted
Publication of CN106951963B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a knowledge refining method and a knowledge refining device. The method comprises: obtaining candidate knowledge subsets in an automatically extracted knowledge base; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is based on semantic constraint rules and the first preset number is less than or equal to a preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback results. In other words, the knowledge in the automatically extracted knowledge base is refined on a crowdsourcing platform: noise is removed by manual labeling, so that the knowledge quality of the knowledge base is higher. Because only a preset number of candidate knowledge subsets are selected for crowdsourcing tasks, the improvement in knowledge quality is maximized under limited resources. The method and the device therefore help improve the knowledge quality of an automatically extracted knowledge base.

Description

Knowledge refining method and device
Technical Field
The invention relates to the field of machine learning, in particular to a knowledge refining method and a knowledge refining device.
Background
In recent years, machine learning techniques and natural language processing techniques have been applied to many information extraction systems. The information extraction system can automatically extract knowledge from massive Web data to construct a knowledge base.
The knowledge base formed by automatic extraction contains a large number of entities and entity relations, but because of the limitations of the data sources and extraction algorithms used by an extraction system, the knowledge base is often noisy and unreliable. To improve the knowledge quality of the knowledge base, i.e. to remove noise from it, knowledge refinement algorithms may be used.
However, because the knowledge base is large, the information extraction system generally uses simple heuristic rules to reason about the uncertainty and contradiction of the knowledge in order to reduce noise. Furthermore, the knowledge base contains facts whose correctness is difficult for such an algorithm to judge, so the processing capacity and precision of the algorithm are very limited. As a result, considerable noise remains in the knowledge base, its reliability and dependability are low, and its knowledge quality is low. In summary, how to improve the knowledge quality of an automatically extracted knowledge base is an urgent problem to be solved in the art.
Disclosure of Invention
The invention aims to provide a knowledge refining method and a knowledge refining device, and aims to solve the problem that the knowledge quality in an automatically extracted knowledge base is low in the prior art.
In order to solve the technical problem, the invention provides a knowledge refining method, which comprises the following steps:
acquiring a candidate knowledge subset in an automatically extracted knowledge base;
selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;
issuing a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;
and carrying out denoising operation on the knowledge base according to the task feedback result.
Optionally, the selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm includes:
calculating to obtain a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence of the used knowledge extraction algorithm;
according to the contradictory relation semantic constraint rules in the semantic constraint rules, calculating to obtain a second numerical value representing the degree of contradiction of the candidate knowledge subsets;
calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;
selecting the knowledge subsets with the first preset number from the candidate knowledge subsets according to the evaluation score, and taking the knowledge subsets as the optimal knowledge subsets;
wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.
Optionally, after selecting the first preset number of knowledge subsets from the candidate knowledge subsets according to the evaluation score, the method further includes:
generating a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;
taking each knowledge subset as a vertex, and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;
selecting a second preset number of first optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertexes as optimal knowledge subsets;
the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
Optionally, the performing, according to the task feedback result, a denoising operation on the knowledge base includes:
when the task feedback result is correct, the first optimal vertex corresponding to the task feedback result is colored to be a first color;
when the task feedback result is wrong, the first optimal vertex corresponding to the task feedback result is colored to be a second color;
according to the consistent relation semantic constraint rule of the semantic constraint rule and the contradictory relation semantic constraint rule, other vertexes are colored into the first color or the second color;
and removing the knowledge subsets corresponding to the vertices of the first directed graph with the second color.
Optionally, the selecting, according to a crowdsourcing task selection algorithm, a first preset number of optimal knowledge subsets from the knowledge subsets includes:
generating a second closed semantic constraint rule according to the semantic constraint rule and the candidate knowledge subset;
taking each candidate knowledge subset as a vertex, and connecting the vertices according to the second closed semantic constraint rule to obtain a second directed graph;
selecting a first preset number of second optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the second optimal vertexes as optimal knowledge subsets;
the second optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
Optionally, the selecting, according to a preset vertex selection algorithm, the first preset number of second optimal vertices from the vertices includes:
dividing the second directed graph into a first sub-graph containing all consistent relations and a second sub-graph containing all contradictory relations according to the consistent relations and the contradictory relations among all vertexes;
decomposing the first subgraph into a set of disjoint paths, wherein any two of the disjoint paths have no common vertex;
analyzing the disjoint paths based on a binary search method to obtain a second optimal vertex;
and taking the vertex with the maximum confidence coefficient and the vertex with zero in-degree in the second subgraph as the second optimal vertex.
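As a rough illustration of the binary-search step, the sketch below assumes a single path of consistent relations v0 → v1 → ... along which truth propagates forward (if a vertex is true, every later vertex on the path is true). Under that assumption the labels along the path look like false, ..., false, true, ..., true, so the boundary can be located with about log2(n) crowd questions instead of n. The function name `find_boundary`, the `ask` callback, and the vertex names are hypothetical and are not taken from the patent's figures.

```python
from typing import Callable, Sequence


def find_boundary(path: Sequence[str], ask: Callable[[str], bool]) -> int:
    """Return the index of the first true vertex on the path (len(path) if none).

    Assumes truth is monotone along the path: once a vertex is true,
    every later vertex is true as well.
    """
    lo, hi = 0, len(path)
    while lo < hi:
        mid = (lo + hi) // 2
        if ask(path[mid]):   # crowd says "correct": boundary is at mid or earlier
            hi = mid
        else:                # crowd says "wrong": boundary is after mid
            lo = mid + 1
    return lo


# Toy path with hypothetical vertex names; only the last two facts are true.
path = ["f0", "f1", "f2", "f3", "f4"]
truth = {"f0": False, "f1": False, "f2": False, "f3": True, "f4": True}
asked = []


def ask(vertex: str) -> bool:
    """Stand-in for a crowdsourcing question; records which vertices were asked."""
    asked.append(vertex)
    return truth[vertex]


boundary = find_boundary(path, ask)
```

In this toy run only three of the five vertices are sent to the crowd; the colors of the remaining vertices follow from the monotonicity assumption.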
Further, the present invention provides an apparatus for knowledge refining, the apparatus comprising:
an acquisition module for acquiring a candidate knowledge subset in an automatically extracted knowledge base;
the optimal selection module is used for selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;
the task implementation module is used for issuing crowdsourcing tasks based on the optimal knowledge subset to obtain task feedback results;
and the denoising module is used for carrying out denoising operation on the knowledge base according to the task feedback result.
Optionally, the optimal selection module includes:
the uncertainty calculation unit is used for calculating a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence coefficient of the used knowledge extraction algorithm;
the contradictory calculation unit is used for calculating a second numerical value representing the degree of the contradiction of the candidate knowledge subset according to the contradictory relation semantic constraint rules in the semantic constraint rules;
the evaluation unit is used for calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;
a selecting unit, configured to select the knowledge subsets of the first preset number from the candidate knowledge subsets according to the evaluation scores, and use the knowledge subsets as the optimal knowledge subsets;
wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be correct.
Optionally, the apparatus further comprises:
a closed rule generating module, configured to generate a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;
the directed graph establishing module is used for taking each knowledge subset as a vertex and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;
the optimal vertex selection module is used for selecting a second preset number of first optimal vertices from the vertices according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertices as the optimal knowledge subsets;
the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
Optionally, the denoising module includes:
the first coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a first color when the task feedback result is correct;
the second coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a second color when the task feedback result is wrong;
a third coloring unit, configured to color other vertices into the first color or the second color according to the consistent relationship semantic constraint rule of the semantic constraint rules and the contradictory relationship semantic constraint rule;
and the removing unit is used for removing the knowledge subset corresponding to the vertex of which the color on the first directed graph is the second color.
The invention provides a knowledge refining method and a device, which are characterized in that a candidate knowledge subset in an automatically extracted knowledge base is obtained; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback result. And selecting a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate a crowdsourcing task, and denoising the knowledge base according to the crowdsourcing task, namely refining the knowledge in the automatically extracted knowledge base based on a crowdsourcing platform, namely removing the noise of the knowledge base by utilizing manual marking. The correctness of the knowledge subsets which are difficult to identify by the traditional automatic refining algorithm can be judged by utilizing manual marking, so that the noise in the knowledge base is less, and the knowledge quality is higher. And a preset number of candidate knowledge subsets are selected to implement crowdsourcing tasks, so that the improvement of knowledge quality can be maximized under the condition of limited resources. Therefore, the method and the device are beneficial to improving the knowledge quality in the knowledge base with automatic extraction.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of one embodiment of a knowledge refining method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a pseudo-code implementation of a task selection algorithm based on ranking;
FIG. 3 is a directed graph model corresponding to the candidate knowledge subsets of Table 1;
FIG. 4 is a pseudo-code diagram of a graph-based shading algorithm;
FIG. 5 is a pseudo-code diagram of a single-path implementation algorithm of a path-based vertex selection algorithm;
FIG. 6 is a pseudo-code diagram of a multi-path implementation algorithm of a path-based vertex selection algorithm;
FIG. 7 is a pseudo-code diagram of a topology-based ordering vertex selection algorithm;
FIG. 8 is a pseudo-code diagram of a fault tolerant shading algorithm;
FIG. 9 is a block diagram showing a configuration of a knowledge refining apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a specific implementation of a knowledge refining method according to an embodiment of the present invention, the method including the following steps:
Step 101: acquiring a candidate knowledge subset in an automatically extracted knowledge base;
It should be noted that a candidate knowledge subset may refer to a fact in the knowledge base, and the fact may be stored in the knowledge base as a triple, for example (China, capital, Beijing). The candidate knowledge subsets may cover all facts in the knowledge base or only part of them, which is not limited herein.
Step 102: selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number;
It is understood that a crowdsourcing task is a task distributed by a crowdsourcing platform and completed by human workers. That is, the crowdsourcing platform issues the crowdsourcing task and generates a corresponding question, for example "Is Italy a country?"; workers then answer the question and receive a corresponding reward.
Each candidate knowledge subset may be generated into a corresponding crowdsourcing task to refine the knowledge base, regardless of cost. However, considering that there is a certain cost for refining knowledge by using the crowdsourcing task and the scale of the knowledge base extracted automatically is large, in order to maximize the quality of the knowledge base and reduce the cost as much as possible, that is, to maximize the quality of the knowledge by using limited human resources, a certain number of candidate knowledge subsets may be selected to generate the crowdsourcing tasks instead of generating the crowdsourcing tasks by using all the candidate knowledge subsets. Obviously, it is within the scope of the embodiments of the present invention to refine the knowledge base based on the crowdsourcing platform.
The preset crowdsourcing task number may be a crowdsourcing budget, that is, a preset maximum number of crowdsourcing tasks that may be issued when refining the knowledge base. It can be set according to actual conditions. The requirement that the first preset number be less than or equal to the preset crowdsourcing task number means that the number of crowdsourcing tasks actually issued must stay within the crowdsourcing budget.
It should be noted that the optimal knowledge subset may refer to a knowledge subset that is beneficial for improving the quality of the knowledge base.
The above-mentioned crowdsourcing task selection algorithm is an algorithm for determining which of the candidate knowledge subsets are issued as crowdsourcing tasks. It is based on semantic constraint rules; that is, it uses the consistent and contradictory relations expressed by the semantic constraint rules to reason about how effective a crowdsourcing task on each candidate knowledge subset would be.
The semantic constraint rules can be learned from training data or derived from ontology constraints. Ontology constraints can be viewed as axioms or rules of first-order predicate logic. For example, the entity category constraint "an athlete is a person" may be expressed by the rule $\mathrm{Athlete}(x) \Rightarrow \mathrm{Person}(x)$. Similarly, $\mathrm{City}(x) \Rightarrow \neg\mathrm{Person}(x)$ indicates that a city is not a person.
Ontology constraints are mainly of seven types: containment relationships among categories and among relations (e.g., birds are animals); mutual-exclusion relationships among categories and among relations (e.g., humans are not places); inverse relationships (e.g., Team hasflag and Play For Team); the domain and range of a relation (e.g., the domain of the "city mayor" relation is a city and its range is a person); the mapping cardinality of a relation (e.g., a country can have only one capital); antisymmetry (e.g., if site A is located in site B, then site B cannot be located in site A); and anti-reflexivity (e.g., site A cannot be located in site A itself).
To better represent the seven ontology constraints described above, corresponding symbols can be used. For example, Sub represents the containment relationship of categories, and RSub represents the containment relationship of relations; Mut represents the mutual-exclusion relationship of categories, and RMut represents the mutual-exclusion relationship of relations; Inv denotes an inverse relationship; Dom represents the domain of a relation, and Ran represents its range; Fun represents that the mapping cardinality of the relation is one-to-one; AntiSym denotes the antisymmetry of a relation; AntiRef denotes the anti-reflexivity of a relation.
Further, Cat(x, c) and Rel(x, y, r) may be used to represent category and attribute relationships, respectively. Cat(x, c) indicates that x is an entity of category c, and Rel(x, y, r) indicates that x and y stand in relation r. For example, facts may be expressed as Cat(Tiger Woods, Athlete) and Rel(Lakers, Basketball, TeamplaysSports).
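The Cat and Rel predicates map naturally onto small immutable records. The following is a minimal sketch of one possible in-memory representation; the class layout and field names are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cat:
    """Cat(x, c): entity x belongs to category c."""
    entity: str
    category: str


@dataclass(frozen=True)
class Rel:
    """Rel(x, y, r): entities x and y stand in relation r."""
    subject: str
    obj: str
    relation: str


# The two example facts from the text, stored in a set (frozen dataclasses are hashable).
facts = {
    Cat("Tiger Woods", "Athlete"),
    Rel("Lakers", "Basketball", "TeamplaysSports"),
}
```

Making the records frozen lets them serve directly as set members or dictionary keys, which is convenient when facts later become graph vertices.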
According to truth relationships of facts, we divide semantic constraints into two types: contradictory relationships and consistent relationships. Contradictory relationships may be used as a basis for inference, i.e., if a subset of candidate knowledge is true, it may be inferred that another subset of candidate knowledge must not be true. According to the ontology constraint, the following 9 semantic constraint rules of contradiction relationship can be obtained, which are respectively:
(1) $\mathrm{Mut}(c_1, c_2) \wedge \mathrm{Cat}(x, c_1) \Rightarrow \neg\mathrm{Cat}(x, c_2)$
[Rules (2) to (9) appear in the source only as formula images BDA0001257872060000092 to BDA0001257872060000099 and are not reproduced in this text.]
For the first contradictory-relation semantic constraint rule, Mut(c1, c2) means that categories c1 and c2 are mutually exclusive and Cat(x, c1) means that x belongs to c1, from which it can be concluded that x does not belong to c2. The meanings of the other contradictory-relation semantic constraint rules can be inferred in the same way and are not described further here.
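To make the first contradictory rule concrete, the sketch below scans a list of category facts for pairs that violate a mutual-exclusion constraint. The ontology, the facts, and the function name `contradictory_pairs` are made up for illustration; only the rule "mutually exclusive categories cannot both hold for the same entity" comes from the text.

```python
from itertools import combinations

# Hypothetical ontology: unordered pairs of mutually exclusive categories, Mut(c1, c2).
mut = {frozenset({"City", "Person"})}

# Hypothetical extracted facts as (entity, category) pairs, i.e. Cat(x, c).
cat_facts = [("Paris", "City"), ("Paris", "Person"), ("Tiger Woods", "Athlete")]


def contradictory_pairs(facts, mut):
    """Return pairs of Cat facts about the same entity whose categories are mutually exclusive."""
    pairs = []
    for (x1, c1), (x2, c2) in combinations(facts, 2):
        if x1 == x2 and frozenset({c1, c2}) in mut:
            pairs.append(((x1, c1), (x2, c2)))
    return pairs


conflicts = contradictory_pairs(cat_facts, mut)
```

Each returned pair would correspond to a contradictory edge between two candidate facts: if one of them is confirmed as true, the other can be inferred to be false.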
A consistent relationship means that if one candidate knowledge subset is true, it can be inferred that another candidate knowledge subset is necessarily true as well. According to the ontology constraints, the following 5 consistent-relation semantic constraint rules can be obtained:
(10) $\mathrm{Sub}(c_1, c_2) \wedge \mathrm{Cat}(x, c_1) \Rightarrow \mathrm{Cat}(x, c_2)$
[Rules (11) to (14) appear in the source only as formula images BDA00012578720600000911 to BDA00012578720600000914 and are not reproduced in this text.]
For semantic constraint rule 10, Sub(c1, c2) indicates that category c1 is contained in category c2 and Cat(x, c1) indicates that x belongs to c1, from which it can be deduced that x belongs to c2, i.e. Cat(x, c2). The meanings of the other consistent-relation semantic constraint rules can be inferred in the same way and are not described further here.
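Rule 10 can be applied repeatedly to walk up a category hierarchy. The sketch below assumes, purely for illustration, that each category has at most one parent and that the hierarchy is stored as a child-to-parent dictionary; the categories and the function name `infer_categories` are hypothetical.

```python
# Hypothetical ontology: Sub(c1, c2) stored as child -> parent.
sub = {"Athlete": "Person", "Person": "Agent"}


def infer_categories(entity, category, sub):
    """Apply Sub(c1, c2) and Cat(x, c1) => Cat(x, c2) repeatedly up the hierarchy."""
    inferred = []
    seen = {category}                  # guards against accidental cycles in the ontology
    while category in sub and sub[category] not in seen:
        category = sub[category]
        seen.add(category)
        inferred.append((entity, category))
    return inferred


derived = infer_categories("Tiger Woods", "Athlete", sub)
```

Every derived fact is a consequence of the starting fact, so confirming the starting fact through the crowd settles the derived ones without further questions.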
On the basis of the semantic constraint rules, two crowdsourcing task selection algorithms can be built. The first is a ranking-based crowdsourcing task selection algorithm: each candidate knowledge subset is scored using its uncertainty and degree of contradiction to obtain an evaluation score, a certain number of candidate knowledge subsets are selected according to the evaluation scores, and crowdsourcing tasks are issued for them. The second is a graph-based crowdsourcing task selection algorithm: the candidate knowledge subsets are modeled as a directed graph in which each candidate knowledge subset is a vertex and the vertices are connected according to the consistent and contradictory relations among the candidate knowledge subsets; a certain number of optimal vertices are then selected from the directed graph, and the candidate knowledge subsets corresponding to the optimal vertices are used as the knowledge subsets for crowdsourcing tasks.
It is to be understood that the crowdsourcing task selection algorithm described above may be the ranking-based algorithm alone, the graph-based algorithm alone, or a combination of the two. In the combined case, after the ranking-based algorithm has selected a certain number of knowledge subsets, the consistent-relation semantic constraint rules are further used to infer the correctness of other knowledge subsets; that is, the knowledge subsets selected by the ranking-based algorithm are modeled as a graph and then screened further with the graph-based crowdsourcing task selection algorithm.
Step 103: issuing a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;
Specifically, the crowdsourcing platform may generate a crowdsourcing task for each selected optimal knowledge subset and issue it; that is, it generates a corresponding question from the optimal knowledge subset and then receives the answer to the task, i.e. the task feedback result.
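Turning a selected fact into a worker-facing question can be as simple as filling a template. The sketch below is one hypothetical way to do it; the tuple layout, the wording of the templates, and the function name `make_question` are assumptions, and a real crowdsourcing platform would have its own task format.

```python
def make_question(fact):
    """Turn a fact tuple into a yes/no question for crowd workers.

    ("cat", x, c)    -> "Is x a c?"
    ("rel", x, y, r) -> "Does the relation 'r' hold between x and y?"
    """
    if fact[0] == "cat":
        _, entity, category = fact
        return f"Is {entity} a {category}?"
    if fact[0] == "rel":
        _, subject, obj, relation = fact
        return f"Does the relation '{relation}' hold between {subject} and {obj}?"
    raise ValueError(f"unknown fact type: {fact[0]!r}")


question = make_question(("cat", "Italy", "country"))
```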
Step 104: and carrying out denoising operation on the knowledge base according to the task feedback result.
It should be noted that the above-mentioned denoising operation may refer to removing some incorrect knowledge subsets, and in this case, the incorrect knowledge subsets or unreliable knowledge subsets are used as the noise of the knowledge base.
It will be appreciated that when the crowd-sourced task selection algorithm is a graph-based task selection algorithm, the de-noising process described above may be converted to a graph vertex coloring process, i.e., each vertex in the directed graph is colored with a corresponding color, e.g., vertices where the subset of candidate knowledge is incorrect may be colored red, while vertices where the subset of candidate knowledge is correct may be colored green. After all vertices are colored, vertices that are red in color can be removed, vertices that are green in color are retained, i.e., the wrong knowledge subset is removed, and the correct knowledge subset is retained.
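The coloring process can be sketched as a breadth-first propagation from the crowd-labeled vertices. The code below is a simplified illustration rather than the algorithm of FIG. 4: it assumes a consistent edge (u, v) means "u true implies v true" and a contradictory edge (u, v) means "u true implies v false", it uses the contrapositives of both, and it does no conflict handling. Keeping vertices that remain uncolored is also an assumption of this sketch.

```python
from collections import deque

GREEN, RED = "green", "red"


def propagate(colors, consistent, contradictory):
    """Spread crowd-assigned colors through the graph.

    colors: dict vertex -> GREEN/RED for vertices already labeled by the crowd.
    consistent: list of (u, v) meaning "u true implies v true".
    contradictory: list of (u, v) meaning "u true implies v false".
    """
    colors = dict(colors)
    queue = deque(colors)
    while queue:
        u = queue.popleft()
        inferred = []
        for a, b in consistent:
            if a == u and colors[u] == GREEN:
                inferred.append((b, GREEN))  # u true => b true
            if b == u and colors[u] == RED:
                inferred.append((a, RED))    # b false => a false (contrapositive)
        for a, b in contradictory:
            if a == u and colors[u] == GREEN:
                inferred.append((b, RED))    # u true => b false
            if b == u and colors[u] == GREEN:
                inferred.append((a, RED))    # b true => a false (contrapositive)
        for v, color in inferred:
            if v not in colors:              # first inferred color wins; no conflict handling
                colors[v] = color
                queue.append(v)
    return colors


def denoise(vertices, colors):
    """Drop vertices colored red; keep green and still-uncolored ones (an assumption)."""
    return [v for v in vertices if colors.get(v) != RED]


# Toy graph with hypothetical vertices: A implies B, and B contradicts C.
final = propagate({"A": GREEN}, consistent=[("A", "B")], contradictory=[("B", "C")])
kept = denoise(["A", "B", "C", "D"], final)
```

In the toy graph a single crowd answer about A colors B green and C red, so one question settles three facts.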
The knowledge refining method provided by the embodiment of the invention obtains the candidate knowledge subsets in the automatically extracted knowledge base; selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; issuing crowdsourcing tasks based on the optimal knowledge subsets to obtain task feedback results; and denoising the knowledge base according to the task feedback result. And selecting a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate a crowdsourcing task, and denoising the knowledge base according to the crowdsourcing task, namely refining the knowledge in the automatically extracted knowledge base based on a crowdsourcing platform, namely removing the noise of the knowledge base by utilizing manual marking. The correctness of the knowledge subsets which are difficult to identify by the traditional automatic refining algorithm can be judged by utilizing manual marking, so that the noise in the knowledge base is less, and the knowledge quality is higher. And a preset number of candidate knowledge subsets are selected to implement crowdsourcing tasks, so that the improvement of knowledge quality can be maximized under the condition of limited resources. It can be seen that the method is beneficial to improving the knowledge quality in the automatically extracted knowledge base.
Since the crowdsourcing task selection algorithm may be any one of a ranking-based task selection algorithm and a graph-based task selection algorithm, in order to better describe a specific process of the algorithm, the ranking-based task selection algorithm will be first described in detail below.
The specific implementation process of the task selection algorithm based on the ranking can be seen in fig. 2, and fig. 2 is a schematic diagram of a pseudo code implementation of the task selection algorithm based on the ranking.
Therefore, on the basis of the above embodiment, the above process of selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to the crowdsourcing task selection algorithm may specifically be: calculating to obtain a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence of the used knowledge extraction algorithm; according to the contradictory relation semantic constraint rules in the semantic constraint rules, calculating to obtain a second numerical value representing the degree of contradiction of the candidate knowledge subsets; calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset; selecting the knowledge subsets with the first preset number from the candidate knowledge subsets according to the evaluation score, and taking the knowledge subsets as the optimal knowledge subsets; wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.
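The ranking step can be sketched as a top-k selection over per-candidate scores. The preset evaluation function is not given in this part of the text, so the weighted sum below, the weight `alpha`, and all numeric values are placeholders chosen for illustration; only the overall flow (score each candidate from its two values, keep the best ones, stay within the crowdsourcing budget) follows the description above.

```python
import heapq


def select_top_k(candidates, uncertainty, contradiction, k, budget, alpha=0.5):
    """Pick the k highest-scoring candidates, with k capped by the crowdsourcing budget.

    The weighted sum used as the evaluation function is an assumption of this sketch.
    """
    k = min(k, budget)

    def score(c):
        return alpha * uncertainty[c] + (1 - alpha) * contradiction[c]

    return heapq.nlargest(k, candidates, key=score)


# Hypothetical first and second values for three candidate knowledge subsets.
uncertainty = {"t1": 0.9, "t2": 0.4, "t3": 0.7}
contradiction = {"t1": 0.2, "t2": 0.8, "t3": 0.9}

selected = select_top_k(["t1", "t2", "t3"], uncertainty, contradiction, k=2, budget=5)
```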
It should be noted that the preset threshold may be a numerical value used to define whether the knowledge subset is correct, and the candidate knowledge subset with the confidence level greater than the preset threshold may be considered to be correct, and the candidate knowledge subset with the confidence level less than the preset threshold may be considered to be incorrect.
The confidence may refer to the confidence reported by the extraction algorithm. Confidence values differ between extraction algorithms, and a confidence is not necessarily equal to a probability value. When the confidence is not equal to a probability value, the probability value can be recalculated: a logistic regression model is learned from a labeled training data set, and the model is then used to predict the probability that a candidate knowledge subset is true. When no training data is available, the confidence provided by the information extraction system can be used directly as the probability value.
The first value may refer to a value characterizing the uncertainty of the candidate knowledge subset, which may be calculated by the formula $\mathrm{Uncertainty}(t_i) = 1 - |\mathrm{conf}_m(t_i) - T|$ (15), where $T$ is the preset threshold, $t_i$ is a candidate knowledge subset, $\mathrm{conf}_m(t_i)$ is the confidence of the candidate knowledge subset $t_i$, and $\mathrm{Uncertainty}(t_i)$ is the uncertainty value of $t_i$.
The uncertainty of a candidate knowledge subset measures how easily an algorithm can determine its correctness. The higher the uncertainty, the harder it is for an algorithm to determine whether the candidate knowledge subset is correct.
The second value may refer to a value characterizing the contradictoriness of the candidate knowledge subset. The contradictoriness of a candidate knowledge subset measures its risk of being wrong and its degree of importance. It may be calculated by formula (16):
Figure BDA0001257872060000121
where $n_j(t_i)$ is the number of closed rules of the semantic constraint rule $F_j$ that the fact $t_i$ violates, the 1 in the formula serves to ensure that the score is greater than zero, and $\mathrm{Contradictoriness}(t_i)$ is the contradictoriness value of the candidate knowledge subset $t_i$.
It will be appreciated that the more closed rules of the semantic constraint rules a candidate knowledge subset violates, the more likely it is to be wrong. For example, suppose there are semantic constraint rules Mut(country, bird) and Mut(city, bird), and candidate knowledge subsets Cat(Italy, country), Cat(Italy, city) and Cat(Italy, bird). The candidate knowledge subset Cat(Italy, bird) violates the closed rules of both semantic constraint rules Mut(country, bird) and Mut(city, bird), so n1(Cat(Italy, bird)) = 2.
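The counting in the example above can be sketched as follows. This is only an illustration under an assumed reading: a fact is taken to violate a grounded mutual-exclusion rule whenever another candidate fact assigns the same entity to the mutually exclusive category. The function name `count_mut_violations` and the tuple encoding of facts and rules are hypothetical and do not come from the original text.

```python
def count_mut_violations(fact, candidates, mut_rules):
    """Count grounded Mut rules that involve `fact` (assumed reading of n_j)."""
    entity, category = fact
    count = 0
    for a, b in mut_rules:
        if category not in (a, b):
            continue  # this Mut rule does not mention the fact's category
        other = b if category == a else a
        # The rule is grounded only if a candidate puts the entity in `other`.
        if (entity, other) in candidates:
            count += 1
    return count


# Facts Cat(entity, category) encoded as (entity, category) tuples.
candidates = {("Italy", "country"), ("Italy", "city"), ("Italy", "bird")}
# Mut(a, b) encoded as (a, b) tuples.
mut_rules = [("country", "bird"), ("city", "bird")]

n_bird = count_mut_violations(("Italy", "bird"), candidates, mut_rules)
n_country = count_mut_violations(("Italy", "country"), candidates, mut_rules)
```

Under this reading, Cat(Italy, bird) takes part in two grounded rules while Cat(Italy, country) takes part in one.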
The preset evaluation function may refer to a function that evaluates how effective each candidate knowledge subset is for improving the quality of the knowledge base, based on the uncertainty and contradictoriness of the candidate knowledge subsets. It may be embodied as $\mathrm{Score}(t_i) = \mathrm{Uncertainty}(t_i) \times \mathrm{Contradictoriness}(t_i)$ (17).
According to the evaluation scores of the candidate knowledge subsets, a preset number of knowledge subsets can be selected as the optimal knowledge subset to implement the crowdsourcing task. Specifically, the candidate knowledge subsets may be sorted according to the evaluation scores, and may be sorted from high to low, or sorted from low to high. In general, the top K candidate knowledge subsets with the largest evaluation scores may be selected to perform the crowdsourcing task. Of course, the value of K should be smaller than or equal to the predetermined crowdsourcing task number.
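The ranking-based selection described above can be sketched as follows. Formula (15) and formula (17) are taken from the text; the form of formula (16) is not recoverable from the text, so the sketch assumes the contradictoriness is the number of violated closed rules plus 1. The function names and the sample candidates are hypothetical.

```python
def uncertainty(confidence, threshold):
    """Formula (15): 1 - |conf_m(t_i) - T|."""
    return 1.0 - abs(confidence - threshold)


def score(confidence, threshold, violations):
    """Formula (17): uncertainty times contradictoriness."""
    # Assumption: formula (16) is modeled as (violated closed rules + 1).
    contradictoriness = violations + 1
    return uncertainty(confidence, threshold) * contradictoriness


def select_top_k(candidates, threshold, k):
    """Return the names of the k candidates with the highest scores."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c[1], threshold, c[2]),
        reverse=True,
    )
    return [name for name, _, _ in ranked[:k]]


# (name, extraction confidence, number of violated closed rules)
candidates = [("t1", 0.95, 0), ("t2", 0.55, 2), ("t3", 0.50, 0), ("t4", 0.10, 1)]
top2 = select_top_k(candidates, threshold=0.5, k=2)
```

Here K plays the role of the first preset number and should not exceed the crowdsourcing budget.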
It can be seen that the knowledge refining method provided by the embodiment of the invention selects the optimal knowledge subset within the range of the crowdsourcing budget by using the task selection algorithm based on the sequencing, implements the crowdsourcing task, and can maximally improve the knowledge quality in the automatically extracted knowledge base by using limited human resources.
Since the crowdsourcing task selection algorithm may be any one of a ranking-based task selection algorithm and a graph-based task selection algorithm, the graph-based task selection algorithm will be described in detail below in order to better describe a specific process of the algorithm.
Therefore, on the basis of the above embodiment, the above process of selecting the first preset number of optimal knowledge subsets from the knowledge subsets according to the crowdsourcing task selection algorithm may specifically be: generating a second closed semantic constraint rule according to the semantic constraint rule and the candidate knowledge subset; taking each candidate knowledge subset as a vertex, and connecting the vertices according to the second closed semantic constraint rule to obtain a second directed graph; selecting a first preset number of second optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the second optimal vertexes as optimal knowledge subsets; the second optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
It should be noted that the second closed semantic constraint rule may be obtained according to the semantic rule and the given candidate knowledge subset. For example, given a semantic constraint rule such as Table 1-1, and a candidate knowledge subset such as Table 1-2, a closed semantic constraint rule such as Table 1-3 can be derived.
Ontological Relations
Domain(ceoof,ceo)
Ran(ceoof,company)
Sub(ceo,person)
RSub(ceoof,topmemberoforganization)
RSub(topmemberoforganization,worksfor)
RSub(topmemberoforganization,personleadsorganization)
RSub(worksfor,personbelongstoorganization)
RMut(topmemberoforganization,organizationleadbyperson)
TABLE 1-1
Figure BDA0001257872060000141
TABLE 1-2
Grounding Rules
t1=>t2
t1=>t3
t1=>t4
t2=>t7
t4=>t6
t4=>t7
t6=>t8
t8=>t7
t5=>!t6
t6=>!t5
TABLE 1-3
The second directed graph may refer to a graph model of the candidate knowledge subsets, each vertex of the graph model corresponds to one candidate knowledge subset, and a connecting line between each vertex in the graph represents a consistent relationship or a contradictory relationship between each candidate knowledge subset. For example, referring to fig. 3, fig. 3 is a directed graph model corresponding to the candidate knowledge subsets in table 1.
As shown in fig. 3, there are eight candidate knowledge subsets t1, t2, t3, t4, t5, t6, t7 and t8, which correspond to the eight vertices t1 through t8. According to the closed semantic constraint rules in Table 1-3, the vertices can be connected by solid arrows or dashed arrows to obtain the corresponding directed graph model. The dashed arrows in the figure represent contradictory relationships between two vertices, and the solid arrows represent consistent relationships between two vertices.
After the graph model is built, the graph can be divided into a consistent subgraph and a contradictory subgraph according to the relationships in the graph, denoted $G_p$ and $G_c$ respectively. $G_p = (V, E_p)$ is the subgraph containing only the edges $E_p$, and $G_c = (V, E_c)$ is the subgraph containing only the edges $E_c$, where $E_p$ represents all consistent relationships and $E_c$ represents all contradictory relationships. For example, for two candidate knowledge subsets $t_i$ and $t_j$, if $t_i \Rightarrow t_j$, then there is a directed edge $e \in E_p$ from $t_i$ to $t_j$ representing the consistent relationship; if $t_i \Rightarrow \neg t_j$, then there is a directed edge $e \in E_c$ from $t_i$ to $t_j$ representing the contradictory relationship.
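As an illustration of this split, the grounding rules of Table 1-3 can be parsed into the two edge sets. This is a minimal sketch; the string encoding of the rules and the function name `build_subgraphs` are assumptions made for the example.

```python
# Grounding rules copied from Table 1-3; "!" marks a contradictory relation.
GROUNDING_RULES = [
    "t1=>t2", "t1=>t3", "t1=>t4", "t2=>t7", "t4=>t6",
    "t4=>t7", "t6=>t8", "t8=>t7", "t5=>!t6", "t6=>!t5",
]


def build_subgraphs(rules):
    """Return (V, E_p, E_c) for the consistent and contradictory subgraphs."""
    vertices, e_p, e_c = set(), set(), set()
    for rule in rules:
        head, tail = (part.strip() for part in rule.split("=>"))
        if tail.startswith("!"):
            tail = tail[1:]
            e_c.add((head, tail))   # t_i => !t_j: contradictory edge
        else:
            e_p.add((head, tail))   # t_i => t_j: consistent edge
        vertices.update((head, tail))
    return vertices, e_p, e_c


V, E_p, E_c = build_subgraphs(GROUNDING_RULES)
```

Both subgraphs share the same vertex set V, which matches the definitions of G_p and G_c given in the text.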
The candidate knowledge subsets are modeled into a graph, at this time, the knowledge refining problem can be converted into a graph coloring problem, namely, each vertex in the graph is colored according to a certain coloring strategy, and after all the vertices in the graph are colored, the correctness conditions of all the candidate knowledge subsets are also known.
In general, there are two possibilities for the candidate knowledge subset, correct and incorrect, and accordingly, there are two possibilities for the vertices in the graph. To better distinguish the cases of the individual vertices, the two possibilities of the vertices can be distinguished with different colors. For example, vertices corresponding to the correct candidate knowledge subset may be colored green and vertices corresponding to the incorrect candidate knowledge subset may be colored red.
Due to the fact that the knowledge base is large in size and the number of the included candidate knowledge subsets is large, all the candidate knowledge subsets cannot be subjected to crowdsourcing tasks, and the number of the crowdsourcing tasks can be reduced by means of an effective coloring framework. The coloring process may be implemented by a coloring algorithm, please refer to fig. 4, and fig. 4 is a pseudo code diagram of a graph-based coloring algorithm.
The graph coloring algorithm first constructs a graph model using the closed constraint rules (lines 1-2 of the pseudo code in the figure), and then selects an uncolored vertex $t_i$ (line 3 of the pseudo code). The candidate knowledge subset $t_i$ is then submitted as a crowdsourcing task, and the vertex is colored according to the result. If $t_i$ is correct, then not only is the vertex $t_i$ colored green, but all child vertices $d_p(t_i)$ of $t_i$ in $G_p$ are also colored green, and the child vertices of $t_i$ and $d_p(t_i)$ in $G_c$, together with the parent vertices of those child vertices in $G_p$, are colored red (lines 6-8 of the pseudo code). That is, for $t_i \Rightarrow t_j$ it can be deduced that $t_j$ is also correct, and for $t_i \Rightarrow \neg t_j$ it can be deduced that $t_j$ is incorrect. If $t_i$ is incorrect, then not only is $t_i$ colored red, but all parent vertices of $t_i$ in $G_p$ are also colored red (line 10 of the pseudo code). When all vertices have been colored, the coloring process ends; otherwise, the coloring algorithm selects the next uncolored vertex and repeats the above steps.
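The propagation step of the coloring strategy can be sketched as follows. This is not the pseudo code of fig. 4, which is not reproduced in the text; it is an illustration that assumes "child vertices" and "parent vertices" are taken transitively, and the function names are hypothetical. The edge lists are those of Table 1-3.

```python
from collections import deque

E_P = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
       ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]
E_C = [("t5", "t6"), ("t6", "t5")]


def reachable(start_set, adjacency):
    """All vertices reachable from start_set (inclusive) via adjacency."""
    seen = set(start_set)
    queue = deque(start_set)
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def propagate(vertex, is_correct, e_p, e_c, colors):
    """Color `vertex` from a crowdsourced answer and infer related colors."""
    children_p, parents_p, children_c = {}, {}, {}
    for u, v in e_p:
        children_p.setdefault(u, []).append(v)
        parents_p.setdefault(v, []).append(u)
    for u, v in e_c:
        children_c.setdefault(u, []).append(v)
    if is_correct:
        green = reachable({vertex}, children_p)           # t_i and d_p(t_i)
        contradicted = {v for u in green for v in children_c.get(u, ())}
        red = reachable(contradicted, parents_p)          # and their G_p parents
    else:
        green = set()
        red = reachable({vertex}, parents_p)              # t_i and its G_p parents
    for v in green:
        colors.setdefault(v, "green")
    for v in red:
        colors.setdefault(v, "red")
    return colors


after_t4_correct = propagate("t4", True, E_P, E_C, {})
after_t6_wrong = propagate("t6", False, E_P, E_C, {})
```

With the example graph, a single correct answer for t4 colors five of the eight vertices, which is the effect the coloring framework relies on to reduce the number of crowdsourcing tasks.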
It will be appreciated that in order to make reasonable use of the crowdsourcing budget, i.e. to keep the number of crowdsourcing tasks within the crowdsourcing budget, to reduce the crowdsourcing cost, as few candidate subsets of knowledge may be selected to perform the crowdsourcing tasks as possible, i.e. as few vertices as possible are selected from the multitude of vertices in the graph, so that all vertices in the graph may be colored.
To better describe the optimal vertices, the notion of a boundary vertex may be introduced; the optimal vertices may be identical to the boundary vertices. A boundary vertex may be defined as follows: a vertex is a boundary vertex if its color cannot be inferred from the colors of other vertices.
The second optimal vertex may refer to a boundary vertex in the second directed graph. The optimal vertex selection method can be a vertex selection algorithm based on a path and a vertex selection algorithm based on topological sorting.
In some embodiments of the present invention, the process of selecting the first preset number of second optimal vertices from the vertices according to the preset vertex selection algorithm may specifically be: dividing the second directed graph into a first sub-graph containing all consistent relations and a second sub-graph containing all contradictory relations according to the consistent relations and the contradictory relations among all vertexes; decomposing the first subgraph into a set of disjoint paths, wherein any two of the disjoint paths have no common vertex; analyzing the disjoint paths based on a binary search method to obtain a second optimal vertex; and taking the vertex with the maximum confidence coefficient and the vertex with zero in-degree in the second subgraph as the second optimal vertex.
It will be appreciated that the first subgraph may refer to the $G_p$ mentioned above; for a detailed description, please refer to the corresponding content above, which is not repeated here.
The second subgraph may refer to the $G_c$ mentioned above; for a detailed description, please refer to the corresponding content above, which is not repeated here.
For convenience of subsequent description, $V_B$ may be used to denote the set of boundary vertices in graph $G$, and $V_B(G_p)$ and $V_B(G_c)$ denote the boundary vertices of the subgraphs $G_p$ and $G_c$, respectively.
It can be understood that the color of a vertex in $V_B$ can be inferred neither from $G_p$ nor from $G_c$. The vertices in $V_B$ must therefore all be labeled manually, because their colors cannot be derived by inference. However, since the true values of the vertices in the graph are unknown, these boundary vertices cannot be identified in advance. To solve this problem, a $G_p$ boundary vertex identification algorithm with a guaranteed theoretical upper bound and a greedy $G_c$ boundary vertex identification algorithm are proposed. In addition, since $G_c$ generally has more boundary vertices, that is, the influence of a boundary vertex in $G_c$ is more limited than in $G_p$, the boundary vertices of $G_p$ can be considered first in order to reduce the number of candidate vertices.
The specific process of the $G_p$ boundary vertex identification algorithm may be as follows: select the middle vertex of a path and determine the next step according to the crowdsourcing result for that vertex. If the candidate knowledge subset corresponding to the middle vertex is correct, the colors of its child vertices can be inferred, but the colors of its parent vertices cannot. The next crowdsourced vertex is therefore the middle vertex between the current vertex and the start vertex of the path. If the candidate knowledge subset corresponding to the middle vertex is wrong, the colors of its parent vertices can be inferred, but the colors of its child vertices cannot. The next crowdsourced vertex is therefore the middle vertex between the current vertex and the end vertex of the path. By iterating in this way, all boundary vertices can be found. If the number of vertices on a path $P$ is $|P|$, then the number of crowdsourced vertices is $O(\log |P|)$.
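The binary search over a single path can be sketched as follows. The sketch assumes every consecutive pair on the path is a consistent relation (path[i] => path[i+1]), so the correct vertices form a suffix of the path. The callback `ask`, which stands in for a crowdsourcing query, and the function name `find_boundary` are hypothetical.

```python
def find_boundary(path, ask):
    """Color every vertex on a consistent path using O(log |P|) queries.

    `ask(vertex)` returns True when the crowd judges the vertex correct.
    """
    asked = []
    lo, hi = 0, len(path)  # index of the first correct vertex lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        asked.append(path[mid])
        if ask(path[mid]):
            hi = mid        # children are inferred correct; search toward the start
        else:
            lo = mid + 1    # parents are inferred wrong; search toward the end
    colors = {v: ("green" if i >= lo else "red") for i, v in enumerate(path)}
    return colors, asked


# Hypothetical ground truth for one path of the example graph.
truth = {"t1": False, "t4": False, "t6": True, "t8": True, "t7": True}
path = ["t1", "t4", "t6", "t8", "t7"]
colors, asked = find_boundary(path, lambda v: truth[v])
```

In this run only t6 and t4 are sent to the crowd; the colors of the remaining three vertices are inferred.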
The specific process of the greedy $G_c$ boundary vertex identification algorithm may be as follows: greedily select the vertices of $G_c$ whose in-degree is 0 and, within each set of mutually contradictory vertices (i.e., each cluster of $G_c$), the vertex with the highest confidence.
The path-based vertex selection algorithm first computes the disjoint paths of $G_p$ by maximum matching, then uses binary search to select an optimal vertex on the longest of the disjoint paths for a crowdsourcing task, and colors the vertices in the graph using the coloring strategy; the colored vertices are then removed from the graph and the above steps are repeated. When no path has a length greater than 1, optimal vertices are selected according to $G_c$ for crowdsourcing tasks. For the specific implementation, refer to the algorithm in fig. 5, which is a pseudo code diagram of the single-path implementation of the path-based vertex selection algorithm.
It will be appreciated that the graph $G_p$ may be decomposed into a set of disjoint paths (i.e., no two paths share a vertex). The length of a path is at most $|V|$, so the number of crowdsourced vertices is $O(\beta \log |V|)$, where $\beta$ is the number of disjoint paths.
To find the disjoint paths, the graph $G_p$ may be converted into a bipartite graph
Figure BDA0001257872060000181
where
Figure BDA0001257872060000182
If there is an edge $(v_1, v_2) \in E_p$, then there is also an edge between $v_1 \in V_1^b$ and
Figure BDA0001257872060000183
Figure BDA0001257872060000184
The maximum matching may refer to a maximum set of edges in which no two edges share a vertex in
Figure BDA0001257872060000185
or in
Figure BDA0001257872060000186
that is, for any two edges $(v, v')$ and $(u, u')$ in the maximum matching, $v \ne u$ and $v' \ne u'$.
For convenience of description, the maximum matching may be denoted by $M$, with $Y_1$ denoting the first vertex set in $M$ and $Y_2$ denoting the second vertex set in $M$. The set
Figure BDA0001257872060000187
represents the set of vertices without in-degree, and a vertex without in-degree may be taken as the first vertex of a path. For each such vertex $v$, if it has an edge $(v, v')$, then $v'$ is taken as the second vertex of the path; then it is checked whether $v'$ has an edge $(v', v'')$. By iterating in this way, a path starting at $v$ can be found. The paths computed by maximum matching are disjoint, complete and minimal.
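The path decomposition can be sketched with a simple augmenting-path matching. This is an illustration only: it assumes $G_p$ is acyclic, uses a basic augmenting-path search rather than any particular matching algorithm named in the original, and the function name `disjoint_paths` is hypothetical. The edges are the consistent relations of Table 1-3.

```python
VERTICES = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
E_P = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
       ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]


def disjoint_paths(vertices, edges):
    """Cover an acyclic graph with vertex-disjoint paths via bipartite matching."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    match_to = {}  # right-side copy of v -> matched left-side vertex u

    def try_augment(u, seen):
        # Try to match left vertex u, re-routing earlier matches if needed.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_to or try_augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    for u in vertices:
        try_augment(u, set())

    next_of = {u: v for v, u in match_to.items()}
    # A vertex that is unmatched on the right side has no matched incoming
    # edge, so it starts a path.
    paths = []
    for start in vertices:
        if start in match_to:
            continue
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths


paths = disjoint_paths(VERTICES, E_P)
```

The number of paths equals the number of vertices minus the size of the matching, so a larger matching yields fewer paths and therefore fewer binary searches.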
The path-based vertex selection algorithm described above can publish only a single candidate knowledge subset to the crowdsourcing platform at a time, which may cause significant latency. To overcome this drawback, the path-based algorithm may be made to support parallel processing, so that multiple vertices are selected in each iteration and crowdsourced simultaneously. The specific implementation may follow the algorithm shown in fig. 6, which is a pseudo code diagram of the multi-path implementation of the path-based vertex selection algorithm. The algorithm first computes the $\beta$ disjoint paths of $G_p$, selects an optimal vertex from each path using the optimal vertex selection strategy, publishes the candidate facts corresponding to the selected vertices to the crowdsourcing platform at the same time, colors the vertices in the graph according to the crowdsourced answers, removes the colored vertices, and repeats these steps until all vertices are colored.
The multi-path concurrent algorithm may produce conflicts during coloring. For example, assume $t_i$ is colored green and $t_j$ is colored red. If $t_i \Rightarrow t$ and $t \Rightarrow t_j$, then a conflict arises when coloring $t$, since $t$ is inferred to be green from $t_i$ and red from $t_j$. A majority voting mechanism can be used to resolve such conflicts; when the two colors receive the same number of votes, one of them is chosen at random.
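The majority vote with a random tie-break can be sketched as follows. The function name `resolve_color` and the optional `rng` parameter, added so that the tie-break can be made reproducible, are assumptions for the example.

```python
import random
from collections import Counter


def resolve_color(inferred_colors, rng=random):
    """Pick the color inferred most often; break ties at random."""
    counts = Counter(inferred_colors)
    best = max(counts.values())
    winners = sorted(color for color, n in counts.items() if n == best)
    if len(winners) == 1:
        return winners[0]
    return rng.choice(winners)


# Vertex t receives "green" from two inferences and "red" from one.
majority = resolve_color(["green", "green", "red"])
# One vote each: the result is random, so a seeded generator is passed in.
tie = resolve_color(["green", "red"], rng=random.Random(0))
```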
Since the path-based vertex selection algorithm needs to compute a maximum matching, it is slow in practice for a large-scale knowledge base. To solve this problem, a vertex selection algorithm based on topological sorting is proposed. The algorithm first finds the set of vertices with an in-degree of 0, denoted $L_1$; these vertices are then deleted from the graph and the next set of vertices with an in-degree of 0 is found, denoted $L_2$; the above steps are repeated until all vertices are deleted. Obviously, the in-degree of the vertices in each subset $L_i$ is 0, so each $L_i$ can be viewed as a set of independent vertices.
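The layering step can be sketched as follows, assuming $G_p$ is acyclic and its edge list contains no duplicates. The function name `topological_layers` is hypothetical; the edges are the consistent relations of Table 1-3.

```python
VERTICES = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
E_P = [("t1", "t2"), ("t1", "t3"), ("t1", "t4"), ("t2", "t7"),
       ("t4", "t6"), ("t4", "t7"), ("t6", "t8"), ("t8", "t7")]


def topological_layers(vertices, edges):
    """Peel off zero in-degree vertices repeatedly: L_1, L_2, ..."""
    indegree = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for u, v in edges:
        indegree[v] += 1
        children[u].append(v)
    layer = sorted(v for v in vertices if indegree[v] == 0)
    layers = []
    while layer:
        layers.append(layer)
        next_layer = []
        for u in layer:
            # Deleting u lowers the in-degree of each of its children.
            for v in children[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    next_layer.append(v)
        layer = sorted(next_layer)
    return layers


layers = topological_layers(VERTICES, E_P)
```

Each layer is computed in time linear in the number of edges it removes, which avoids the maximum matching used by the path-based algorithm.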
The specific implementation of the vertex selection algorithm based on topological sorting may be as shown in fig. 7, which is a pseudo code diagram of that algorithm. The algorithm computes the topologically sorted vertex sets $L_1, L_2, \ldots, L_{|L|}$ of $G_p$ and crowdsources in parallel the vertices of the middle-layer vertex set
Figure BDA0001257872060000191
When $|L| \le 1$, crowdsourced vertices are selected according to $G_c$. $G$ is then colored according to the crowdsourced answers and the coloring strategy, and the colored vertices are removed. The above steps are repeated until all vertices have been colored.
It can be seen that the knowledge refining method provided by the embodiment of the invention utilizes the task selection algorithm based on the graph to reason the correctness of the relevant candidate knowledge subsets through the semantic constraint rule, thereby further reducing the number of candidate knowledge subsets for implementing the crowdsourcing task and further reducing the cost of knowledge refining. Meanwhile, by selecting the optimal vertexes as few as possible, namely selecting the candidate knowledge subsets which can be reduced as soon as possible to implement the crowdsourcing task, the minimum human resources can be utilized, and the improvement of the knowledge quality in the knowledge base is maximized.
Since the crowdsourcing task selection algorithm may be any one of a task selection algorithm based on ranking and a task selection algorithm based on a graph, or the task selection algorithm based on ranking and the task selection algorithm based on a graph may be combined, in order to better introduce a specific process of the algorithm, the crowdsourcing task selection algorithm after the task selection algorithm based on ranking and the task selection algorithm based on a graph are combined will be introduced below.
Therefore, on the basis of the above-described embodiment of the ranking-based task selection algorithm, after selecting the first preset number of knowledge subsets from the candidate knowledge subsets according to the evaluation scores, the method may further include: generating a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset; taking each knowledge subset as a vertex, and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph; selecting a second preset number of first optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertexes as optimal knowledge subsets; the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
It should be noted that, in view of the fact that the graph-based crowdsourcing task selection algorithm can perform inference by using the consistent relation semantic constraint rule and the contradictory relation semantic constraint rule to determine the correctness of some optimal knowledge subsets, the graph-based crowdsourcing task selection algorithm can be used to perform inference judgment on the selected optimal knowledge subset after the optimal knowledge subset is selected by using the ranking-based crowdsourcing task selection algorithm, so as to further reduce crowdsourcing tasks and reduce cost.
At this time, the knowledge subsets selected by the sorting-based crowdsourcing task selection algorithm can be modeled, that is, each selected knowledge subset is used as a vertex, and then the correctness of each knowledge subset is inferred according to the semantic constraint rule.
Obviously, after the graph model is built, the subsequent processing steps are similar to the individual graph-based crowdsourcing task selection algorithm, so the related processing steps can be referred to the introduction of the graph-based crowdsourcing task selection algorithm, and are not described herein again.
The knowledge refining problem can be converted into the graph coloring problem by the graph-based crowdsourcing task selection algorithm, so in some embodiments of the present invention, the process of performing denoising operation on the knowledge base according to the task feedback result may specifically be: when the task feedback result is correct, the first optimal vertex corresponding to the task feedback result is colored to be a first color; when the task feedback result is wrong, the first optimal vertex corresponding to the task feedback result is colored to be a second color; according to the consistent relation semantic constraint rule of the semantic constraint rule and the contradictory relation semantic constraint rule, other vertexes are colored into the first color or the second color; and removing the knowledge subsets corresponding to the vertices of the first directed graph with the second color.
The first color and the second color may be arbitrarily set, and for example, the first color may be set to green, and the second color may be set to red.
A graph-based crowdsourcing task selection algorithm may produce errors. One kind of error is caused by worker mistakes: for example, if a candidate knowledge subset $t_i$ is incorrect but a worker mistakenly marks it as correct, the error is due to the worker. Another kind of error is propagated through inference rules: for example, if another candidate knowledge subset $t_j$ that contradicts $t_i$ is in fact correct, the graph-based crowdsourcing task selection algorithm will wrongly label $t_j$ as incorrect according to the inference rule, and this error is propagated through the inference rules.
To tolerate worker errors, each candidate knowledge subset may be assigned to multiple workers, and the answers of those workers may then be combined to derive a confidence for the worker answers on that candidate knowledge subset. For example, assume that each candidate knowledge subset is assigned to $z$ workers and
Figure BDA0001257872060000211
If $y$ workers give the same answer (e.g., "correct") and $z - y$ workers give the other answer (e.g., "incorrect"), then the confidence of the worker answers is
Figure BDA0001257872060000221
To overcome the errors propagated through inference rules, logistic regression models can be used to determine the uncertain vertex colors. Generally, if the confidence of worker answers for a candidate subset of knowledge is greater than a certain threshold (e.g., greater than 0.8), then the vertices in the graph may be colored using a coloring policy. Otherwise, it is colored first to a third color (e.g., blue) and the vertex is not utilized to color other vertices; the other colored vertices are then used as true values and used to color the vertices of a third color (i.e., indeterminate vertices). In particular, a logistic regression model of the confidence of the candidate knowledge subsets (the confidence provided by the information extraction system) may be trained using the colored vertices and used to predict the color of the uncertain vertices. A specific fault-tolerant shading algorithm is shown in fig. 8, and fig. 8 is a pseudo code diagram of the fault-tolerant shading algorithm. The algorithm only colors other vertices (line 5 code in the figure) using a coloring strategy when the vertices have high-confidence answers, and finally colors the uncertain vertices using a logistic regression model (lines 7-16 code in the figure).
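The first stage of the fault-tolerant coloring, deciding whether a crowdsourced vertex is colored directly or marked uncertain, can be sketched as follows. The confidence formula is not recoverable from the text, so the sketch assumes it is the fraction y / z of workers who gave the majority answer; that assumption, the function name `aggregate_answers`, and the sample answers are hypothetical. The logistic regression stage is not shown.

```python
from collections import Counter


def aggregate_answers(answers, threshold=0.8):
    """Combine z worker answers (True = correct) into a color and a confidence."""
    counts = Counter(answers)
    majority_answer, y = counts.most_common(1)[0]
    # Assumption: confidence of the worker answers is modeled as y / z.
    confidence = y / len(answers)
    if confidence > threshold:
        color = "green" if majority_answer else "red"
    else:
        color = "blue"  # uncertain vertex; not used to color other vertices
    return color, confidence


confident = aggregate_answers([True, True, True, True, True])
uncertain = aggregate_answers([True, True, True, False, False])
rejected = aggregate_answers([False] * 9 + [True])
```

Only vertices that come back green or red would then be passed to the coloring strategy; blue vertices wait for the model-based prediction described in the text.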
It can be seen that the knowledge refining method provided by the embodiment of the invention, on the basis of the ranking-based crowdsourcing task selection algorithm, further uses the graph-based crowdsourcing task selection algorithm to reason about the knowledge subsets, so as to further reduce the crowdsourcing tasks and reduce the cost. Meanwhile, a fault-tolerant coloring technique is provided, which improves the accuracy of graph coloring.
The knowledge refining apparatus provided by the embodiment of the present invention is described below, and the knowledge refining apparatus described below and the knowledge refining method described above may be referred to in correspondence with each other.
Fig. 9 is a block diagram of a knowledge refining apparatus according to an embodiment of the present invention, and with reference to fig. 9, the knowledge refining apparatus may include:
an obtaining module 901, configured to obtain a candidate knowledge subset in an automatically extracted knowledge base;
an optimal selection module 902, configured to select, according to a crowdsourcing task selection algorithm, a first preset number of optimal knowledge subsets from the candidate knowledge subsets, where the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to a preset crowdsourcing task number;
a task implementation module 903, configured to issue a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;
and the denoising module 904 is configured to perform denoising operation on the knowledge base according to the task feedback result.
Optionally, the optimal selection module includes:
the uncertainty calculation unit is used for calculating a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence coefficient of the used knowledge extraction algorithm;
the contradictory calculation unit is used for calculating a second numerical value representing the degree of the contradiction of the candidate knowledge subset according to the contradictory relation semantic constraint rules in the semantic constraint rules;
the evaluation unit is used for calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;
a selecting unit, configured to select the knowledge subsets of the first preset number from the candidate knowledge subsets according to the evaluation scores, and use the knowledge subsets as the optimal knowledge subsets;
wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be correct.
Optionally, the apparatus further includes:
a closed rule generating module, configured to generate a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;
the directed graph establishing module is used for taking each knowledge subset as a vertex and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;
the optimal vertex selection module is used for selecting a second preset number of first optimal vertices from the vertices according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertices as the optimal knowledge subsets;
the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
Optionally, the denoising module includes:
the first coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a first color when the task feedback result is correct;
the second coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a second color when the task feedback result is wrong;
a third coloring unit, configured to color other vertices into the first color or the second color according to the consistent relationship semantic constraint rule of the semantic constraint rules and the contradictory relationship semantic constraint rule;
and the removing unit is used for removing the knowledge subset corresponding to the vertex of which the color on the first directed graph is the second color.
The knowledge refining apparatus provided by the embodiment of the invention selects a preset number of optimal knowledge subsets from the candidate knowledge subsets of the knowledge base to generate crowdsourcing tasks, and denoises the knowledge base according to the results of those tasks. In other words, the automatically extracted knowledge is refined on a crowdsourcing platform: noise in the knowledge base is removed by manual labeling. Manual labeling can judge the correctness of knowledge subsets that traditional automatic refining algorithms find difficult to identify, so the knowledge base contains less noise and its knowledge quality is higher. Because only a preset number of candidate knowledge subsets are selected for crowdsourcing tasks, the improvement in knowledge quality is maximized under limited resources. The apparatus is therefore advantageous for improving the knowledge quality of an automatically extracted knowledge base.
The embodiments are described in a progressive manner: each embodiment focuses on its differences from the others, and the same or similar parts may be cross-referenced among the embodiments. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for the relevant details, refer to the description of the method.
The knowledge refining method and apparatus provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A method of knowledge refining, comprising:
acquiring a candidate knowledge subset in an automatically extracted knowledge base; the knowledge base is constructed from Web data by an information extraction system;
selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; the optimal knowledge subset refers to a knowledge subset capable of improving the quality of a knowledge base;
issuing a crowdsourcing task based on the optimal knowledge subset to obtain a task feedback result;
and carrying out denoising operation on the knowledge base according to the task feedback result.
2. The method of claim 1, wherein said selecting a first preset number of optimal knowledge subsets from said candidate knowledge subsets according to a crowdsourcing task selection algorithm comprises:
calculating to obtain a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence of the used knowledge extraction algorithm;
according to the contradictory relation semantic constraint rules in the semantic constraint rules, calculating to obtain a second numerical value representing the degree of contradiction of the candidate knowledge subsets;
calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;
selecting the knowledge subsets with the first preset number from the candidate knowledge subsets according to the evaluation score, and taking the knowledge subsets as the optimal knowledge subsets;
wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.
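For illustration, the scoring and selection steps of this claim might look like the sketch below. The claim does not fix the formulas, so the uncertainty measure (distance of the confidence from the threshold), the contradiction degree (number of contradictory pairs a subset appears in), the weighted-sum evaluation function, and all names are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # identifier of the candidate knowledge subset
    confidence: float  # confidence from the extraction algorithm, in [0, 1]


def uncertainty(candidate, threshold):
    # Assumed form: the closer the confidence is to the preset threshold,
    # the harder it is for the extraction algorithm to decide correctness.
    return 1.0 - abs(candidate.confidence - threshold)


def contradiction_degree(candidate, contradictions):
    # Assumed form: number of contradictory-relation pairs naming this subset.
    return sum(1 for pair in contradictions if candidate.name in pair)


def select_optimal(candidates, contradictions, threshold, k, alpha=0.5):
    """Return the k candidates with the highest evaluation score.

    The evaluation function here is an assumed weighted sum of the first
    value (uncertainty) and the second value (contradiction degree).
    """
    def score(c):
        return (alpha * uncertainty(c, threshold)
                + (1 - alpha) * contradiction_degree(c, contradictions))

    return sorted(candidates, key=score, reverse=True)[:k]
```

Here `k` plays the role of the first preset number, which the claim bounds by the preset number of crowdsourcing tasks.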
3. The method of claim 2, wherein, after said selecting said first preset number of knowledge subsets from said candidate knowledge subsets according to said evaluation score, the method further comprises:
generating a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;
taking each knowledge subset as a vertex, and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;
selecting a second preset number of first optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertexes as optimal knowledge subsets;
the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
4. The method of claim 3, wherein said denoising the knowledge base according to the task feedback result comprises:
when the task feedback result is correct, the first optimal vertex corresponding to the task feedback result is colored to be a first color;
when the task feedback result is wrong, the first optimal vertex corresponding to the task feedback result is colored to be a second color;
according to the consistent relation semantic constraint rule of the semantic constraint rule and the contradictory relation semantic constraint rule, other vertexes are colored into the first color or the second color;
and removing the knowledge subsets corresponding to the vertices of the first directed graph with the second color.
5. The method of claim 1, wherein said selecting a first preset number of optimal knowledge subsets from said candidate knowledge subsets according to a crowdsourcing task selection algorithm comprises:
generating a second closed semantic constraint rule according to the semantic constraint rule and the candidate knowledge subset;
taking each candidate knowledge subset as a vertex, and connecting the vertices according to the second closed semantic constraint rule to obtain a second directed graph;
selecting a first preset number of second optimal vertexes from the vertexes according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the second optimal vertexes as optimal knowledge subsets;
the second optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
6. The method of claim 5, wherein said selecting said first predetermined number of second optimal vertices from said vertices according to a predetermined vertex selection algorithm comprises:
dividing the second directed graph into a first sub-graph containing all consistent relations and a second sub-graph containing all contradictory relations according to the consistent relations and the contradictory relations among all vertexes;
decomposing the first subgraph into a set of disjoint paths, wherein any two of the disjoint paths have no common vertex;
analyzing the disjoint paths based on a binary search method to obtain a second optimal vertex;
and taking the vertex with the maximum confidence coefficient and the vertex with zero in-degree in the second subgraph as the second optimal vertex.
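The path decomposition and binary search steps of this claim can be illustrated with the following sketch, which covers only the first (consistent-relation) subgraph. It assumes an edge (u, v) means "if u is correct then v is correct", so along any path the wrong vertices form a prefix and the correct ones a suffix. The greedy decomposition, the `ask` callback standing in for a crowdsourcing answer, and the function names are hypothetical.

```python
def disjoint_paths(vertices, edges):
    """Greedily split a directed graph into vertex-disjoint paths.

    Every vertex ends up on exactly one path; the result is a valid
    decomposition but not necessarily one with the fewest paths.
    """
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    visited, paths = set(), []
    for start in vertices:
        if start in visited:
            continue
        path, current = [], start
        while current is not None:
            visited.add(current)
            path.append(current)
            # Extend the path through the first successor not yet used.
            current = next((n for n in succ[current] if n not in visited), None)
        paths.append(path)
    return paths


def first_correct_index(path, ask):
    """Binary search for the wrong/correct boundary on one path.

    ask(vertex) returns True if the crowd judges the vertex correct.
    Returns (index of the first correct vertex, list of vertices asked);
    the index equals len(path) when no vertex is correct.
    """
    lo, hi, asked = 0, len(path), []
    while lo < hi:
        mid = (lo + hi) // 2
        asked.append(path[mid])
        if ask(path[mid]):
            hi = mid       # boundary is at mid or earlier
        else:
            lo = mid + 1   # everything up to mid is wrong
    return lo, asked
```

Under these assumptions a path of n vertices needs only about log2(n) crowdsourced questions, which is the reason for searching each disjoint path separately.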
7. An apparatus for knowledge refining, comprising:
an acquisition module for acquiring a candidate knowledge subset in an automatically extracted knowledge base; the knowledge base is constructed from Web data by an information extraction system;
the optimal selection module is used for selecting a first preset number of optimal knowledge subsets from the candidate knowledge subsets according to a crowdsourcing task selection algorithm, wherein the crowdsourcing task selection algorithm is an algorithm based on a semantic constraint rule, and the first preset number is less than or equal to the preset crowdsourcing task number; the optimal knowledge subset refers to a knowledge subset capable of improving the quality of a knowledge base;
the task implementation module is used for issuing crowdsourcing tasks based on the optimal knowledge subset to obtain task feedback results;
and the denoising module is used for carrying out denoising operation on the knowledge base according to the task feedback result.
8. The apparatus of claim 7, wherein the optimal selection module comprises:
the uncertainty calculation unit is used for calculating a first numerical value representing the uncertainty of the candidate knowledge subset according to a preset threshold and the confidence coefficient of the used knowledge extraction algorithm;
the contradictory calculation unit is used for calculating a second numerical value representing the degree of the contradiction of the candidate knowledge subset according to the contradictory relation semantic constraint rules in the semantic constraint rules;
the evaluation unit is used for calculating the first numerical value and the second numerical value based on a preset evaluation function to obtain an evaluation score of each candidate knowledge subset;
a selecting unit, configured to select the knowledge subsets of the first preset number from the candidate knowledge subsets according to the evaluation scores, and use the knowledge subsets as the optimal knowledge subsets;
wherein the uncertainty is a property that measures how easily the extraction algorithm determines the candidate knowledge subset to be the correct knowledge subset.
9. The apparatus of claim 8, further comprising:
a closed rule generating module, configured to generate a first closed semantic constraint rule according to the semantic constraint rule and the knowledge subset;
the directed graph establishing module is used for taking each knowledge subset as a vertex and connecting the vertices according to the first closed semantic constraint rule to obtain a first directed graph;
the optimal vertex selection module is used for selecting a second preset number of first optimal vertices from the vertices according to a preset vertex selection algorithm, and taking the knowledge subsets corresponding to the first optimal vertices as the optimal knowledge subsets;
the first optimal vertex is a vertex whose vertex color cannot be inferred from colors of other vertices, the second preset number is smaller than the first preset number, and the preset vertex selection algorithm is any one of a path-based vertex selection algorithm and a topology-based vertex selection algorithm.
10. The apparatus of claim 9, wherein the denoising module comprises:
the first coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a first color when the task feedback result is correct;
the second coloring unit is used for coloring the color of the first optimal vertex corresponding to the task feedback result into a second color when the task feedback result is wrong;
a third coloring unit, configured to color other vertices into the first color or the second color according to the consistent relationship semantic constraint rule of the semantic constraint rules and the contradictory relationship semantic constraint rule;
and the removing unit is used for removing the knowledge subset corresponding to the vertex of which the color on the first directed graph is the second color.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710197975.1A CN106951963B (en) 2017-03-29 2017-03-29 Knowledge refining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710197975.1A CN106951963B (en) 2017-03-29 2017-03-29 Knowledge refining method and device

Publications (2)

Publication Number Publication Date
CN106951963A CN106951963A (en) 2017-07-14
CN106951963B true CN106951963B (en) 2020-05-22

Family

ID=59475493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710197975.1A Active CN106951963B (en) 2017-03-29 2017-03-29 Knowledge refining method and device

Country Status (1)

Country Link
CN (1) CN106951963B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272003A (en) * 2017-07-17 2019-01-25 华东师范大学 A kind of method and apparatus for eliminating unknown error in deep learning model
CN108596501A (en) * 2018-04-28 2018-09-28 华东师范大学 Method for allocating tasks, device, medium, equipment based on technical ability figure and system
CN110414680A (en) * 2019-07-23 2019-11-05 国家计算机网络与信息安全管理中心 Knowledge system of processing based on crowdsourcing mark
CN111259624B (en) * 2020-01-15 2023-03-31 北京百度网讯科技有限公司 Triple data labeling method and device in knowledge graph
CN111444332B (en) * 2020-03-13 2023-04-18 广州大学 Crowdsourcing worker reliability model establishing method and device under crowdsourcing knowledge verification environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725418B2 (en) * 2005-01-28 2010-05-25 Honda Motor Co., Ltd. Responding to situations using multidimensional semantic net and Bayes inference

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5412758A (en) * 1991-04-16 1995-05-02 Hewlett-Packard Company Flexible system for knowledge acquisition in expert system development
CN101556606A (en) * 2009-05-20 2009-10-14 同方知网(北京)技术有限公司 Data mining method based on extraction of Web numerical value tables
CN102044018A (en) * 2010-12-13 2011-05-04 北京航空航天大学 Knowledge acquisition template for product reliability design and criteria extracting method
WO2013028322A1 (en) * 2011-08-24 2013-02-28 Trese Andrew Systems, methods, and media for controlling the review of a document
CN102682119A (en) * 2012-05-16 2012-09-19 崔志明 Deep webpage data acquiring method based on dynamic knowledge
CN104866593A (en) * 2015-05-29 2015-08-26 中国电子科技集团公司第二十八研究所 Database searching method based on knowledge graph
CN105677874A (en) * 2016-01-11 2016-06-15 江苏省现代企业信息化应用支撑软件工程技术研发中心 Method and device for integrating extracted web table data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
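Shangpu Jiang et al., "Learning to Refine an Automatically Extracted Knowledge Base using Markov Logic", 2012 IEEE 12th International Conference on Data Mining, 2012-12-31, pp. 912-917 *
Tetsuya Yoshida, "Refining Diagnostic Knowledge Extracted from Interferon Therapy by Graph-Based Induction", IEEE, 2005-12-31, pp. 63-68 *
陈忆群 et al., "挖掘专利知识实现关键词自动抽取" (Mining patent knowledge for automatic keyword extraction), 计算机研究与发展 (Journal of Computer Research and Development), 2016-08-31, vol. 53, no. 8, pp. 1740-1752 *
刘峤 et al., "知识图谱构建技术综述" (A survey of knowledge graph construction techniques), 计算机研究与发展 (Journal of Computer Research and Development), 2016-12-31, vol. 53, no. 3, pp. 582-600 *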

Also Published As

Publication number Publication date
CN106951963A (en) 2017-07-14

Similar Documents

Publication Publication Date Title
CN106951963B (en) Knowledge refining method and device
Wienand et al. Detecting incorrect numerical data in dbpedia
Hauke et al. Recent development of social simulation as reflected in JASSS between 2008 and 2014: A citation and co-citation analysis
CN108491302B (en) Method for detecting spark cluster node state
JPH1196010A (en) Sorting device
TWI590095B (en) Verification system for software function and verification mathod therefor
Collaris et al. Instance-level explanations for fraud detection: A case study
Marques-Silva Computing Minimally Unsatisfiable Subformulas: State of the Art and Future Directions.
JP5682448B2 (en) Causal word pair extraction device, causal word pair extraction method, and causal word pair extraction program
CN106778063A (en) A kind of protein complex recognizing method based on graph model
Wang et al. On the use of time series and search based software engineering for refactoring recommendation
CN116432570A (en) Method and device for generating test case of chip and storage medium
CN111444635B (en) System dynamics simulation modeling method and system based on XML language
CN116089504B (en) Relational form data generation method and system
van de Ven et al. Determining capacity of shunting yards by combining graph classification with local search
CN111914772A (en) Method for identifying age, and training method and device of age identification model
Yevseyeva Solving classification problems with multicriteria decision aiding approaches
Kessentini et al. Improving web services design quality using heuristic search and machine learning
Flint et al. Perceptron learning of SAT
Compton Simulating expertise
CN112835797A (en) Metamorphic relation prediction method based on program intermediate structure characteristics
Menezes et al. Automatic discovery of agent based models: An application to social anthropology
US7650579B2 (en) Model correspondence method and device
Martinez et al. A model for detecting conflicts and dependencies in non-functional requirements using scenarios and use cases
Njah et al. A new equilibrium criterion for learning the cardinality of latent variables

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230629

Address after: Unit 5-A3F, Creative Industry Park, No. 328, Xinghu Street, Suzhou Industrial Park, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province, 215000

Patentee after: Suzhou Chuhui Intelligent Technology Co.,Ltd.

Address before: 215123 No. 199 benevolence Road, Suzhou Industrial Park, Jiangsu, China

Patentee before: SOOCHOW University