CN113535967B - Chinese universal concept map error correction device - Google Patents


Info

Publication number
CN113535967B
Authority
CN
China
Prior art keywords
concept
concepts
isa
incompatible
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010303271.XA
Other languages
Chinese (zh)
Other versions
CN113535967A (en)
Inventor
方世能
刘井平
肖仰华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010303271.XA
Publication of CN113535967A
Application granted
Publication of CN113535967B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese universal concept map error correction device for correcting isA relationships between entities and concepts in a Chinese general concept graph, characterized by comprising: a concept graph acquisition module for acquiring all isA relationships in the concept graph together with their corresponding entities and concepts; an incompatible concept pair construction module for judging in turn whether each pair of concepts is compatible and constructing a plurality of groups of incompatible concept pairs from all pairs of incompatible concepts, where each group comprises one entity serving as a suspicious entity, two concepts serving as concepts to be judged, and the two corresponding isA relationships serving as isA relationships to be corrected; an erroneous isA relationship determination module for determining, group by group, which one of the two isA relationships to be corrected is erroneous, based on the suspicious entity and the concepts to be judged in each group; and a concept graph error correction module for deleting from the concept graph the isA relationships to be corrected that were determined to be erroneous, thereby completing the error correction of the concept graph.

Description

Chinese universal concept map error correction device
Technical Field
The invention belongs to the field of knowledge graph quality control, and in particular relates to an error correction device for a Chinese general concept knowledge graph.
Background
A concept graph is a kind of knowledge graph that focuses on the isA relationships between entities and concepts. It contains three elements: entities, concepts, and isA relationships. The isA relationships can be subdivided into instanceOf relationships between entities and concepts (e.g., apple isA fruit) and subClass relationships between concepts (e.g., fruit isA food). Concept graphs have important applications in tasks such as text classification, entity recommendation, and rule mining. However, a concept graph is usually constructed by automatically extracting isA relationships from Internet corpora, so noise makes erroneous hypernym-hyponym relations hard to avoid. Concept graph error correction therefore aims to remove the erroneous isA relationships from among tens of millions of isA relationships and thereby improve the accuracy of the concept graph.
In the prior art, conceptual map error correction methods can be divided into two categories:
1) Embedding-based methods. These methods first crawl a large-scale corpus from the Internet and extract isA relationships by applying lexico-syntactic patterns such as those of WebIsA and PatternSim; they then use Poincaré embeddings to compute a confidence score for each isA relationship in the existing concept graph, and finally remove the isA relationships with low confidence.
2) Methods based on incompatible concept pairs. The core idea is that, for an entity that belongs to both concepts of an incompatible concept pair, at least one of the two isA relationships is erroneous. For example, if a concept graph contains both (Alibaba Pictures isA company) and (Alibaba Pictures isA movie), then, because "company" and "movie" are a pair of incompatible concepts, at least one of the two isA relationships is wrong. The method has two phases: phase 1 constructs the incompatible concept pairs, and phase 2 removes the erroneous isA relationships. Methods for constructing incompatible concept pairs include Jaccard distance, cosine similarity, and Hamming distance; methods for removing erroneous isA relationships include a frequency-based method and a method based on the KL divergence of attributes.
Among the above techniques, the embedding-based method needs to crawl a large corpus and extract isA relationships with lexico-syntactic patterns. Crawling the corpus is time-consuming and labor-intensive; moreover, Chinese grammar is complex and its expressions are diverse, so there are no extraction patterns comparable to those available for English isA relationships. In addition, removing erroneous isA relationships by embedding has low accuracy. As for the methods based on incompatible concept pairs, the frequency-based method of phase 2 is hard to apply, because the isA relationships in Chinese concept graphs lack corresponding frequencies and low-frequency isA relationships therefore cannot be singled out for removal. The method based on the KL divergence of attributes cannot effectively tell which isA relationship is wrong for some entities, because the attributes of an entity differ in importance, and it is hard to apply at scale, because most entities lack attribute information.
Disclosure of Invention
To solve the above problems, the invention provides a Chinese universal concept map error correction device that can identify and correct erroneous isA relationships with high accuracy in a Chinese-language setting. The invention adopts the following technical solution:
The invention provides a Chinese universal concept map error correction device for correcting isA relationships between entities and concepts in a Chinese general concept graph, characterized by comprising: a concept graph acquisition module for acquiring all isA relationships in the concept graph and the entity and concept uniquely corresponding to each isA relationship; an incompatible concept pair construction module for judging in turn whether each pair of concepts is compatible and constructing a plurality of groups of incompatible concept pairs based on the entities shared by two incompatible concepts and the corresponding isA relationships, where each group of incompatible concept pairs comprises one entity serving as a suspicious entity, two concepts serving as concepts to be judged, and the two corresponding isA relationships serving as isA relationships to be corrected; an erroneous isA relationship determination module for determining, group by group, which one of the two isA relationships to be corrected is erroneous, based on the suspicious entity and the concepts to be judged in each group; and a concept graph error correction module for deleting from the concept graph the isA relationships to be corrected that were determined to be erroneous, thereby completing the error correction of the concept graph. The erroneous isA relationship determination module has: an encyclopedia entry determination part that acquires the entry label list of the encyclopedia entry of a suspicious entity, judges whether each of the two concepts to be judged belongs to the entry label list, and, if one of the two does not belong to the list, determines that the isA relationship to be corrected corresponding to that concept is erroneous; and a semi-supervised classification determination part that filters the suspicious entities based on the key features corresponding to the concepts to be judged, filters the remaining suspicious entities with a pre-trained BERT classifier so as to determine the erroneous concept to be judged in every incompatible concept pair, and then determines that the isA relationship to be corrected corresponding to the erroneous concept to be judged is erroneous.
The Chinese universal concept map error correction device provided by the invention may further have the technical feature that the semi-supervised classification determination part has a key feature filtering unit for filtering the suspicious entities based on the key features corresponding to the concepts to be judged: if a suspicious entity has a key feature of one concept to be judged and has no key feature of the other, the concept to be judged whose key features the suspicious entity lacks is determined to be erroneous. The key features of each concept to be judged are acquired in advance as follows: the attributes of the hyponym entities of the two concepts to be judged are counted in a pre-acquired training set, and, for each concept, the most frequent hyponym entity attributes that are not shared by the two concepts are taken as the key features of that concept.
The Chinese universal concept map error correction device provided by the invention may further have the technical feature that the semi-supervised classification determination part further has a BERT classification determination unit that stores a pre-trained BERT classifier, inputs the remaining suspicious entities into the BERT classifier one by one to obtain, for each suspicious entity, a probability distribution over the concepts it may belong to, and determines from that distribution which of the two concepts to be judged corresponding to the suspicious entity is erroneous.
The Chinese universal concept map error correction device provided by the invention may further have the technical feature that the BERT classifier adopts a Transformer bidirectional encoder architecture in which multiple Transformer blocks are stacked to extract deep relationships between the tokens of a sequence; within each Transformer block, a multi-head attention mechanism strengthens the semantic associations between tokens, and the output of the Transformer layer is obtained after a feed-forward network layer.
The Chinese universal concept map error correction device provided by the invention may further have the technical feature that the incompatible concept pair construction module constructs the incompatible concept pairs using a MiniJaccard coefficient and a concept attribute distribution similarity, where the MiniJaccard coefficient is:
[Equation image: MiniJaccard coefficient of $c_1$ and $c_2$]
where $|c_1|$ and $|c_2|$ denote the numbers of hyponym entities of concepts $c_1$ and $c_2$, respectively, and $|c_1 \cap c_2|$ denotes the number of hyponym entities shared by $c_1$ and $c_2$. The concept attribute distribution similarity $\mathrm{CPD}(c_1, c_2)$ is:
[Equation image: concept attribute distribution similarity $\mathrm{CPD}(c_1, c_2)$]
where the vectors $x$ and $y$ are the attribute distributions of concepts $c_1$ and $c_2$. The compatibility of concepts $c_1$ and $c_2$ is expressed as:
[Equation image: compatibility $P(c_1, c_2)$]
If the compatibility $P(c_1, c_2)$ is lower than a preset compatibility threshold, the incompatible concept pair construction module constructs the corresponding incompatible concept pairs based on concepts $c_1$ and $c_2$.
Action and Effect of the invention
With the Chinese universal concept map error correction device of the invention, the incompatible concept pair construction module judges the compatibility between the concepts in the concept graph and constructs the corresponding groups of incompatible concept pairs, so that all suspicious hypernym-hyponym relations in the concept graph can be located quickly, and the erroneous isA relationship determination module can then examine each group in turn and determine the erroneous isA relationship. On the one hand, the encyclopedia entry determination part determines the erroneous isA relationship in an incompatible concept pair by retrieving the updated entry labels of the suspicious entity's encyclopedia entry; this is simple and efficient and quickly identifies a small share of the erroneous isA relationships. On the other hand, the semi-supervised classification determination part constructs key features of the concepts to identify the hypernym concepts of some suspicious entities, and identifies the remaining suspicious entities with the BERT classifier, so that the erroneous isA relationship in each incompatible concept pair can be determined accurately. The device can therefore screen and correct all erroneous incompatible concept pairs in the concept graph and eliminate the erroneous isA relationships, yielding a highly accurate concept graph that downstream users or systems can use effectively.
Drawings
FIG. 1 is a block diagram of a Chinese generic concept map error correction apparatus according to an embodiment of the present invention;
FIG. 2 is a flowchart of a semi-supervised classification algorithm of the semi-supervised classification decision section in an embodiment of the present invention; and
FIG. 3 is a flowchart of a method for correcting errors of a Chinese generic concept graph according to an embodiment of the present invention.
Detailed Description
In order to make the technical means, creation features, achievement purposes and effects of the invention easy to understand, the following describes the Chinese general concept map error correction method of the invention in detail with reference to the embodiments and the accompanying drawings.
< example >
In this embodiment, the Chinese universal concept map error correction device 100 is a computer that stores a previously constructed concept graph and automatically corrects erroneous isA relationships in that concept graph.
Fig. 1 is a block diagram of a structure of a chinese generic concept map error correction apparatus in an embodiment of the present invention.
As shown in FIG. 1, the Chinese universal concept map error correction device 100 comprises a concept graph acquisition module 101, an incompatible concept pair construction module 102, an erroneous isA relationship determination module 103, a concept graph error correction module 104, and a control unit 105 for controlling the above modules.
The control unit 105 stores a computer program for controlling the operation of each component of the Chinese universal concept map error correction device 100.
The concept graph acquisition module 101 is used to acquire all isA relationships in the concept graph and the entity and concept uniquely corresponding to each isA relationship.
In this embodiment, the concept graph represents the hypernym-hyponym relation between an entity and a concept through the isA relationship (the hyponym of a concept is an entity, and the hypernym of an entity is a concept), and the entities, concepts, and isA relationships form a plurality of triples (e, isA, c).
Incompatible concept pair construction module 102 is configured to sequentially determine whether each two concepts are compatible and construct sets of incompatible concept pairs based on entities common to all of the incompatible two concepts and the corresponding isA relationships.
In this embodiment, each group of incompatible concept pairs comprises an entity serving as the suspicious entity $e$, two concepts to be judged $c_1$ and $c_2$, and the two corresponding isA relationships serving as the isA relationships to be corrected; that is, a group contains the two triples $(e, \text{isA}, c_1)$ and $(e, \text{isA}, c_2)$. Because the concepts to be judged $c_1$ and $c_2$ are incompatible, one of the two triples must contain an erroneous isA relationship. For example, for the pair of incompatible concepts "movie" and "company", the entity "Alibaba Pictures" belongs to both concepts, which indicates that one of its isA relationships must be wrong.
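The grouping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_suspicious_groups`, the tuple layout, and the sample data are assumptions introduced here, and the set of incompatible concept pairs is taken as given.

```python
from collections import defaultdict
from itertools import combinations


def build_suspicious_groups(triples, incompatible_pairs):
    """Return (entity, c1, c2) for each entity attached to both concepts
    of an incompatible pair. All names here are illustrative."""
    concepts_of = defaultdict(set)
    for entity, relation, concept in triples:
        if relation == "isA":
            concepts_of[entity].add(concept)

    # Store pairs as frozensets so that (c1, c2) and (c2, c1) match alike.
    incompatible = {frozenset(pair) for pair in incompatible_pairs}

    groups = []
    for entity, concepts in concepts_of.items():
        for c1, c2 in combinations(sorted(concepts), 2):
            if frozenset((c1, c2)) in incompatible:
                groups.append((entity, c1, c2))
    return groups


# Hypothetical sample data.
triples = [
    ("Alibaba Pictures", "isA", "company"),
    ("Alibaba Pictures", "isA", "movie"),
    ("apple", "isA", "fruit"),
]
groups = build_suspicious_groups(triples, [("movie", "company")])
print(groups)
```

Each returned tuple stands for one group: the suspicious entity plus the two concepts to be judged, from which the two isA relationships to be corrected follow directly.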
The purpose of constructing incompatible concept pairs is to locate suspicious hypernym-hyponym relations quickly. Because of its wide coverage, a general concept graph has tens of millions of isA relationships, and the first problem to consider is how to locate the erroneous hypernym-hyponym relations among tens of millions of records. In this embodiment, the incompatible concept pair construction module 102 therefore measures the compatibility of a concept pair by the F1 value of the MiniJaccard coefficient and the concept attribute distribution similarity. Specifically:
The hyponym entities shared by concepts $c_1$ and $c_2$ can characterize the similarity of the concept pair, which is measured by MiniJaccard:
[Equation image: MiniJaccard coefficient of $c_1$ and $c_2$]
where $|c_1|$ and $|c_2|$ denote the numbers of hyponym entities of concepts $c_1$ and $c_2$, respectively, and $|c_1 \cap c_2|$ denotes the number of hyponym entities shared by $c_1$ and $c_2$.
The entity attribute distributions of concepts $c_1$ and $c_2$ can also describe the similarity of the concept pair. The entity attribute distribution of a concept is the set of all attributes of its hyponym entities, and the concept attribute distribution similarity is expressed as $\mathrm{CPD}(c_1, c_2)$, namely:
[Equation image: concept attribute distribution similarity $\mathrm{CPD}(c_1, c_2)$]
where the vectors $x$ and $y$ are the attribute distributions of concepts $c_1$ and $c_2$.
The compatibility of concepts $c_1$ and $c_2$ is expressed as:
[Equation image: compatibility $P(c_1, c_2)$]
Therefore, after calculating the compatibility of each concept pair, the incompatible concept pair construction module 102 can judge whether the compatibility of concepts $c_1$ and $c_2$ is lower than a preset compatibility threshold, and then construct the corresponding groups of incompatible concept pairs from the incompatible concepts $c_1$ and $c_2$ whose compatibility is below the threshold.
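Because the three formulas survive only as image placeholders in this text, their exact forms are not recoverable here. The sketch below shows one plausible reading, under loudly stated assumptions: MiniJaccard is taken as the shared-entity count divided by the smaller concept size, CPD is taken as the cosine similarity of attribute-count vectors, and the compatibility is their F1 value (harmonic mean). Only the F1 combination is stated in the text; the other two forms, all function names, the sample data, and the threshold are illustrative.

```python
import math


def mini_jaccard(entities1, entities2):
    # ASSUMPTION: overlap normalized by the smaller concept.
    smaller = min(len(entities1), len(entities2))
    if smaller == 0:
        return 0.0
    return len(entities1 & entities2) / smaller


def cpd(x, y):
    # ASSUMPTION: cosine similarity between attribute-count vectors,
    # given as dicts mapping attribute name -> count.
    dot = sum(x[a] * y[a] for a in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def compatibility(mj, sim):
    # F1 value (harmonic mean) of the two similarity scores.
    if mj + sim == 0:
        return 0.0
    return 2 * mj * sim / (mj + sim)


# Hypothetical sample data.
movie_entities = {"Inception", "Titanic", "Alibaba Pictures"}
company_entities = {"Acme", "Globex", "Initech", "Alibaba Pictures"}
movie_attrs = {"director": 3, "name": 1}
company_attrs = {"founder": 3, "name": 1}

mj = mini_jaccard(movie_entities, company_entities)
sim = cpd(movie_attrs, company_attrs)
p = compatibility(mj, sim)

THRESHOLD = 0.2  # illustrative value; the source only says "preset"
print(p, p < THRESHOLD)
```

The harmonic mean stays low whenever either score is low, so under these assumptions a pair counts as compatible only if it shares many entities and has similar attribute distributions.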
The error isA relationship determination module 103 is configured to determine one error of the corresponding two isA relationships to be corrected based on the suspicious entities in each set of incompatible concept pairs and the concepts to be determined in sequence.
In this embodiment, the function of the erroneous isA relationship determination module 103 is to determine which hypernym-hyponym relation is erroneous; that is, for the entity "Alibaba Pictures", which belongs to both of the incompatible concepts "movie" and "company", the module must determine the erroneous isA relationship (Alibaba Pictures isA movie). To make the judgment of isA relationships comprehensive and accurate, this embodiment discovers erroneous hypernym-hyponym relations in two ways: encyclopedia update support and a semi-supervised classification algorithm based on key features. Correspondingly, the erroneous isA relationship determination module 103 includes an encyclopedia entry determination unit 31 and a semi-supervised classification determination unit 32.
The encyclopedia entry determination unit 31 identifies the erroneous isA relationship between a suspicious entity and its incompatible hypernyms by retrieving the updated entry labels of the suspicious entity's encyclopedia entry.
Each entity in the concept graph has an encyclopedia entry, which carries information such as a description, attributes, and entry labels. During the construction of a concept graph, for example CN-base, the hypernyms of an entity are extracted from the description, some of the attributes, and the entry labels. Sampling shows that most erroneous hypernym-hyponym relations originate from the entry labels; because the concept graph was constructed some time ago, the entry labels of some entities have since been manually revised and have become more accurate. The device therefore first checks whether the entry labels of an entity have been updated, and thereby identifies part of the erroneous hypernym-hyponym relations.
Specifically, the encyclopedia entry determination unit 31 acquires the entry label list of the encyclopedia entry of the suspicious entity (for example, by crawling the encyclopedia entry with a crawler), judges whether each of the two concepts to be judged belongs to the entry label list, and, if one of them does not, determines that the isA relationship to be corrected corresponding to that concept is erroneous. That is, for a suspicious entity $e$ and concepts to be judged $c_1$ and $c_2$, the encyclopedia entry determination unit 31 acquires the entry label list con_list of $e$; if $c_1$ belongs to con_list and $c_2$ does not, then $(e, \text{isA}, c_2)$ is an erroneous hypernym-hyponym relation and should be deleted, and vice versa.
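The entry-label check can be sketched in a few lines. The function name `judge_by_entry_labels` and the sample label list are hypothetical, and fetching con_list (for example with a crawler) is outside the scope of this sketch.

```python
def judge_by_entry_labels(entity, c1, c2, con_list):
    """Return the erroneous triple if exactly one of the two concepts is
    missing from the entity's current entry label list, else None."""
    in_c1 = c1 in con_list
    in_c2 = c2 in con_list
    if in_c1 and not in_c2:
        return (entity, "isA", c2)
    if in_c2 and not in_c1:
        return (entity, "isA", c1)
    # Both present or both absent: the labels cannot decide this case.
    return None


# Hypothetical updated label list for the suspicious entity.
con_list = ["company", "film industry"]
wrong = judge_by_entry_labels("Alibaba Pictures", "movie", "company", con_list)
print(wrong)
```

Entities for which the function returns None are the ones passed on to the semi-supervised classification step.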
In this embodiment, a small number of erroneous isA relationships can be determined quickly through the encyclopedia update support described above (encyclopedia entry determination unit 31). However, because the entry labels of most suspicious entities have not changed, the remaining cases are judged with the semi-supervised classification algorithm based on key features (semi-supervised classification determination unit 32).
In this embodiment, for a pair of incompatible concepts to be judged, such as "movie" and "company", 10,000 hyponym entities of each of the two concepts are extracted from the concept graph to form a training set D. The suspicious entities belonging to that group of incompatible concept pairs form the test set A to be judged.
Fig. 2 is a flowchart of a semi-supervised classification algorithm of the semi-supervised classification determination section in the embodiment of the present invention.
As shown in FIG. 2, the semi-supervised classification determination unit 32 uses the semi-supervised classification algorithm based on key features to determine, group by group, the erroneous isA relationship to be corrected in each group of incompatible concept pairs. It has a key feature filtering unit 32(a), a BERT classification determination unit 32(b), and an isA relationship determination unit 32(c).
The key feature filtering unit 32(a) filters suspicious entities based on key features corresponding to the concepts to be determined.
In this embodiment, the attributes of an entity are used as the features of a suspicious entity. A key feature of concept $c_1$ is a feature such that any entity having it must be a hyponym entity of $c_1$. For example, the concepts to be judged for the entity "Jones" are the incompatible concepts "character" and "game"; because the entity "Jones" has the attribute "date of birth", it must belong to "character" rather than "game". Key feature filtering can thus determine part of the erroneous isA relationships in test set A; to do so, the key features of the incompatible concept pair must first be constructed.
The key features are acquired as follows: the attributes of the hyponym entities of the two concepts to be judged are counted in a pre-acquired training set, and, for each concept, the most frequent hyponym entity attributes that are not shared by the two concepts are taken as the key features of that concept.
For example, for concepts $c_1$ and $c_2$, the hyponym entity attributes of the two concepts are counted in training set D, and, for each concept, the 10 most frequent attributes that are not shared by the two concepts are taken as its key features, c1_list and c2_list respectively. The key feature filtering unit 32(a) can then filter the suspicious entities by key feature: if a suspicious entity has a key feature of one concept to be judged and no key feature of the other, the concept to be judged whose key features the entity lacks is determined to be erroneous. That is, if entity $e$ has a key feature of $c_1$ and none of $c_2$, then $e$ belongs to $c_1$ and $(e, \text{isA}, c_2)$ is the erroneous hypernym-hyponym relation, and vice versa. Suspicious entities that have key features of both concepts, or of neither, are not filtered.
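The key feature construction and the filtering rule can be sketched as follows. This is an illustration under assumptions: "not shared by the two concepts" is read as "does not occur among the hyponym entities of the other concept at all", ties in frequency are broken by first occurrence, and the names `key_features` and `filter_by_key_features` plus the sample data are hypothetical.

```python
from collections import Counter


def key_features(attrs_c1, attrs_c2, k=10):
    """attrs_c1 / attrs_c2: one attribute set per hyponym entity in D.
    Returns the k most frequent non-shared attributes of each concept."""
    count1 = Counter(a for attrs in attrs_c1 for a in attrs)
    count2 = Counter(a for attrs in attrs_c2 for a in attrs)
    shared = count1.keys() & count2.keys()
    c1_list = [a for a, _ in count1.most_common() if a not in shared][:k]
    c2_list = [a for a, _ in count2.most_common() if a not in shared][:k]
    return set(c1_list), set(c2_list)


def filter_by_key_features(entity_attrs, c1, c2, c1_list, c2_list):
    """Return the erroneous concept, or None if the rule cannot decide."""
    has_c1 = bool(entity_attrs & c1_list)
    has_c2 = bool(entity_attrs & c2_list)
    if has_c1 and not has_c2:
        return c2
    if has_c2 and not has_c1:
        return c1
    return None  # key features of both concepts, or of neither


# Hypothetical training data for the concepts "character" and "game".
character_attrs = [{"date of birth", "nationality"}, {"date of birth", "name"}]
game_attrs = [{"developer", "platform"}, {"developer", "name"}]
c1_list, c2_list = key_features(character_attrs, game_attrs)

wrong_concept = filter_by_key_features(
    {"date of birth", "name"}, "character", "game", c1_list, c2_list
)
print(wrong_concept)
```

Returning None for undecided entities mirrors the rule that such entities are left for the BERT classifier.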
By means of the key feature filtering, 35% of suspicious entities in the test set A can be filtered, and the accuracy is 99%.
The BERT classification determining unit 32(b) stores a pretrained BERT classifier, and is configured to sequentially input the remaining suspicious entities into the BERT classifier, obtain probability distribution of a concept to which each suspicious entity belongs, and determine, based on the probability distribution of the concept to which the suspicious entity belongs, an incorrect one of two concepts to be determined corresponding to the suspicious entity.
For the remaining 65% of the suspicious entities in test set A, the BERT classifier is used to classify the entity descriptions. An entity description is a brief introduction to the entity. This embodiment does not apply BERT classification directly to all suspicious entities of the test set, for the following reason: most erroneous hypernyms arise because the hypernym extracted for an entity is often a concept that is merely related to the entity. For example, the entity "Alibaba Pictures" is related to the concept "movie" but is not a movie, and "Jones" is related to the concept "game" but is not a game. If a text classifier were trained only on the training data in the concept graph, it could not effectively tell which concept such an entity belongs to, because the training data lacks entities related to both concepts; that is, the training data and the test data are not identically distributed, so the classifier performs poorly on the test set.
Therefore, the semi-supervised classification determination unit 32 adopts a semi-supervised classification algorithm based on key features. In this embodiment, the key feature filtering unit 32(a) uses rule-based filtering on key features to determine the concepts of part of the suspicious entities, i.e., it assigns labels (pseudo labels) to part of the test set. These pseudo-labeled entities can then be added to training set D to form training set D′, and the BERT classifier is trained on D′.
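The construction of D′ from D and the pseudo-labeled part of test set A can be sketched as below. The function name, the `rule_label` callback, and the three-field sample layout (entity, concept, is_pseudo) are assumptions made for illustration; the flag is kept so that the pseudo-labeled samples can later be weighted separately.

```python
def build_augmented_training_set(train_d, test_a, rule_label):
    """train_d: list of (entity, concept) pairs from the concept graph.
    test_a: list of suspicious entities.
    rule_label: callable returning a concept for an entity, or None when
    the key feature rule cannot decide (hypothetical interface)."""
    pseudo_labeled, remaining = [], []
    for entity in test_a:
        label = rule_label(entity)
        if label is None:
            remaining.append(entity)
        else:
            pseudo_labeled.append((entity, label, True))
    train_d_prime = [(e, c, False) for e, c in train_d] + pseudo_labeled
    return train_d_prime, remaining


# Hypothetical sample data.
train_d = [("Inception", "movie"), ("Acme", "company")]
test_a = ["Alibaba Pictures", "Unknown Studio"]
rules = {"Alibaba Pictures": "company"}

train_d_prime, remaining = build_augmented_training_set(train_d, test_a, rules.get)
print(len(train_d_prime), remaining)
```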
As a pre-trained language model, BERT has strong semantic representation capability. BERT training has two stages: pre-training and fine-tuning. In the pre-training stage, BERT learns deep language representations from a large-scale corpus through unsupervised prediction tasks; in the fine-tuning stage, BERT starts from the pre-trained parameters and is fine-tuned on a specific task, which makes it suitable for downstream tasks such as classification, matching, and extraction.
In this embodiment, BERT adopts a Transformer bidirectional encoder architecture in which multiple Transformer blocks are stacked to extract deep relationships between the tokens of a sequence; within each Transformer block, a multi-head attention mechanism strengthens the semantic associations between tokens, and the output of the Transformer layer is obtained after a feed-forward network layer. The pre-training stage of BERT has two tasks: the masked language model (Masked LM) and next sentence prediction (NSP). To train a deep bidirectional representation, BERT randomly masks part of the input tokens and lets the model predict them as the learning target of the Masked LM task. Specifically, BERT randomly selects 15% of the tokens in each sequence; a selected token is replaced by [MASK] 80% of the time, replaced by a random token 10% of the time, and left unchanged the remaining 10% of the time. The input is encoded by the Transformer, and the encoded representation of the entire sequence is used to predict the masked tokens.
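The 15% / 80-10-10 masking scheme can be illustrated with a small standalone sketch. It is a simplification: each token is selected independently with probability 0.15 rather than selecting an exact 15% of positions, special tokens are not handled, and the function name, vocabulary, and seed are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, targets); targets maps each selected
    position to the original token the model should predict."""
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets


# Hypothetical toy sequence and vocabulary.
tokens = ["the", "apple", "is", "a", "fruit"] * 20
vocab = ["the", "apple", "is", "a", "fruit", "movie"]
masked, targets = mask_tokens(tokens, vocab, random.Random(0))
print(len(targets), "positions selected")
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing [MASK] at every position it has to predict.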
The next sentence prediction task first generates pairs of adjacent or non-adjacent sentences from a corpus and then performs binary classification on each sentence pair, judging whether sentence 2 is the next sentence of sentence 1; this lets BERT learn the relationship between two sentences. A negative (non-next-sentence) training sample is constructed by randomly sampling one sentence from each of two different documents. Through the two pre-training tasks, masked language modeling and next sentence prediction, BERT learns semantic representations of natural language, which facilitates fine-tuning on downstream tasks.
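Sentence-pair construction for next sentence prediction can be sketched as follows. The function name, the one-negative-per-positive ratio, and the toy documents are assumptions for illustration; the text above only specifies that a negative pair takes one sentence from each of two different documents.

```python
import random


def make_nsp_pairs(documents, rng):
    """documents: list of documents, each a list of sentences.
    Returns (sentence_1, sentence_2, label) with label 1 for a true
    next-sentence pair and 0 for a negative pair."""
    pairs = []
    for doc_index, doc in enumerate(documents):
        others = [d for j, d in enumerate(documents) if j != doc_index and d]
        for i in range(len(doc) - 1):
            # Positive sample: two adjacent sentences of the same document.
            pairs.append((doc[i], doc[i + 1], 1))
            if others:
                # Negative sample: one sentence from each of two different documents.
                first = rng.choice(doc)
                second = rng.choice(rng.choice(others))
                pairs.append((first, second, 0))
    return pairs


# Hypothetical toy corpus; the first letter marks the source document.
documents = [["a1", "a2", "a3"], ["b1", "b2"]]
pairs = make_nsp_pairs(documents, random.Random(0))
print(len(pairs))
```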
In this embodiment, the text classification capability of fine-tuned BERT is used to determine the erroneous hypernym-hyponym relations: the input of BERT is the description of an entity, and the output is a probability distribution over the hypernyms to which the entity may belong. Cross entropy is used as the loss function of the classification task, and the Adam optimizer is used for training. Specifically, because the entities of test set A are more similar to those of test set B, the loss on test set A within training set D′ is given a higher weight when the BERT classifier is trained, so that the classifier better fits the description-based classification of entities like those in test set A. The loss function on training set D′ is:
$L_{D'} = L_D + \lambda L_A$
where λ is a hyperparameter that balances the training set D against the pseudo-labeled test set A. Experiments show that the accuracy on test set B is 87.6% when λ = 1 and 94.4% when λ = 7.
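As a non-limiting illustration, the weighted loss on training set D′, i.e. the loss on D plus λ times the loss on the pseudo-labeled set A, can be sketched in plain Python. The function names and the use of a summed (rather than averaged) cross entropy per set are assumptions made for this sketch; an actual implementation would compute the same quantity inside a deep learning framework.

```python
import math

def cross_entropy(probs, label):
    """Cross entropy of one sample: -log of the probability of the true class."""
    return -math.log(probs[label])

def combined_loss(batch_d, batch_a, lam):
    """Loss on D' = loss on D + lam * loss on A.

    batch_d, batch_a: lists of (predicted_probabilities, true_label_index).
    Summing over samples is an assumption of this sketch.
    """
    loss_d = sum(cross_entropy(p, y) for p, y in batch_d)
    loss_a = sum(cross_entropy(p, y) for p, y in batch_a)
    return loss_d + lam * loss_a
```

With a larger λ, errors on the pseudo-labeled samples of A contribute more to the gradient than errors on the original samples of D.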
After the training process is completed, the BERT classification determination unit 32(b) processes the remaining suspicious entities with the trained BERT classifier and determines which of the two to-be-determined concepts corresponding to each suspicious entity is erroneous.
The isA relationship determination unit 32(c) determines that the corresponding isA relationship to be corrected is erroneous, based on the erroneous to-be-determined concepts identified for the suspicious entities by the key feature filtering unit 32(a) and the BERT classification determination unit 32(b).
The concept map error correction module 104 is configured to delete from the concept map each isA relationship to be corrected that the error isA relationship determination module 103 has determined to be erroneous, thereby completing error correction of the concept map.
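As a non-limiting illustration, the decision made from the classifier output can be sketched as follows. The rule used here, namely that the to-be-determined concept with the lower predicted probability is taken as the erroneous one, is an assumption of this sketch; the embodiment only states that the decision is based on the probability distribution. The function name and the example concept names are hypothetical.

```python
def pick_wrong_concept(prob_by_concept, concept_a, concept_b):
    """Return the concept judged erroneous, or None if no decision is possible.

    prob_by_concept: mapping from concept name to the probability output by
    the classifier for one suspicious entity.
    """
    p_a = prob_by_concept.get(concept_a, 0.0)
    p_b = prob_by_concept.get(concept_b, 0.0)
    if p_a == p_b:
        return None  # tie: leave both isA relationships untouched
    # Assumed rule: the less probable hypernym is the erroneous one.
    return concept_a if p_a < p_b else concept_b
```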
FIG. 3 is a flowchart of a method for correcting errors of a Chinese generic concept graph according to an embodiment of the present invention.
As shown in FIG. 3, when a user starts the Chinese generic concept map error correction apparatus 100 and performs error correction processing on a stored (or input) concept map, the specific error correction process is as follows:
Step S1: the concept graph obtaining module 101 obtains all isA relationships and the corresponding entities and concepts in the concept graph, and then the process proceeds to step S2;
Step S2: the incompatible concept pair construction module 102 calculates the compatibility between every two concepts in turn and constructs incompatible concept pairs from each two concepts judged to be incompatible, forming a plurality of groups of incompatible concept pairs, and then the process proceeds to step S3;
Step S3: the error isA relationship determination module 103 determines, for each group of incompatible concept pairs in turn, which to-be-determined concept is erroneous, determines that the corresponding isA relationship to be corrected is erroneous, and then the process proceeds to step S4;
Step S4: the concept map error correction module 104 deletes from the concept map the isA relationships to be corrected that have been determined to be erroneous, completing error correction of the concept map, and the process then enters the end state.
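As a non-limiting illustration, steps S1 to S4 can be sketched as a single Python function. The function name, the representation of the concept graph as a list of (entity, concept) pairs, and the two caller-supplied callables `compatibility` and `judge_wrong` are assumptions of this sketch; the compatibility measure and the determination logic themselves are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def correct_concept_graph(isa_pairs, compatibility, threshold, judge_wrong):
    """Return isa_pairs with the isA relationships judged erroneous removed.

    isa_pairs: list of (entity, concept) tuples.
    compatibility(c1, c2): caller-supplied score (hypothetical interface).
    judge_wrong(entity, c1, c2): returns the erroneous concept or None.
    """
    # S1: collect the lower entities of every concept.
    entities_of = defaultdict(set)
    for entity, concept in isa_pairs:
        entities_of[concept].add(entity)
    # S2: build one incompatible concept pair per entity shared by two
    # concepts whose compatibility is below the threshold.
    suspicious = []
    for c1, c2 in combinations(sorted(entities_of), 2):
        shared = entities_of[c1] & entities_of[c2]
        if shared and compatibility(c1, c2) < threshold:
            suspicious.extend((e, c1, c2) for e in sorted(shared))
    # S3: decide which of the two isA relationships is erroneous.
    wrong = set()
    for entity, c1, c2 in suspicious:
        bad = judge_wrong(entity, c1, c2)
        if bad is not None:
            wrong.add((entity, bad))
    # S4: delete the erroneous isA relationships.
    return [pair for pair in isa_pairs if pair not in wrong]
```

Passing the two judgments in as callables keeps the control flow of the four steps separate from the details of how compatibility and erroneous concepts are computed.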
Through the above process, all erroneous isA relationships in the concept graph can be screened out and corrected, and the resulting high-accuracy concept graph can be stored in a computer so that other programs or users can conveniently call it.
Effects of the embodiment
According to the Chinese universal concept map error correction device of this embodiment, the incompatible concept pair construction module judges the compatibility between the concepts in the concept map and constructs the corresponding groups of incompatible concept pairs, so that all suspicious hypernym-hyponym relationships in the concept map can be located quickly and the error isA relationship determination module can examine each group of incompatible concept pairs in turn to determine the erroneous isA relationship. On the one hand, the encyclopedia entry determination unit determines the erroneous isA relationship within an incompatible concept pair by retrieving the up-to-date entry tags of the suspicious entity's encyclopedia entry; this method is simple and efficient and quickly identifies a small portion of the erroneous isA relationships. On the other hand, the semi-supervised classification determination section constructs key features of the concepts to identify the hypernym concepts of some of the suspicious entities, and identifies the remaining suspicious entities with the BERT classifier, so that the erroneous isA relationships among incompatible concepts are determined accurately. The device can therefore screen and correct the erroneous incompatible concept pairs in the concept map and eliminate the erroneous isA relationships, yielding a concept map of high accuracy that other users or systems can subsequently call effectively.
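As a non-limiting illustration, the encyclopedia entry tag check can be sketched as follows. The function name and the handling of the cases in which both or neither of the concepts appear in the tag list (no decision, leaving the pair to the semi-supervised classification) are assumptions of this sketch; retrieval of the tag list itself is not shown.

```python
def judge_by_entry_tags(entry_tags, concept_a, concept_b):
    """Return the concept whose isA relationship is judged erroneous, or None.

    entry_tags: collection of tags taken from the suspicious entity's
    encyclopedia entry (how the tags are fetched is outside this sketch).
    """
    in_a = concept_a in entry_tags
    in_b = concept_b in entry_tags
    if in_a and not in_b:
        return concept_b  # only concept_a is confirmed by the tag list
    if in_b and not in_a:
        return concept_a  # only concept_b is confirmed by the tag list
    return None  # both or neither present: no decision at this stage
```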
In addition, in this embodiment, after the semi-supervised classification determination section identifies the hypernym concept of a suspicious entity by means of the key features, that hypernym concept can be added to the training set as a pseudo label of the suspicious entity and given a higher loss weight. This makes the distributions of the training samples and the test samples more consistent, so that the trained BERT classifier achieves higher accuracy on the test set. With the key-feature-based semi-supervised classification algorithm, the accuracy of the erroneous hypernyms identified by combining the key features and the BERT classifier is 96.1%.
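As a non-limiting illustration, key feature extraction and key feature filtering can be sketched as follows, using the definition given in claim 2 (the n most frequent lower entity attributes not shared by the two concepts). The function names, the data layout, and the reading of "not shared" as "does not occur at all under the other concept" are assumptions of this sketch.

```python
from collections import Counter

def key_features(attrs_a, attrs_b, n):
    """Return (features_a, features_b): top-n unshared attributes per concept.

    attrs_a, attrs_b: for each concept, a list of attribute lists, one per
    lower entity in the training set.
    """
    count_a = Counter(a for attrs in attrs_a for a in attrs)
    count_b = Counter(a for attrs in attrs_b for a in attrs)
    shared = set(count_a) & set(count_b)

    def top(counter):
        # Most frequent attributes first, skipping attributes seen under both.
        return [a for a, _ in counter.most_common() if a not in shared][:n]

    return top(count_a), top(count_b)

def judge_by_key_features(entity_attrs, feats_a, feats_b, concept_a, concept_b):
    """Return the erroneous concept, or None if key features do not decide."""
    has_a = any(f in entity_attrs for f in feats_a)
    has_b = any(f in entity_attrs for f in feats_b)
    if has_a and not has_b:
        return concept_b
    if has_b and not has_a:
        return concept_a
    return None
```

An entity decided in this way keeps the other concept as its hypernym, and that hypernym is what the embodiment adds to the training set as a pseudo label.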
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.

Claims (5)

1. A Chinese universal concept map error correction device for correcting isA relationships between entities and concepts in a Chinese universal concept map, characterized by comprising:
a concept graph obtaining module, configured to obtain all the isA relationships in the concept graph and the entity and the concept that each isA relationship uniquely corresponds to;
an incompatible concept pair construction module which sequentially judges whether each two concepts are compatible and constructs a plurality of groups of incompatible concept pairs based on the entities shared by all the incompatible two concepts and the corresponding isA relations, wherein each group of incompatible concept pairs comprises one entity serving as a suspicious entity, two concepts serving as concepts to be judged and two corresponding isA relations serving as isA relations to be corrected;
an error isA relationship determination module, configured to determine, based on the suspicious entities in each set of incompatible concept pairs and the to-be-determined concept, an error in one of the two corresponding isA relationships to be corrected; and
a concept map error correction module for deleting the isA relationship to be error-corrected determined to be erroneous in the concept map so as to complete error correction of the concept map,
wherein the error isA relationship determination module has:
an encyclopedic entry determination unit which acquires an entry tag list of encyclopedic entries of the suspicious entity, determines whether the two to-be-determined concepts belong to the entry tag list, and further determines that the to-be-corrected isA relationship corresponding to the to-be-determined concept which does not belong to the entry tag list is incorrect if one of the two to-be-determined concepts does not belong to the entry tag list; and
a semi-supervised classification determining section which filters the suspicious entities based on the key features corresponding to the to-be-determined concepts, and processes the remaining suspicious entities based on a pre-trained BERT classifier, so as to determine the erroneous to-be-determined concept in each of the incompatible concept pairs, and further determines that the isA relationship to be corrected corresponding to the erroneous to-be-determined concept is erroneous.
2. The Chinese generic concept map error correction apparatus according to claim 1, wherein:
the semi-supervised classification determining section includes:
a key feature filtering unit configured to filter the suspicious entity based on key features corresponding to the to-be-determined concept, and if the suspicious entity has the key feature of one of the to-be-determined concepts and does not have the key feature of another of the to-be-determined concepts, determine that the to-be-determined concept corresponding to the key feature that the suspicious entity does not have is erroneous,
the key features corresponding to each concept to be determined are obtained in advance, and the method for obtaining the key features comprises the following steps:
counting the lower entity attributes corresponding to the two to-be-determined concepts in a pre-acquired training set, and respectively taking the n lower entity attributes with the highest frequency and not shared by the two to-be-determined concepts as the key features of the corresponding to-be-determined concepts.
3. The Chinese generic concept map error correction apparatus according to claim 1, wherein:
the semi-supervised classification determining section includes:
a BERT classification determination unit which stores a pre-trained BERT classifier and is configured to input the remaining suspicious entities into the BERT classifier in sequence, obtain the probability distribution over the concepts to which each suspicious entity belongs, and determine, based on that probability distribution, which of the two to-be-determined concepts corresponding to the suspicious entity is erroneous.
4. The Chinese generic concept map error correction device according to claim 3, wherein:
the BERT classifier adopts a bidirectional Transformer encoder architecture in which multiple Transformer blocks are stacked to extract deep relationships between the tokens of a sequence, the semantic association between tokens is strengthened in each Transformer block by a multi-head attention mechanism, and the output of the Transformer layer is obtained after a feed-forward network layer.
5. The Chinese generic concept map error correction apparatus according to claim 1, wherein:
the incompatible concept pair construction module constructs the incompatible concept pairs by using a MiniJaccard coefficient and a concept attribute distribution similarity, the MiniJaccard coefficient being:
Figure FDA0002454807590000031
wherein $|c_1|$ and $|c_2|$ respectively denote the numbers of lower entities of concepts $c_1$ and $c_2$, and $|c_1 \cap c_2|$ denotes the number of lower entities shared by concepts $c_1$ and $c_2$,
the concept attribute distribution similarity $CPD(c_1, c_2)$ is:
Figure FDA0002454807590000032
wherein the vectors $x$ and $y$ are the attribute distributions of concepts $c_1$ and $c_2$, respectively,
the compatibility of the concepts $c_1$ and $c_2$ is expressed as:
Figure FDA0002454807590000041
if the compatibility $P(c_1, c_2)$ is lower than a preset compatibility threshold, the incompatible concept pair construction module constructs the corresponding incompatible concept pair based on the concepts $c_1$ and $c_2$.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010303271.XA CN113535967B (en) 2020-04-17 2020-04-17 Chinese universal concept map error correction device

Publications (2)

Publication Number Publication Date
CN113535967A CN113535967A (en) 2021-10-22
CN113535967B true CN113535967B (en) 2022-02-22

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869049B (en) * 2021-12-03 2022-03-04 北京大学 Fact extraction method and device with legal attribute based on legal consultation problem

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808525A (en) * 2016-03-29 2016-07-27 国家计算机网络与信息安全管理中心 Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN109086356A (en) * 2018-07-18 2018-12-25 哈尔滨工业大学 The incorrect link relationship diagnosis of extensive knowledge mapping and modification method
CN110472107A (en) * 2019-08-22 2019-11-19 腾讯科技(深圳)有限公司 Multi-modal knowledge mapping construction method, device, server and storage medium
CN110704634A (en) * 2019-09-06 2020-01-17 平安科技(深圳)有限公司 Method and device for checking and repairing knowledge graph link errors and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762375B2 (en) * 2010-04-15 2014-06-24 Palo Alto Research Center Incorporated Method for calculating entity similarities

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Graph-Based Wrong IsA Relation Detection in a Large-Scale Lexical Taxonomy; Jiaqing Liang, Yanghua Xiao; Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17); 2017-02-12; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant