CN109783650B - Chinese network encyclopedia knowledge denoising method, system and knowledge base - Google Patents

Info

Publication number
CN109783650B
Authority
CN
China
Prior art keywords
similarity
knowledge
triple
encyclopedia
triples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910024995.8A
Other languages
Chinese (zh)
Other versions
CN109783650A (en
Inventor
王汀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Original Assignee
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS filed Critical CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority to CN201910024995.8A priority Critical patent/CN109783650B/en
Publication of CN109783650A publication Critical patent/CN109783650A/en
Application granted granted Critical
Publication of CN109783650B publication Critical patent/CN109783650B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a method and a system for denoising Chinese network encyclopedia knowledge, and to a knowledge base, and belongs to the technical field of computers. The method fuses edit distance with the synonym forest method and uses the entry tags of Chinese encyclopedias to construct a data field over Infobox knowledge triples, with which massive numbers of knowledge triples are denoised. The aim is to reduce, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases, and to solve prior-art problems in constructing Chinese knowledge bases such as semantic repetition, synonymy and improper classification of knowledge triples.

Description

Chinese network encyclopedia knowledge denoising method, system and knowledge base
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a Chinese network encyclopedia knowledge denoising method, a Chinese network encyclopedia knowledge denoising system and a knowledge base.
Background
With the development and application of computer networks, the mobile internet and related technologies, the internet has gradually become the main platform on which people publish, exchange and share information. Information lookup, knowledge acquisition and skill learning are moving from offline to online, storage is shifting from plain text to semi-structured and structured formats, new information carriers such as online encyclopedia websites are developing rapidly, and the volume of stored data is growing quickly. Knowledge bases, as collections that store, organize and process knowledge and provide knowledge services, are becoming the foundation on which many industries build knowledge management and knowledge services.
However, Chinese network encyclopedia knowledge bases suffer from low efficiency and accuracy, because a large number of repeated and synonymous concepts exist among Chinese word concepts and knowledge is improperly classified.
Disclosure of Invention
To solve prior-art problems such as semantic repetition and improper classification of knowledge triples in Chinese network encyclopedia knowledge bases, the invention provides a Chinese network encyclopedia knowledge denoising method, a Chinese network encyclopedia knowledge denoising system and a knowledge base, which offer, among other advantages, high precision.
The invention provides the following technical scheme:
in one aspect, a method for denoising knowledge of Chinese network encyclopedia comprises the following steps:
collecting original data in the open encyclopedia resources of the Chinese network;
crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for the vocabulary entry to which a preset concept belongs based on the original data;
crawling entry labels of the Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;
calculating the initial similarity of the Infobox knowledge triple;
adding semantic distance to the Infobox knowledge triple label, and acquiring the Infobox knowledge triple target similarity from the initial similarity through a data field according to a preset method;
and carrying out knowledge denoising according to the similarity of the Infobox knowledge triple target.
Further optionally, after crawling and parsing the Infobox knowledge triples on the entry web page for the entry belonging to the preset concept based on the original data, the method further includes: each top-level large class contains a subclass ontology concept, and the subclass ontology concept contains the corresponding triple.
Further optionally, the method further comprises: screening all triples of the subclass concepts according to semantic relationships and marking each with the label Y1 or N.
Further optionally, the crawling and parsing of the Infobox knowledge triples on the entry web page for the entry to which the preset concept belongs based on the original data includes: using a crawler tool to crawl and parse the structured encyclopedia information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia.
Further optionally, the calculating the initial similarity of the Infobox knowledge triples includes: calculating a first initial similarity of the triples based on the edit distance;
calculating a second initial similarity of the triples based on the synonym forest;
and performing complementary fusion on the first initial similarity and the second initial similarity according to a preset mode to obtain the initial similarity.
Further optionally, the adding a semantic distance to the Infobox knowledge triple tag includes: and performing semantic distance calculation by traversing the Chinese encyclopedia classification tree.
Further optionally, the method further comprises: and introducing a pseudo-nuclear force field potential function with improved tag semantic distance.
Further optionally, the performing knowledge denoising according to the Infobox knowledge triple target similarity includes: sorting both the original document and the document processed by the improved data field algorithm in descending order of similarity, acquiring a preset number of original data, and performing knowledge denoising.
In another aspect, a system for denoising knowledge of chinese network encyclopedia is provided, where the system includes: the device comprises a collection module, an acquisition module, a calculation module and a knowledge denoising module.
The collection module is used for collecting original data in the open encyclopedia resources of the Chinese network;
the acquisition module comprises a first acquisition unit and a second acquisition unit; the first acquisition unit is used for crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for a vocabulary entry to which a preset concept belongs based on the original data; the second obtaining unit is used for crawling entry labels of the Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;
the computing module comprises a first computing unit and a second computing unit; the first computing unit is used for computing the initial similarity of the Infobox knowledge triple; the second computing unit is used for adding semantic distance to the Infobox knowledge triple label and acquiring Infobox knowledge triple target similarity through a data field according to the initial similarity and a preset method;
and the knowledge denoising module is used for denoising the knowledge according to the similarity of the Infobox knowledge triple target.
In another aspect, a knowledge base is constructed by applying any one of the above-mentioned denoising methods for the knowledge base.
The method, the system and the knowledge base for denoising Chinese network encyclopedia knowledge provided by the embodiments of the invention fuse edit distance with the synonym forest method and construct a knowledge triple data field by means of entry tags in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases and solves prior-art problems in constructing Chinese knowledge bases, such as semantic repetition, synonymy and improper classification of entry Infobox knowledge triples.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a denoising method for network encyclopedia knowledge in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a network encyclopedia knowledge denoising system in an embodiment of the present invention;
FIG. 3 is a bar chart comparing, for each subclass, the number of N-labeled triples in the last 41% in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;
FIG. 4 is a line chart comparing, for each subclass, the number of N-labeled triples in the last 41% in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;
FIG. 5 is a bar chart comparing the P values after the last 41% is deleted in the two stages in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;
FIG. 6 is a line chart comparing the P values after the last 41% is deleted in the two stages in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
In order to more clearly illustrate the process and advantages of the method of the embodiment of the invention, the invention provides a Chinese network encyclopedia knowledge denoising method, which comprises the following steps:
collecting original data in the open encyclopedia resources of the Chinese network;
crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for a vocabulary entry to which a preset concept belongs based on original data;
crawling entry labels of Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;
calculating initial similarity of Infobox knowledge triples;
adding semantic distance to the Infobox knowledge triple labels, and acquiring target similarity of the Infobox knowledge triples through a data field according to a preset method according to the initial similarity;
and carrying out knowledge denoising according to the similarity of the Infobox knowledge triple target.
The Chinese network encyclopedia knowledge denoising method provided by the embodiment of the invention fuses edit distance with the synonym forest method and constructs a knowledge triple data field by means of entry tags in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases and solves prior-art problems in constructing Chinese knowledge bases, such as semantic repetition, synonymy and improper classification of knowledge triples.
Based on the method for constructing the Chinese knowledge base, the embodiment of the invention provides an optional embodiment: fig. 1 is a flowchart of a network encyclopedia knowledge denoising method in an embodiment of the present invention. Referring to fig. 1, the method for denoising the chinese network encyclopedia knowledge in this embodiment may include the following steps:
S11, collecting original data from Chinese network open encyclopedia resources.
Specifically, Chinese network encyclopedia resources are selected as raw data sources, for example the entry web pages of Baidu Encyclopedia and Interactive Encyclopedia, and raw data are collected from them.
S12, crawling and parsing, based on the original data, the Infobox knowledge triples on the entry web pages of the entries to which a preset concept belongs.
Specifically, a crawler tool is used to crawl and parse the Infobox structured information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia. The information is organized as Chinese triples (Triple), forming a large-scale Chinese open-domain knowledge base that is still to be refined.
After crawling and analyzing the Infobox knowledge triples on the vocabulary entry web page for the vocabulary entry to which the preset concept belongs based on the original data, the method further comprises the following steps: organizing each top-level large class into an ontology concept hierarchical relationship containing a subclass concept, and then organizing the subclass ontology concept to contain the corresponding triple.
S13, crawling the entry tags of the Infobox knowledge triples contained in the preset subclasses, and adding the tags to the Infobox knowledge triples.
Specifically, starting from the instance triples of the subclass concepts, the tags of the related entries are crawled from the Chinese network encyclopedias, added to the triples of each entry in the corresponding triple document, and marked in an agreed format. In this embodiment, the Baidu Encyclopedia classification tree has a three-layer ontology concept structure. The top-level classes selected from the ontology concept sets of the Chinese network encyclopedias (namely Baidu Encyclopedia and Interactive Encyclopedia) are: geography, economy, science, history, personage, society, life, sports, culture, art and nature.
All triples of the subclass concepts are first screened according to semantic relationships and marked with the label Y1 or N. In this embodiment the labeling is done manually, on the following principle: determine whether the triple semantically matches the attributes of the subclass concept to which it belongs; if so, label it Y1, otherwise label it N. Manual labeling is only an example and does not limit the labeling method.
S14, calculating the initial similarity of the Infobox knowledge triples.
In the invention, a triple is defined as consisting of a subject, a predicate and an object, and a knowledge triple is written <S, P, O>, where S is the subject, P the predicate and O the object. The initial similarity is calculated between a Baidu Encyclopedia subclass triple document and the Interactive Encyclopedia top-level class triple document that corresponds to it. For example, to calculate the initial similarity for the Baidu Encyclopedia subclass document "currency", its triples are compared with the Interactive Encyclopedia document of the top-level class "economy", because the subclass "currency" belongs to the top-level class "economy" in Baidu Encyclopedia. Each triple is therefore mapped against all triples in the Interactive Encyclopedia "economy" document, and the resulting similarities are accumulated.
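The pairing of a subclass document with its top-level class document can be sketched as below. This is only an illustration: the names `Triple`, `SUBCLASS_TO_TOP_CLASS` and `comparison_document` are hypothetical, and the sample triple is invented for the example; only the "currency" to "economy" relation comes from the text.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """A knowledge triple <S, P, O>."""
    subject: str
    predicate: str
    obj: str


# Hypothetical lookup from a subclass to its top-level class.
SUBCLASS_TO_TOP_CLASS: Dict[str, str] = {"currency": "economy"}


def comparison_document(subclass: str,
                        top_class_docs: Dict[str, List[Triple]]) -> List[Triple]:
    """Return the top-level class document that a subclass document is compared with."""
    return top_class_docs[SUBCLASS_TO_TOP_CLASS[subclass]]


# Illustrative data only.
docs = {"economy": [Triple("RMB", "Chinese name", "Renminbi")]}
target = comparison_document("currency", docs)
```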
Specifically, the initial similarity is calculated as follows: a first initial similarity is calculated based on edit distance; a second initial similarity is calculated based on the synonym forest; and the first and second initial similarities are complementarily fused in a preset manner to obtain the initial similarity.
For the initial similarity calculation, a method with low resource requirements and high efficiency is preferred, so a similarity calculation based on edit distance is adopted. The edit distance algorithm SIM_E yields the literal similarity between triples while ignoring semantic relevance.
Let b_i = <b_is, b_ip, b_io> be any triple in a subclass document B of Baidu Encyclopedia, and let h_j = <h_js, h_jp, h_jo> be any triple in the Interactive Encyclopedia top-level class document H corresponding to B. The edit distance similarity between the subjects of b_i and h_j is then given by formula (1); the edit distance similarity between predicates and between objects is obtained in the same way.
[Formula (1): Figure GDA0002745565490000071]
where |Step(b_is, h_js)| is the number of editing operations needed to make b_is and h_js equal, and len(b_is) and len(h_js) are the lengths, in characters, of the words b_is and h_js.
This yields the first initial similarity of the Baidu Encyclopedia knowledge triples.
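An edit-distance similarity of this kind can be sketched as follows. The image of formula (1) is not reproduced in this text, so the normalization used here (one minus the number of edit steps divided by the longer length) is an assumption and not necessarily the patented formula; `edit_steps` and `sim_e` are hypothetical names.

```python
def edit_steps(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sim_e(a: str, b: str) -> float:
    """Literal similarity in [0, 1]; ASSUMED normalization by the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_steps(a, b) / longest
```

For instance, `sim_e("中文名", "中文名称")` needs one insertion over a length of four characters, giving 0.75 under this assumed normalization.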
The synonym forest (Tongyici Cilin) is a Chinese synonym dictionary compiled by Mei Jiaju et al. in 1983, originally intended to provide a richer set of synonyms as an aid to writing and translation. The dictionary includes not only the synonyms of a word but also a certain number of similar words, that is, related words in a broad sense. It encodes each word and organizes the vocabulary hierarchically in a tree with five layers from top to bottom. Each layer has its own sub-code, and the five sub-codes arranged from left to right form the word forest code of an entry. Each node in the tree represents a concept, and the semantic relevance between entries increases with the depth of the layer. Recognizing concept coreference in Chinese can in practice be abstracted as a problem of recognizing Chinese synonyms. The invention actually uses an extended version, namely the synonym forest (extended version) of Harbin Institute of Technology, as the dictionary for the second initial similarity calculation.
According to the structure of the word forest, the word forest codes of the subject, predicate and object of a triple are first parsed and the first- to fifth-layer sub-codes extracted; comparison then starts from the first-layer sub-code. When the sub-codes differ, the mapping pair is given a similarity weight according to the layer at which the difference occurs: the deeper the layer at which the sub-codes first differ, the higher the similarity weight, and vice versa. The number of branch nodes in each layer also affects the similarity. This embodiment adopts an improved similarity formula; taking the similarity between the subjects of two triples as an example, it is given by formula (2), and the similarity between predicates and between objects is obtained in the same way.
[Formula (2): Figure GDA0002745565490000072]
where λ is an adjustable semantic relevance factor that controls the possible degree of similarity between entries on branches at different layers, with λ ∈ (0, 1); L = {1, 2, 3, 4, 5}; for
Figure GDA0002745565490000073
L_n is the layer number of the n-th layer; |L| is the number of elements in the set L, which is 5 in this system; N_T is the total number of nodes on the n-th layer branch of the words b_is and h_js; and D is the coding distance between b_is and h_js.
This yields the second initial similarity of the Baidu Encyclopedia knowledge triples.
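The layer-by-layer code comparison can be sketched as below. Several things here are assumptions: the code layout (one letter, one letter, two digits, one letter, two digits, then a flag character, as in codes such as "Aa01A01="), the example codes, and the `LEVEL_WEIGHTS` table. Formula (2) also involves λ, N_T and D, which this simplified sketch leaves out, so it illustrates only the "deeper first difference means higher similarity" idea.

```python
from typing import List


def split_cilin_code(code: str) -> List[str]:
    """Split a word forest code into its five layer sub-codes (ASSUMED layout)."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def first_differing_layer(code_a: str, code_b: str) -> int:
    """Return 1..5 for the first layer whose sub-codes differ, or 6 if all match."""
    pairs = zip(split_cilin_code(code_a), split_cilin_code(code_b))
    for layer, (sub_a, sub_b) in enumerate(pairs, start=1):
        if sub_a != sub_b:
            return layer
    return 6


# Hypothetical weights: a deeper first difference gives a higher similarity.
LEVEL_WEIGHTS = {1: 0.0, 2: 0.2, 3: 0.4, 4: 0.6, 5: 0.8, 6: 1.0}


def sim_t(code_a: str, code_b: str) -> float:
    """Simplified code-based similarity; not the patent's formula (2)."""
    return LEVEL_WEIGHTS[first_differing_layer(code_a, code_b)]
```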
Because the SIM_E and SIM_T algorithms are semantically complementary, this embodiment fuses their similarity results complementarily by taking the maximum of the two.
The initial similarity of any triple b_i in a Baidu Encyclopedia subclass document B is calculated by formula (3):
[Formula (3): Figure GDA0002745565490000081]
where 0.3, 0.5 and 0.2 are the weight coefficients of the subject similarity, predicate similarity and object similarity, respectively, in the initial similarity of the whole triple b_i; they can be adjusted according to the target effect.
The specific algorithm is as follows:
InterlinkingValue(B, H)
Input: a Baidu Encyclopedia subclass triple document B and the corresponding Interactive Encyclopedia top-level class triple document H
Output: triple initial similarity hash table Map_B<Key, Value>
[Algorithm listing: Figure GDA0002745565490000082]
[Algorithm listing, continued: Figure GDA0002745565490000091]
This yields the initial similarity of the knowledge triples in the Baidu Encyclopedia subclass document B.
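Since the algorithm listing survives only as an image reference, the following is a hedged reconstruction of what InterlinkingValue might look like, based on the prose: component similarities are fused by taking the maximum of the two measures, weighted 0.3/0.5/0.2, and accumulated over every triple in H. The accumulation by plain summation and the toy stand-in similarity functions are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Sim = Callable[[str, str], float]

WEIGHTS = (0.3, 0.5, 0.2)              # subject, predicate, object weights from the text


def interlinking_value(doc_b: List[Triple], doc_h: List[Triple],
                       sim_e: Sim, sim_t: Sim) -> Dict[Triple, float]:
    """Map each triple of B to its accumulated initial similarity against H."""
    map_b: Dict[Triple, float] = {}
    for b in doc_b:
        total = 0.0
        for h in doc_h:
            # Complementary fusion: keep the larger of the two similarities.
            total += sum(w * max(sim_e(x, y), sim_t(x, y))
                         for w, x, y in zip(WEIGHTS, b, h))
        map_b[b] = total
    return map_b


# Toy stand-ins for SIM_E and SIM_T, for illustration only.
def exact_match(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def always_zero(a: str, b: str) -> float:
    return 0.0


B = [("苏州河", "中文名", "苏州河")]
H = [("苏州河", "中文名", "苏州河"), ("长江", "中文名", "长江")]
result = interlinking_value(B, H, exact_match, always_zero)
```

With these stand-ins the single triple in B scores 1.0 against its identical counterpart and 0.5 against the second triple (only the predicate matches), so its accumulated value is 1.5.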
S15, adding semantic distance to the Infobox knowledge triple tags, and obtaining the Infobox knowledge triple target similarity from the initial similarity through a data field according to a preset method.
It should be noted that, mathematically, a field is a mapping from a vector to another vector or to a number. In physics, a field is a region of space in which every point is subject to a force. The concept originally referred mainly to physical fields such as magnetic, electric and gravitational fields. In these physical fields, the interaction between particles is usually described by a vector field strength function and a scalar potential function. By analogy, a vector field strength function and a scalar potential function can also be defined in a data field. Data field theory builds on field theory in physics: it abstracts the relationships among data in a number domain space as interactions among material particles and formalizes them as a field-theoretic description. The theory expresses the interaction among data through a potential function, which reflects the distribution of the data and allows the data set to be clustered and partitioned according to the equipotential line structure of the data field.
When the triple target similarity is calculated, semantic distance is added to the tags. To do so, this embodiment introduces the concept of a potential function.
Suppose F is the data field generated by the data in D and the function f_X(Y) is its potential function, where X ∈ D and Y ∈ Ω. f_X(Y) denotes the potential value of the data element X at Y and must satisfy the following conditions: (1) f_X(Y) is a continuous, smooth, bounded function; (2) f_X(Y) is isotropic; (3) f_X(Y) is a decreasing function of the distance ||X-Y||: when ||X-Y|| = 0, f_X(Y) takes its maximum value, and when ||X-Y|| → ∞, f_X(Y) → 0. This embodiment uses a commonly used potential function as an example:
Pseudo nuclear force field potential function:
[Formula: Figure GDA0002745565490000092]
where m ≥ 0 represents the strength of the influence of X on Y and can be understood as the mass of X;
Figure GDA0002745565490000093
is called the impact factor and determines the range of influence of the element. When
Figure GDA0002745565490000094
the value of the potential function increases.
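The formula image for the pseudo nuclear force field potential is not reproduced in this text. In the data field literature this potential is commonly written f_X(Y) = m * exp(-(||X-Y|| / σ)^k); the sketch below assumes that form. The exponent `k`, the default parameter values and the function name are assumptions rather than values taken from the patent.

```python
import math


def nuclear_potential(distance: float, m: float = 1.0,
                      sigma: float = 1.0, k: int = 2) -> float:
    """Potential of an element of mass m at the given distance (ASSUMED form)."""
    if m < 0 or sigma <= 0 or distance < 0:
        raise ValueError("require m >= 0, sigma > 0 and distance >= 0")
    return m * math.exp(-((distance / sigma) ** k))
```

Under this form the potential is largest at distance 0, where it equals m, and decays toward 0 as the distance grows, which matches conditions (1) to (3) above; a larger σ widens the range over which the element has noticeable influence.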
In the invention, let T be the tag set of a triple b_i, with T = (t_1, t_2, t_3, ..., t_n), where n > 0 is the number of tags and t_i is the circle-center tag. It should be noted that, as is common knowledge, all entry classification tags mentioned in the invention belong to the concept set of the encyclopedia open classification system. In this embodiment, the shortest path length between two tags is d = |t_i - t_j|. Based on data field theory, the field strength function of the interaction between the circle-center tag t_i of triple b_i and another tag t_j is:
[Formula: Figure GDA0002745565490000101]
where S_b_i denotes the initial similarity of the triple b_i, which is associated, through its tags, with subclass concepts belonging to different top-level classes. The 11 top-level classes in the classification trees of Baidu Encyclopedia and Interactive Encyclopedia are: personage, sports, life, culture, science, economy, history, society, geography, nature and art.
Semantic distances are added to all entry tags of each knowledge triple. If a tag denotes the same concept as the subclass of triple b_i, it is the circle-center tag and its distance is 0. The semantic distance between tags is the distance between another tag in the same top-level concept set and the circle-center tag. It is stipulated that if the current tag belongs to the current top-level ontology concept set, it must be a parent or child concept of the circle-center tag; if it is the direct parent or child concept of the circle-center tag, the semantic distance is 1. Since the maximum depth of the Baidu Encyclopedia classification tree is 2, and so on, the maximum semantic distance between tags is 6.
For example, consider the triple <Suzhou River, Chinese name, Suzhou River> under the subclass "river": Suzhou River is a river, but the entry also carries the tags song, movie and drama. Geography is the direct parent of river and is also a top-level class, so the semantic distance between geography and river is 1. Leisure is a subclass of the top-level class life, and song and drama are subclasses of leisure, so the path between river and song can be written river-geography-root-life-leisure-song (drama), and the semantic distance is therefore 5.
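The distance in this example can be reproduced by walking both tags up to their lowest common ancestor in the classification tree. The sketch below encodes only the fragment of the tree named in the example; the `PARENT` table and the function names are hypothetical.

```python
from typing import Dict, List

# Child -> parent links for the fragment of the classification tree in the example.
PARENT: Dict[str, str] = {
    "geography": "root",
    "life": "root",
    "river": "geography",
    "leisure": "life",
    "song": "leisure",
    "drama": "leisure",
}


def path_to_root(tag: str) -> List[str]:
    """List the tag followed by each of its ancestors up to the root."""
    path = [tag]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_distance(tag_a: str, tag_b: str) -> int:
    """Number of edges on the shortest tree path between two tags."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(tag_a))}
    for steps_from_b, node in enumerate(path_to_root(tag_b)):
        if node in steps_from_a:  # lowest common ancestor reached
            return steps_from_a[node] + steps_from_b
    raise ValueError("tags do not share a common ancestor")
```

Here `semantic_distance("river", "song")` follows river, geography, root on one side and song, leisure, life, root on the other, giving 2 + 3 = 5 as in the text.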
The invention penalizes tags that do not belong to the current top-level ontology concept. The aim is to weaken the initial similarity of triples whose tags lie at a long semantic distance, so that a second ranking can be made on the target similarity and the low-ranked triples removed from the knowledge base. Field strength is therefore calculated from the tags of each triple, and the resulting target similarity is a correction of the initial similarity. The optimized piecewise function, formula (6), is as follows:
[Formula (6): Figure GDA0002745565490000111]
where
Figure GDA0002745565490000112
is the piecewise function constructed here. The sign indicates whether the current tag t_j and the circle-center tag t_i are in the same top-level class: if they are, the tag exerts a positive force on the triple b_i; if not, the tag exerts a negative force on the triple b_i.
Finally, the Baidu Encyclopedia triple target similarity is calculated by formula (7):
[Formula (7): Figure GDA0002745565490000113]
The specific algorithm is as follows:
Algorithm Choose(O, Map_B, T)
Input: Baidu Encyclopedia ontology O, triple initial similarity set Map_B, and tag set T
Output: triple target similarity set Map_B' after data field processing
[Algorithm listing: Figure GDA0002745565490000114]
[Algorithm listing, continued: Figure GDA0002745565490000121]
Sorting the original document and the data-field-processed document in descending order of similarity shows that the method effectively lowers the rank of the triples labeled N, that is, the incorrect triples, which is more conducive to denoising and optimizing the knowledge base.
S16, performing knowledge denoising according to the Infobox knowledge triple target similarity.
After the initial similarities have been processed by the data field, the triples are sorted in descending order of target similarity. For a subclass concept document, following the idea of the golden section point, the number of triples labeled N among the last 41% of the triples is counted and compared with the number of N-labeled triples among the last 41% when ranking by the initial similarity without data field processing, and the results are analyzed and compared. Specifically, the triples are ranked by the calculated target similarity, and a predetermined proportion of the lowest-ranked triples is removed, for example the last 41%.
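The ranking and removal step can be sketched as follows. The 41% ratio comes from the text; truncating the number of removed triples to an integer, and the names `split_by_rank` and `count_n_in_tail`, are assumptions.

```python
from typing import Dict, Hashable, List, Tuple


def split_by_rank(scores: Dict[Hashable, float],
                  drop_ratio: float = 0.41) -> Tuple[List[Hashable], List[Hashable]]:
    """Sort by similarity in descending order and split off the lowest-ranked share."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    drop_count = int(len(ranked) * drop_ratio)  # ASSUMED: round down
    keep_count = len(ranked) - drop_count
    return ranked[:keep_count], ranked[keep_count:]


def count_n_in_tail(scores: Dict[Hashable, float],
                    labels: Dict[Hashable, str],
                    drop_ratio: float = 0.41) -> int:
    """Count triples labeled N among the lowest-ranked share."""
    _, tail = split_by_rank(scores, drop_ratio)
    return sum(1 for triple in tail if labels[triple] == "N")
```

Calling `count_n_in_tail` once with the initial similarities and once with the target similarities gives the two counts that the embodiment compares: a higher count after data field processing means more incorrect triples were pushed into the removed tail.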
The Chinese network encyclopedia knowledge denoising method provided by the embodiment of the invention fuses edit distance with the synonym forest method and constructs a knowledge triple data field by means of entry tags in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases and solves prior-art problems in constructing network-encyclopedia-oriented Chinese knowledge bases, such as semantic repetition, synonymy and improper classification of knowledge triples.
In order to further explain the technical scheme, the invention also provides an embodiment of a Chinese network encyclopedia knowledge denoising system.
Fig. 2 is a schematic structural diagram of a network encyclopedia knowledge denoising system in an embodiment of the present invention. Referring to fig. 2, the system for denoising chinese network encyclopedia knowledge in the embodiment of the present invention includes: a collection module 21, an acquisition module 22, a calculation module 23 and a denoising module 24.
The collection module 21 is configured to collect raw data from Chinese network open encyclopedia resources;
the acquisition module 22 includes a first acquisition unit 221 and a second acquisition unit 222; the first acquisition unit 221 is configured to crawl and parse, based on the raw data, the Infobox knowledge triples on the entry web pages of the entries belonging to a preset concept; the second acquisition unit 222 is configured to crawl the entry tags of the Infobox knowledge triples contained in a preset subclass and add the tags to the Infobox knowledge triples;
the calculation module 23 includes a first calculation unit 231 and a second calculation unit 232; the first calculation unit 231 is configured to calculate the initial similarity of the Infobox knowledge triples; the second calculation unit 232 is configured to add semantic distances to the tags of the Infobox knowledge triples and to obtain, from the initial similarity, the target similarity of the Infobox knowledge triples through a data field according to a preset method;
the knowledge denoising module 24 is configured to perform knowledge denoising according to the target similarity of the Infobox knowledge triples.
The specific manner in which each module of the system in the above embodiment operates has been described in detail in the method embodiment and is not repeated here.
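The module structure described above can be pictured as a chain of stages. The following Python sketch is only an illustration of that wiring: the class names `Triple` and `DenoisingSystem`, the field names, and the callable signatures are hypothetical, and each stage is supplied by the caller rather than implemented here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    tags: List[str] = field(default_factory=list)
    initial_similarity: float = 0.0
    target_similarity: float = 0.0


# A stage takes a list of triples and returns a (possibly shorter) list.
Stage = Callable[[List[Triple]], List[Triple]]


@dataclass
class DenoisingSystem:
    collect: Callable[[], List[str]]            # collection module 21
    parse: Callable[[List[str]], List[Triple]]  # first acquisition unit 221
    tag: Stage                                  # second acquisition unit 222
    score_initial: Stage                        # first calculation unit 231
    score_target: Stage                         # second calculation unit 232
    denoise: Stage                              # knowledge denoising module 24

    def run(self) -> List[Triple]:
        """Run the stages in the order in which the modules are described."""
        triples = self.parse(self.collect())
        for stage in (self.tag, self.score_initial, self.score_target, self.denoise):
            triples = stage(triples)
        return triples
```

Passing the stages in as callables keeps the sketch independent of any particular crawler or similarity implementation.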
The Chinese network encyclopedia knowledge denoising system provided by this embodiment of the invention fuses the edit distance and synonym forest methods and uses entry tags to construct a knowledge triple data field with which massive numbers of knowledge triples are denoised. This reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases, and addresses prior-art problems that arise when a Chinese knowledge base is built from network encyclopedias, such as semantic repetition, synonymy, and improperly classified knowledge triples.
To further explain the technical solution, the invention also provides an embodiment of a knowledge base.
The knowledge base provided by this embodiment is denoised by applying the Chinese network encyclopedia knowledge denoising method described above.
The knowledge base of this embodiment is constructed by fusing the edit distance and synonym forest methods and by using entry tags to construct a knowledge triple data field with which massive numbers of knowledge triples are denoised. This reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases, and addresses prior-art problems in Chinese knowledge base construction such as semantic repetition, synonymy, and improperly classified knowledge triples.
To verify the effect of the Chinese knowledge base construction, the invention also provides a verification embodiment.
Data sets are extracted from Chinese network open encyclopedias such as Baidu Encyclopedia and Interactive Encyclopedia, and two experiments are designed:
(1) For each subclass, the number of N-labeled triples among the last 41% of the ranking is compared between the first stage and the second stage.
(2) For each subclass, the P values obtained after deleting the last 41% in each of the two stages are compared with the P value of the completely unprocessed data.
The effect of each processing stage is observed through these two comparisons, from which the evaluation and conclusions are drawn. All programs of this embodiment are implemented in Java in a stable system environment.
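The count used in comparison (1) can be sketched in Python as follows. The `(label, similarity)` pair layout, the function name `count_n_in_tail`, and rounding the tail size down are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# (manual label, similarity), with label "Y1" or "N"; this layout is an assumption.
Labeled = Tuple[str, float]


def count_n_in_tail(items: List[Labeled], tail_ratio: float = 0.41) -> int:
    """Count N-labeled triples among the lowest-ranked share of the ranking."""
    ranked = sorted(items, key=lambda x: x[1], reverse=True)
    # Assumption: the size of the tail is rounded down.
    n_tail = math.floor(len(ranked) * tail_ratio)
    tail = ranked[len(ranked) - n_tail:]
    return sum(1 for label, _ in tail if label == "N")
```

Calling the function once with the initial similarities and once with the target similarities of the same subclass gives the two counts to compare; a larger count after processing means more mismatched triples have been pushed into the portion that is removed.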
This embodiment uses Chinese network open encyclopedia knowledge bases as the experimental data source. The crawler toolkit HTMLParser is used to crawl and parse the Infobox structured information contained in the open classification pages and entry pages of Baidu Encyclopedia and Interactive Encyclopedia, and the information is organized as Chinese triples to form an initial large-scale Chinese open-domain knowledge base. The crawled data are then divided into the 11 top-level classes of the ontology concept set; each top-level class contains subclass ontology concepts, and each subclass ontology concept contains the corresponding triples. The details are given in Table 1.
TABLE 1. Chinese encyclopedia knowledge base information
Figure GDA0002745565490000141
Table 1 lists the 11 top-level classes, the subclasses of each class, the number of entry triples in each subclass, and the numbers of Y1 and N labels. Y1 and N are the results of manual labeling: a triple is labeled Y1 if it matches the attributes of its subclass, and N otherwise.
After the edit distance and synonym forest similarities have been calculated, each triple is assigned an initial similarity. The manually assigned Y1 or N label of each triple is then merged, in one file, with the tags and subclass concepts of the related concepts extracted from the Chinese network open encyclopedias, forming the input data set for the next stage, the improved data field algorithm provided by the invention.
The triples in each merged subclass concept file are sorted in descending order of initial similarity.
The initial similarity is then further optimized and corrected based on the data field to obtain the final target similarity.
Evaluation of experimental data:
The evaluation metric is the precision P of the selected triples:
P = (number of exported triples labeled Y1 / total number of exported triples) × 100%
The experiment aims at denoising, that is, at the accuracy of the triples kept in each document, so observing the precision over the exported triples labeled Y1 adequately reflects the effect of the system.
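The precision P defined above translates directly into code. A minimal Python sketch, in which the function name `precision` and the convention of returning 0.0 for an empty export are assumptions:

```python
from typing import List


def precision(exported_labels: List[str]) -> float:
    """Return P as a percentage: Y1-labeled exported triples over all exported triples."""
    if not exported_labels:
        return 0.0  # assumption: an empty export is reported as 0%
    y1_count = sum(1 for label in exported_labels if label == "Y1")
    return 100.0 * y1_count / len(exported_labels)
```

For example, an export of four triples of which three are labeled Y1 gives P = 75%.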
Several subclasses under the top-level classes of the Chinese network encyclopedia concept set are selected. Comparison shows that P best reflects the effect of the algorithm before and after processing, so P is used as the evaluation target, as shown in Table 2.
TABLE 2. Chinese network encyclopedia ontology mapping evaluation statistics
Figure GDA0002745565490000151
Results of the experiment
(1) Experiment 1:
After data field processing has been applied using the triple tags, the triples are sorted in descending order both by the resulting similarity and by the completely unprocessed similarity. The N labels among the last 41% of the triples in each subclass concept document are then taken for each ranking, and the change in the number of N labels is compared. Fig. 3 is a bar chart comparing the number of N labels in the last 41% of each subclass in the verification embodiment of the Chinese knowledge base construction method provided by the invention. Fig. 4 is a line chart of the same comparison.
TABLE 3. Change in the number of N labels in the last 41% of each subclass before and after data field processing
Figure GDA0002745565490000161
Referring to Table 3, the number of N labels in the last 41% increases after data field processing in most subclass documents, and the effect is marked; in the most pronounced case the number of N labels increases by 347. The number of N labels decreases in three subclass documents (oceans, banks, and buildings), but the decreases are small and, given that the total number of N labels increases by 761, can be ignored.
Referring to Figs. 3 and 4, the polyline for the data after data field processing lies generally above the polyline for the data before processing. The number of N labels in the last 41% of the subclass documents therefore increases overall after data field processing, which indicates that data field processing improves, to some extent, the accuracy of knowledge base construction from Chinese network open encyclopedias.
(2) Experiment 2:
Precision is an important index for evaluating information retrieval. Starting from the original, completely unprocessed subclass file data, this experiment distinguishes two stages: the first stage is the ranking in descending order of initial similarity; the second stage is the ranking in descending order of similarity after data-field-based processing. In each stage, the last 41% of the triples in each subclass document are deleted and the P value of the remaining triples is calculated; these values are compared stage by stage with the P value of the original unprocessed data.
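The procedure of Experiment 2 can be sketched in Python as follows. The record keys `label`, `initial` and `target`, the function names, and the reading that the original data are evaluated without any deletion are assumptions made for illustration.

```python
import math
from typing import Any, Dict, List

# Hypothetical record layout: {"label": "Y1" or "N", "initial": float, "target": float}
Record = Dict[str, Any]


def precision(records: List[Record]) -> float:
    """P as a percentage of Y1-labeled records; 0.0 for an empty list (assumption)."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if r["label"] == "Y1") / len(records)


def precision_after_deletion(records: List[Record], key: str,
                             drop_ratio: float = 0.41) -> float:
    """Rank by the given similarity, delete the lowest-ranked share, return P of the rest."""
    ranked = sorted(records, key=lambda r: r[key], reverse=True)
    n_drop = math.floor(len(ranked) * drop_ratio)  # assumption: rounded down
    return precision(ranked[: len(ranked) - n_drop])


def compare_stages(records: List[Record]) -> Dict[str, float]:
    """P of the original data and of the two stages for one subclass document."""
    return {
        "original": precision(records),
        "stage 1": precision_after_deletion(records, "initial"),
        "stage 2": precision_after_deletion(records, "target"),
    }
```

Running `compare_stages` on each subclass document yields one row of a three-column comparison of the kind the experiment describes.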
The experimental results are shown in Figs. 5 and 6. Fig. 5 is a bar chart comparing the P values after deleting the last 41% in the two stages of the verification embodiment of the Chinese knowledge base construction method provided by the invention. Fig. 6 is a line chart of the same comparison.
Referring to Figs. 5 and 6, most P values rise from stage to stage, but for some subclass documents, such as events, languages, and calligraphy and painting, the first-stage P value is lower than that of the original data. This indicates that the first stage alone improves P only slightly and is not very effective. Table 4 lists the P values of the three stages for each subclass document, and its last column gives the percentage improvement of the third-stage P value over the original data. The data show that the P value after data field processing is markedly higher than without it in almost all cases; only the subclass document buildings decreases slightly, which is negligible compared with the overall improvement. The method based on data field processing can therefore be judged to improve the accuracy of knowledge base construction.
TABLE 4. P values after deleting the last 41% in the two stages
Figure GDA0002745565490000171
Figure GDA0002745565490000181
Conclusion of the experiment
The above experimental results show that, after the triples are processed with the improved data field, the initial similarity is optimized and corrected. This mitigates, to a certain extent, the ambiguity problem and the improper classification of knowledge triples in Chinese network open encyclopedias, and ultimately improves the accuracy of Chinese knowledge base construction while retaining a large number of related ontology elements, thereby ensuring concept recall.
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within its scope of protection. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the appended claims.
It is understood that the same or similar parts of the above embodiments may be referred to one another, and that content not described in detail in one embodiment may be found in the same or similar parts of other embodiments.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (3)

1. A Chinese network encyclopedia knowledge denoising method, characterized by comprising the following steps:
collecting raw data from Chinese network open encyclopedia resources;
based on the raw data, crawling and parsing the Infobox knowledge triples on the entry web pages of the entries belonging to a preset concept, which comprises: using a crawler tool to crawl and parse the Infobox structured information of the Interactive Encyclopedia and Baidu Encyclopedia entry web pages contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia;
each encyclopedia top-level class contains subclass ontology concepts, and each subclass ontology concept contains the corresponding triples;
each knowledge triple is < S, P, O >, wherein S represents the subject of the triple, P represents the predicate, and O represents the object;
crawling the entry tags of the Infobox knowledge triples contained in a preset subclass, and adding the tags to the Infobox knowledge triples;
calculating the initial similarity of the Infobox knowledge triples, which comprises: calculating a first initial similarity of the triples based on the edit distance; calculating a second initial similarity of the triples based on the synonym forest; and complementarily fusing the first initial similarity and the second initial similarity in a preset manner to obtain the initial similarity;
the first initial similarity includes the similarity between subjects, the similarity between predicates, and the similarity between objects; the similarity between subjects is the edit distance similarity between the subjects of triples b_i and h_j, given by equation (1):
Figure FDA0002745565480000011
wherein |Step(b_is, h_js)| is the number of editing operations required to make b_is and h_js equal to each other, and len(b_is) and len(h_js) are the lengths, in characters, of the terms b_is and h_js; b_i = <b_is, b_ip, b_io> is any triple in a subclass document B of Baidu Encyclopedia, wherein b_is represents the subject, b_ip the predicate, and b_io the object contained in triple b_i; h_j = <h_js, h_jp, h_jo> is any triple in the Interactive Encyclopedia major-class document H corresponding to document B, wherein h_js represents the subject, h_jp the predicate, and h_jo the object contained in triple h_j;
the second initial similarity includes: a second similarity between subjects, a second similarity between predicates, and a second similarity between objects; the second similarity between subjects is the second similarity between the subjects of triples b_i and h_j, given by equation (2):
Figure FDA0002745565480000021
wherein λ is an adjustable parameter, the semantic relevance factor, which controls the possible degree of similarity between terms on branches at different levels, with λ ∈ (0, 1); L = {1, 2, 3, 4, 5}, and for
Figure FDA0002745565480000023
L_n is the level number represented by the n-th level; |L| is the number of elements in the set L, which is always equal to 5 in the system; N_T is the total number of nodes on the n-th level branch of the terms b_is and h_js, and D is the coding distance between the terms b_is and h_js;
for any triple b_i in the Baidu Encyclopedia subclass document B, the initial similarity is calculated by equation (3):
Figure FDA0002745565480000022
wherein 0.3, 0.5 and 0.2 are the weight coefficients of the subject similarity, the predicate similarity and the object similarity, respectively, in calculating the initial similarity of the whole triple b_i;
adding semantic distances to the tags of the Infobox knowledge triples, and obtaining, from the initial similarity, the target similarity of the Infobox knowledge triples through a data field according to a preset method;
wherein adding semantic distances to the tags of the Infobox knowledge triples comprises: calculating the semantic distances by traversing the Chinese encyclopedia classification tree;
further comprising: introducing a pseudo nuclear force field potential function improved with the tag semantic distance;
let F be the data field generated by the data in D, and let the function f_X(Y) be a potential function, wherein X ∈ D and Y ∈ Ω; it gives the potential value produced by data element X at Y, and f_X(Y) satisfies the following conditions: (1) f_X(Y) is a continuous, smooth, bounded function; (2) f_X(Y) is isotropic; (3) f_X(Y) is a decreasing function of the distance ||X-Y||; when ||X-Y|| = 0, f_X(Y) takes its maximum value; when ||X-Y|| → ∞, f_X(Y) → 0;
the pseudo nuclear force field potential function is:
Figure FDA0002745565480000031
wherein m ≥ 0 represents the strength of the influence of X on Y and can be understood as the mass of X;
Figure FDA0002745565480000032
is called the influence factor and determines the influence range of the element; when
Figure FDA0002745565480000033
increases, the potential function value increases;
let the tag set of a triple b_i be T, with T = (t_1, t_2, t_3, …, t_n), where n > 0 denotes the number of tags and t_i is the center tag;
the shortest path length between two tags is d = ||t_i - t_j||; based on data field theory, the field strength function of the interaction between the center tag t_i of triple b_i and another tag t_j is expressed as:
Figure FDA0002745565480000034
and correcting the initial similarity, wherein the optimized piecewise function, equation (6), is:
Figure FDA0002745565480000035
wherein
Figure FDA0002745565480000036
is a piecewise function; Sign indicates whether the current tag t_j and the center tag t_i are in the same major class: if they are, the tag exerts a positive force on triple b_i; if not, the tag exerts a reverse force on triple b_i;
the target similarity of a Baidu Encyclopedia triple is calculated by equation (7):
Figure FDA0002745565480000037
and performing knowledge denoising according to the target similarity of the Infobox knowledge triples, which comprises: sorting the original documents and the documents processed by the improved data field algorithm in descending order of similarity, obtaining a preset number of original data, and performing knowledge denoising;
further comprising: screening all triples of the subclass concepts by labeling them Y1 or N according to semantic relations, wherein Y1 and N are the results of manual labeling: a triple is labeled Y1 if it matches the attributes of the related subclass, and N otherwise;
after the initial similarity has been processed with the data field, sorting in descending order of target similarity; and ranking by the calculated target similarity and removing the triples ranked in the last 41%.
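The edit-distance component and the 0.3 / 0.5 / 0.2 weighting recited in claim 1 can be illustrated with a short Python sketch. Equation (1) is not reproduced in the text, so the normalization by the longer string in `edit_similarity` is an assumption, as is reading "editing operation steps" as the Levenshtein distance and equation (3) as a weighted sum; the function names are hypothetical.

```python
def edit_steps(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - steps / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_steps(a, b) / longest


def initial_similarity(sim_subject: float, sim_predicate: float,
                       sim_object: float) -> float:
    """Weighted sum using the 0.3, 0.5 and 0.2 weights recited in the claim."""
    return 0.3 * sim_subject + 0.5 * sim_predicate + 0.2 * sim_object
```

In the claimed method each component similarity results from fusing the edit-distance and synonym-forest similarities; the sketch leaves that fusion to the caller and only shows the weighting.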
2. A Chinese network encyclopedia knowledge denoising system, characterized by comprising: a collection module, an acquisition module, a calculation module, and a knowledge denoising module;
the collection module is configured to collect raw data from Chinese network open encyclopedia resources;
the acquisition module comprises a first acquisition unit and a second acquisition unit; the first acquisition unit is configured to crawl and parse, based on the raw data, the Infobox knowledge triples on the entry web pages of the entries belonging to a preset concept, which comprises: using a crawler tool to crawl and parse the Infobox structured information of the Interactive Encyclopedia and Baidu Encyclopedia entry web pages contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia; each encyclopedia top-level class contains subclass ontology concepts, and each subclass ontology concept contains the corresponding triples; each knowledge triple is < S, P, O >, wherein S represents the subject of the triple, P represents the predicate, and O represents the object;
the second acquisition unit is configured to crawl the entry tags of the Infobox knowledge triples contained in a preset subclass and add the tags to the Infobox knowledge triples;
the calculation module comprises a first calculation unit and a second calculation unit; the first calculation unit is configured to calculate the initial similarity of the Infobox knowledge triples, which comprises: calculating a first initial similarity of the triples based on the edit distance; calculating a second initial similarity of the triples based on the synonym forest; and complementarily fusing the first initial similarity and the second initial similarity in a preset manner to obtain the initial similarity;
the first initial similarity includes the similarity between subjects, the similarity between predicates, and the similarity between objects; the similarity between subjects is the edit distance similarity between the subjects of triples b_i and h_j, given by equation (1):
Figure FDA0002745565480000041
wherein |Step(b_is, h_js)| is the number of editing operations required to make b_is and h_js equal to each other, and len(b_is) and len(h_js) are the lengths, in characters, of the terms b_is and h_js; b_i = <b_is, b_ip, b_io> is any triple in a subclass document B of Baidu Encyclopedia, wherein b_is represents the subject, b_ip the predicate, and b_io the object contained in triple b_i; h_j = <h_js, h_jp, h_jo> is any triple in the Interactive Encyclopedia major-class document H corresponding to document B, wherein h_js represents the subject, h_jp the predicate, and h_jo the object contained in triple h_j;
the second initial similarity includes: a second similarity between subjects, a second similarity between predicates, and a second similarity between objects; the second similarity between subjects is the second similarity between the subjects of triples b_i and h_j, given by equation (2):
Figure FDA0002745565480000051
wherein λ is an adjustable parameter, the semantic relevance factor, which controls the possible degree of similarity between terms on branches at different levels, with λ ∈ (0, 1); L = {1, 2, 3, 4, 5}, and for
Figure FDA0002745565480000052
L_n is the level number represented by the n-th level; |L| is the number of elements in the set L, which is always equal to 5 in the system; N_T is the total number of nodes on the n-th level branch of the terms b_is and h_js, and D is the coding distance between the terms b_is and h_js;
for any triple b_i in the Baidu Encyclopedia subclass document B, the initial similarity is calculated by equation (3):
Figure FDA0002745565480000053
wherein 0.3, 0.5 and 0.2 are the weight coefficients of the subject similarity, the predicate similarity and the object similarity, respectively, in calculating the initial similarity of the whole triple b_i;
the second calculation unit is configured to add semantic distances to the tags of the Infobox knowledge triples and to obtain, from the initial similarity, the target similarity of the Infobox knowledge triples through a data field according to a preset method; adding semantic distances to the tags of the Infobox knowledge triples comprises: calculating the semantic distances by traversing the Chinese encyclopedia classification tree;
further comprising an introduction module, the introduction module being configured to introduce a pseudo nuclear force field potential function improved with the tag semantic distance;
let F be the data field generated by the data in D, and let the function f_X(Y) be a potential function, wherein X ∈ D and Y ∈ Ω; it gives the potential value produced by data element X at Y, and f_X(Y) satisfies the following conditions: (1) f_X(Y) is a continuous, smooth, bounded function; (2) f_X(Y) is isotropic; (3) f_X(Y) is a decreasing function of the distance ||X-Y||; when ||X-Y|| = 0, f_X(Y) takes its maximum value; when ||X-Y|| → ∞, f_X(Y) → 0;
the pseudo nuclear force field potential function is:
Figure FDA0002745565480000061
wherein m ≥ 0 represents the strength of the influence of X on Y and can be understood as the mass of X;
Figure FDA0002745565480000062
is called the influence factor and determines the influence range of the element; when
Figure FDA0002745565480000063
increases, the potential function value increases;
let the tag set of a triple b_i be T, with T = (t_1, t_2, t_3, …, t_n), where n > 0 denotes the number of tags and t_i is the center tag;
the shortest path length between two tags is d = ||t_i - t_j||; based on data field theory, the field strength function of the interaction between the center tag t_i of triple b_i and another tag t_j is expressed as:
Figure FDA0002745565480000064
further comprising a correction module;
the correction module is configured to correct the initial similarity, wherein the optimized piecewise function, equation (6), is:
Figure FDA0002745565480000065
wherein
Figure FDA0002745565480000066
is a piecewise function; Sign indicates whether the current tag t_j and the center tag t_i are in the same major class: if they are, the tag exerts a positive force on triple b_i; if not, the tag exerts a reverse force on triple b_i;
the first calculation unit is further configured to calculate the target similarity of a Baidu Encyclopedia triple by equation (7):
Figure FDA0002745565480000067
the knowledge denoising module is configured to perform knowledge denoising according to the target similarity of the Infobox knowledge triples, which comprises: sorting the original documents and the documents processed by the improved data field algorithm in descending order of similarity, obtaining a preset number of original data, and performing knowledge denoising;
the knowledge denoising module is further configured to screen all triples of the subclass concepts by labeling them Y1 or N according to semantic relations, wherein Y1 and N are the results of manual labeling: a triple is labeled Y1 if it matches the attributes of the related subclass, and N otherwise; after the initial similarity has been processed with the data field, to sort in descending order of target similarity; and to rank by the calculated target similarity and remove the triples ranked in the last 41%.
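Conditions (1) to (3) recited for the potential function constrain its shape, while the formula itself is not reproduced in the text. The Python sketch below assumes a Gaussian-like form m·exp(-(d/σ)²), which is common in the data field literature and satisfies those conditions; it is not the patent's equation. The names `potential`, `signed_influence`, `mass` and `sigma` are hypothetical, and applying the same-class sign as a simple factor is likewise an assumption rather than equations (5) to (7).

```python
import math


def potential(distance: float, mass: float = 1.0, sigma: float = 1.0) -> float:
    """Assumed potential m * exp(-(d / sigma)^2): bounded, isotropic, decreasing in d."""
    if distance < 0 or mass < 0 or sigma <= 0:
        raise ValueError("distance and mass must be >= 0 and sigma must be > 0")
    return mass * math.exp(-((distance / sigma) ** 2))


def signed_influence(distance: float, same_major_class: bool,
                     mass: float = 1.0, sigma: float = 1.0) -> float:
    """Positive influence for a tag in the same major class, negative otherwise (assumption)."""
    sign = 1.0 if same_major_class else -1.0
    return sign * potential(distance, mass, sigma)
```

Here `distance` stands for the shortest path length d between the center tag and another tag in the classification tree, and a larger `sigma` widens the range over which a tag still has a noticeable influence.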
3. A knowledge base, wherein the construction of the knowledge base applies the Chinese network encyclopedia knowledge denoising method of claim 1.
CN201910024995.8A 2019-01-10 2019-01-10 Chinese network encyclopedia knowledge denoising method, system and knowledge base Expired - Fee Related CN109783650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910024995.8A CN109783650B (en) 2019-01-10 2019-01-10 Chinese network encyclopedia knowledge denoising method, system and knowledge base

Publications (2)

Publication Number Publication Date
CN109783650A CN109783650A (en) 2019-05-21
CN109783650B true CN109783650B (en) 2020-12-11

Family

ID=66500379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910024995.8A Expired - Fee Related CN109783650B (en) 2019-01-10 2019-01-10 Chinese network encyclopedia knowledge denoising method, system and knowledge base

Country Status (1)

Country Link
CN (1) CN109783650B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377747B (en) * 2019-06-10 2021-12-07 Hohai University Knowledge base fusion method for encyclopedic website
CN112308464B (en) * 2020-11-24 2023-11-24 People's Public Security University of China Business process data processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699767A (en) * 2015-02-15 2015-06-10 Capital University of Economics and Business Large-scale ontology mapping method for Chinese languages
CN106503148A (en) * 2016-10-21 2017-03-15 Southeast University Table entity linking method based on multiple knowledge bases

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10831762B2 (en) * 2015-11-06 2020-11-10 International Business Machines Corporation Extracting and denoising concept mentions using distributed representations of concepts

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
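Title
"A large-scale Chinese linked data model based on data field and global sequence alignment"; Wang Ting; Journal of Chinese Information Processing; 20160531; Vol. 30, No. 3; pp. 204-212 *
"Large-scale ontology mapping based on data field"; Zhong Qian; Chinese Journal of Computers; 20100615; Vol. 33, No. 6; pp. 955-964 *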

Also Published As

Publication number Publication date
CN109783650A (en) 2019-05-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2020-12-11

Termination date: 2022-01-10
