CN109783650B

CN109783650B - Chinese network encyclopedia knowledge denoising method, system and knowledge base

Info

Publication number: CN109783650B
Application number: CN201910024995.8A
Authority: CN
Inventors: 王汀
Original assignee: CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Current assignee: CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2020-12-11
Anticipated expiration: 2039-01-10
Also published as: CN109783650A

Abstract

The invention relates to a Chinese network encyclopedia knowledge denoising method, a Chinese network encyclopedia knowledge denoising system and a knowledge base, and belongs to the technical field of computers. The method fuses edit distance with the synonym forest method to compute similarity, and uses the entry labels of Chinese encyclopedias to construct a data field over Infobox knowledge triples, through which massive numbers of knowledge triples are denoised. The aim is to reduce, as far as possible, the large-scale repetition and ambiguity found in Chinese open encyclopedia knowledge bases, and to solve the prior-art problems of semantic repetition, synonymy and improper classification of knowledge triples when a Chinese knowledge base is constructed.

Description

Chinese network encyclopedia knowledge denoising method, system and knowledge base

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a Chinese network encyclopedia knowledge denoising method, a Chinese network encyclopedia knowledge denoising system and a knowledge base.

Background

With the development and application of technologies such as computer networks and the mobile internet, the internet has gradually become the main platform on which people publish, exchange and share information. Information inquiry, knowledge acquisition and skill learning have gradually moved from offline to online; storage has shifted from plain text to semi-structured and structured formats; new information carriers such as online encyclopedias and encyclopedia websites have developed rapidly; and the volume of stored data has grown quickly. As an important collection for storing, organizing and processing knowledge and for providing knowledge services, the knowledge base is becoming the foundation on which various industries develop knowledge management and knowledge services.

However, Chinese network encyclopedia knowledge bases suffer from low efficiency and accuracy, because a large number of repeated and synonymous concepts exist among Chinese word concepts and knowledge is improperly classified.

Disclosure of Invention

To solve the problems of semantic repetition, improper classification of knowledge triples and the like in prior-art Chinese network encyclopedia knowledge bases, the invention provides a Chinese network encyclopedia knowledge denoising method, a Chinese network encyclopedia knowledge denoising system and a knowledge base, which are characterized by high precision, among other advantages.

The invention provides the following technical scheme:

in one aspect, a method for denoising knowledge of Chinese network encyclopedia comprises the following steps:

collecting original data in the open encyclopedia resources of the Chinese network;

crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for the vocabulary entry to which a preset concept belongs based on the original data;

crawling entry labels of the Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;

calculating the initial similarity of the Infobox knowledge triple;

adding a semantic distance to the labels of the Infobox knowledge triples, and obtaining the target similarity of the Infobox knowledge triples from the initial similarity through a data field according to a preset method;

and carrying out knowledge denoising according to the similarity of the Infobox knowledge triple target.

Further optionally, after crawling and parsing the Infobox knowledge triples on the entry web page for the entry belonging to the preset concept based on the original data, the method further includes: organizing the data so that each top-level large class contains subclass ontology concepts and each subclass ontology concept contains the corresponding triples.

Further optionally, the method further comprises: all triplets of sub-category concepts are filtered in the form of labels Y1 or N according to semantic relationships.

Further optionally, the crawling and parsing of the Infobox knowledge triples on the entry web page for the entry belonging to the preset concept based on the original data includes: using a crawler tool to crawl and parse the structured information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia.

Further optionally, the calculating the initial similarity of the Infobox knowledge triples includes: calculating a first initial similarity of the triples based on the edit distance;

calculating a second initial similarity of the triples based on the synonym forest;

and performing complementary fusion on the first initial similarity and the second initial similarity according to a preset mode to obtain the initial similarity.

Further optionally, the adding a semantic distance to the Infobox knowledge triple tag includes: and performing semantic distance calculation by traversing the Chinese encyclopedia classification tree.

Further optionally, the method further comprises: and introducing a pseudo-nuclear force field potential function with improved tag semantic distance.

Further optionally, the performing knowledge denoising according to the Infobox knowledge triple target similarity includes: and arranging the original documents and the documents processed by the improved data field algorithm from big to small according to the similarity, acquiring a preset number of original data, and performing knowledge denoising.

In another aspect, a system for denoising knowledge of chinese network encyclopedia is provided, where the system includes: the device comprises a collection module, an acquisition module, a calculation module and a knowledge denoising module.

The collection module is used for collecting original data in the open encyclopedia resources of the Chinese network;

the acquisition module comprises a first acquisition unit and a second acquisition unit; the first acquisition unit is used for crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for a vocabulary entry to which a preset concept belongs based on the original data; the second obtaining unit is used for crawling entry labels of the Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;

the computing module comprises a first computing unit and a second computing unit; the first computing unit is used for computing the initial similarity of the Infobox knowledge triple; the second computing unit is used for adding semantic distance to the Infobox knowledge triple label and acquiring Infobox knowledge triple target similarity through a data field according to the initial similarity and a preset method;

and the knowledge denoising module is used for denoising the knowledge according to the similarity of the Infobox knowledge triple target.

In another aspect, a knowledge base is constructed by applying any one of the above-mentioned denoising methods for the knowledge base.

The method, system and knowledge base for denoising Chinese network encyclopedia knowledge provided by the embodiments of the invention fuse edit distance with the synonym forest method and construct a knowledge triple data field from entry labels in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large-scale repetition and ambiguity in Chinese open encyclopedia knowledge bases, and solves the prior-art problems of semantic repetition, synonymy and improper classification of entry Infobox knowledge triples when a Chinese knowledge base is constructed.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of a denoising method for network encyclopedia knowledge in an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a network encyclopedia knowledge denoising system in an embodiment of the present invention;

FIG. 3 is a bar chart comparing, for each subclass, the number of N-labeled triples in the last 41% in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;

FIG. 4 is a line chart comparing, for each subclass, the number of N-labeled triples in the last 41% in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;

FIG. 5 is a bar chart comparing the P values after deletion of the last 41% in the two stages in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention;

FIG. 6 is a line chart comparing the P values after deletion of the last 41% in the two stages in a verification embodiment of the Chinese network encyclopedia knowledge denoising method provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

In order to more clearly illustrate the process and advantages of the method of the embodiment of the invention, the invention provides a Chinese network encyclopedia knowledge denoising method, which comprises the following steps:

crawling and analyzing an Infobox knowledge triple on a vocabulary entry web page for a vocabulary entry to which a preset concept belongs based on original data;

crawling entry labels of Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;

calculating initial similarity of Infobox knowledge triples;

adding a semantic distance to the labels of the Infobox knowledge triples, and obtaining the target similarity of the Infobox knowledge triples from the initial similarity through a data field according to a preset method.

The Chinese network encyclopedia knowledge denoising method provided by the embodiment of the invention fuses edit distance with the synonym forest method and constructs a knowledge triple data field from entry labels in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large-scale repetition and ambiguity in Chinese open encyclopedia knowledge bases, and solves the prior-art problems of semantic repetition, synonymy and improper classification of knowledge triples when a Chinese knowledge base is constructed.

Based on the method for constructing the Chinese knowledge base, the embodiment of the invention provides an optional embodiment: fig. 1 is a flowchart of a network encyclopedia knowledge denoising method in an embodiment of the present invention. Referring to fig. 1, the method for denoising the chinese network encyclopedia knowledge in this embodiment may include the following steps:

and S11, collecting the original data in the open encyclopedia resources of the Chinese network.

Specifically, Chinese network encyclopedia resources are selected, for example the entry web pages of Baidu Encyclopedia and Interactive Encyclopedia, as raw data sources, and the raw data are collected from them.

S12, crawling and parsing the Infobox knowledge triples on the entry web page for the entry to which the preset concept belongs, based on the original data.

Specifically, a crawler tool is used to crawl and parse the Infobox structured information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia. The information is organized as Chinese triples to form a large-scale Chinese open-domain knowledge base that is still to be refined.
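As a minimal sketch of the last part of this step, the following Python snippet turns the attribute/value pairs of one already-parsed Infobox into <S, P, O> triples. It assumes the crawler has delivered the Infobox as a dictionary; the function name `infobox_to_triples` and the input shape are illustrative assumptions, not part of the described system.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infobox_to_triples(entry: str, infobox: Dict[str, str]) -> List[Triple]:
    """Turn one entry's Infobox attribute/value pairs into <S, P, O> triples.

    Hypothetical helper: the entry name is the subject, the attribute name
    the predicate and the attribute value the object.
    """
    triples: List[Triple] = []
    for attribute, value in infobox.items():
        attribute, value = attribute.strip(), value.strip()
        if attribute and value:  # skip cells left empty by parsing
            triples.append((entry, attribute, value))
    return triples


# The example triple used later in the description.
example = infobox_to_triples(
    "Suzhou River", {"Chinese name": " Suzhou River ", "length": ""}
)
```

The empty "length" cell is dropped, so only the (Suzhou River, Chinese name, Suzhou River) triple is produced.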

After crawling and analyzing the Infobox knowledge triples on the vocabulary entry web page for the vocabulary entry to which the preset concept belongs based on the original data, the method further comprises the following steps: organizing each top-level large class into an ontology concept hierarchical relationship containing a subclass concept, and then organizing the subclass ontology concept to contain the corresponding triple.

And S13, crawling entry labels of the Infobox knowledge triples contained in the preset subclasses, and adding the labels to the Infobox knowledge triples.

Specifically, starting from the subclass concept instance triples, the labels of the related entries are crawled from the Chinese network encyclopedia and attached, in an agreed format, to the triples of each entry in the corresponding triple documents. In this embodiment, the Baidu Encyclopedia classification tree has three layers of ontology concept structure. The top-level classes selected from the ontology concept set of the Chinese network encyclopedias (namely Baidu Encyclopedia and Interactive Encyclopedia) are: geography, economy, science, history, character, society, life, sports, culture, art and nature.

All triplets of sub-class concepts are first screened and labeled in the form of label Y1 or N according to semantic relationships. The labeling of Y1 and N may be performed manually. For example, the labeling of Y1 and N in this embodiment is performed by manual labeling, and the labeling principle is set to determine whether the attributes of the sub-categories related to the concept to which the triplets belong are semantically matched, and if so, the triplets are labeled as Y1, and if not, the triplets are labeled as N. Note that the manual labeling herein is not a limitation on the labeling method, but merely an example.

And S14, calculating initial similarity of the Infobox knowledge triples.

In the invention, a triple is defined as consisting of a subject, a predicate and an object, and a knowledge triple is written <S, P, O>, where S is the subject of the triple, P the predicate and O the object. The initial similarity is calculated between a Baidu Encyclopedia subclass triple document and the corresponding Interactive Encyclopedia top-level large-class triple document. For example, to calculate the initial similarity for the Baidu Encyclopedia subclass document "currency", triple similarity must be calculated against the Interactive Encyclopedia document of the large class "economy", because the subclass "currency" belongs to the large class "economy" in Baidu Encyclopedia; each triple is therefore mapped against all triples in the Interactive Encyclopedia "economy" triple document, and the initial similarities are calculated and accumulated.

Specifically, when initial similarity calculation is performed, a first initial similarity is calculated based on an edit distance; calculating a second initial similarity based on the synonym forest; and performing complementary fusion on the first initial similarity and the second initial similarity according to a preset mode to obtain a third initial similarity.

The initial similarity calculation calls for a method with low resource requirements and high efficiency, so a similarity calculation method based on edit distance is adopted. The edit distance algorithm SIM_E yields the literal similarity between triples while ignoring semantic relevance.

Let b_i = <b_is, b_ip, b_io> be any triple in a subclass document B of Baidu Encyclopedia, and let h_j = <h_js, h_jp, h_jo> be any triple in the Interactive Encyclopedia large-class document H corresponding to document B. The edit distance similarity between the subjects of triples b_i and h_j is then given by formula (1); the edit distance similarity between predicates and between objects is obtained in the same way.

where |Step(b_is, h_js)| is the number of edit operations required to make b_is and h_js equal, and len(b_is) and len(h_js) are the lengths, in characters, of the words b_is and h_js.

This calculation yields the first initial similarity of the Baidu Encyclopedia knowledge triples.
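To illustrate the edit-distance stage, here is a minimal Python sketch. Formula (1) is not reproduced in the text, so the normalization used here, 1 minus the number of edit steps divided by the longer string length, is only one common choice and is an assumption rather than the formula of the invention. The names `edit_steps` and `sim_e` are likewise illustrative.

```python
def edit_steps(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def sim_e(a: str, b: str) -> float:
    """Literal similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_steps(a, b) / longest
```

The same function would be applied to subject, predicate and object in turn, since the text states that all three are handled in the same way.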

The synonym forest (Tongyici Cilin) is a Chinese synonym dictionary compiled by Mei Jiaju et al. in 1983, originally intended to supply more synonyms to assist writing and translation work. The dictionary includes not only the synonyms of a word but also a certain number of similar words, that is, related words in a broad sense. It encodes each word and organizes the vocabulary hierarchically in a tree structure with five levels from top to bottom. Each level has a corresponding code segment, and the five segments arranged from left to right form the word forest code of the entry. Each node in the tree represents a concept, and the semantic relatedness between entries increases with the depth of the level. Chinese concept coreference recognition can in practice be abstracted as a Chinese synonym recognition problem. The present invention actually uses an extended version, namely the Synonym Forest (Extended Version) of Harbin Institute of Technology, as the dictionary for the second initial similarity calculation.

According to the structural characteristics of the word forest, the word forest codes of the subject, predicate and object of a triple are first parsed and the first- to fifth-level sub-codes extracted; comparison then starts from the first-level sub-code. When the sub-codes differ, the mapping pair is given a similarity weight according to the level at which the difference occurs: the deeper the level at which the sub-codes first differ, the higher the similarity weight, and vice versa. The number of branch nodes at each level also affects the similarity. This embodiment adopts an improved similarity calculation formula, formula (2), again taking the similarity between the subjects of two triples as the example; the similarity between predicates and between objects is obtained in the same way.

where λ is an adjustable semantic relatedness factor that controls the possible degree of similarity between entries on branches at different levels, λ ∈ (0, 1); L = {1, 2, 3, 4, 5}, and for L_n ∈ L, L_n is the level number of the n-th level; |L| is the number of elements in the set L, which equals 5 in this system; N_T is the total number of nodes on the n-th level branch containing the words b_is and h_js; and D is the coding distance between the words b_is and h_js.

This calculation yields the second initial similarity of the Baidu Encyclopedia knowledge triples.
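The level-by-level code comparison can be sketched as follows. This Python snippet assumes the eight-character code layout commonly used by the extended synonym forest (for example `Aa01A01=`: one upper-case letter, one lower-case letter, two digits, one upper-case letter, two digits, then a flag character). Formula (2), with its λ, N_T and D terms, is not reproduced in the text, so the `LEVEL_WEIGHTS` table below is a purely hypothetical stand-in that only preserves the stated ordering: the deeper the first difference, the higher the weight.

```python
from typing import List


def split_cilin_code(code: str) -> List[str]:
    """Split a code such as 'Aa01A01=' into its five level sub-codes."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def first_differing_level(code_a: str, code_b: str) -> int:
    """Return 1..5 for the first level at which the codes differ,
    or 6 when all five levels agree."""
    pairs = zip(split_cilin_code(code_a), split_cilin_code(code_b))
    for level, (x, y) in enumerate(pairs, start=1):
        if x != y:
            return level
    return 6


# Hypothetical weights, not taken from formula (2).
LEVEL_WEIGHTS = {1: 0.1, 2: 0.3, 3: 0.5, 4: 0.7, 5: 0.9, 6: 1.0}


def sim_t_sketch(code_a: str, code_b: str) -> float:
    """Stand-in for SIM_T: weight chosen by the first differing level."""
    return LEVEL_WEIGHTS[first_differing_level(code_a, code_b)]
```

A fuller version would also account for the number of branch nodes at the differing level, which the text says influences the similarity.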

Because the SIM_E algorithm and the SIM_T algorithm are semantically complementary, this embodiment fuses their similarity results in a complementary way by taking the maximum of the two results.

For any triple b_i in a Baidu Encyclopedia subclass document B, the initial similarity provided by the invention is calculated as shown in formula (3):

where 0.3, 0.5 and 0.2 are the weight coefficients of the subject similarity, predicate similarity and object similarity, respectively, in the initial similarity calculation of the whole triple b_i; they can be adjusted according to the target effect.
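The fusion described above can be sketched for a single pair of triples as follows. The maximum rule and the weights 0.3, 0.5 and 0.2 come from the text; the function name, the pluggable `sim_e` and `sim_t` callables and the exact way the terms are combined are assumptions, since formula (3) itself is not reproduced here.

```python
from typing import Callable, Tuple

Triple = Tuple[str, str, str]        # (subject, predicate, object)
Sim = Callable[[str, str], float]    # similarity of two strings

WEIGHTS = (0.3, 0.5, 0.2)            # subject, predicate, object


def fused_pair_similarity(b: Triple, h: Triple, sim_e: Sim, sim_t: Sim,
                          weights: Tuple[float, float, float] = WEIGHTS) -> float:
    """Weighted sum over S, P and O of max(edit-distance similarity,
    synonym-forest similarity). A sketch, not formula (3) verbatim."""
    total = 0.0
    for weight, x, y in zip(weights, b, h):
        total += weight * max(sim_e(x, y), sim_t(x, y))
    return total
```

Passing the two similarity functions in as parameters keeps the weighting logic independent of how each similarity is computed.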

The specific algorithm is as follows:

InterlinkingValue(B,H)

Input: a triple document B of a certain Baidu Encyclopedia subclass and the corresponding Interactive Encyclopedia large-class triple document H

Output: the triple initial similarity hash table Map_B<Key, Value>

This calculation yields the initial similarity of the knowledge triples in the Baidu Encyclopedia subclass document B.
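The body of InterlinkingValue is not reproduced in the text. As a rough sketch under stated assumptions, the Python function below builds the output table by scoring each triple of B against every triple of H and accumulating the results, following the earlier remark that the similarities are "calculated and accumulated". The `pair_sim` callable is a hypothetical placeholder for the per-pair similarity.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def interlinking_value(doc_b: Iterable[Triple], doc_h: List[Triple],
                       pair_sim: Callable[[Triple, Triple], float]
                       ) -> Dict[Triple, float]:
    """Return a Map_B-like dict: each triple of B is the key, and its
    similarity accumulated over all triples of H is the value."""
    map_b: Dict[Triple, float] = {}
    for b in doc_b:
        # Accumulation over H is an assumption drawn from the description.
        map_b[b] = sum(pair_sim(b, h) for h in doc_h)
    return map_b
```

With documents of realistic size this is a quadratic loop, which is consistent with the text's emphasis on choosing an inexpensive per-pair similarity.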

S15, adding semantic distance to the Infobox knowledge triple label, and obtaining the Infobox knowledge triple target similarity through a data field according to a preset method according to the initial similarity.

It should be noted that, mathematically, a field refers to a mapping from one vector to another vector or to a number. In physics, a field refers to a region of space in which every point is subjected to a force. The concept originally referred to physical fields such as the magnetic field, the electric field and the gravitational field. In these physical fields, the interaction between particles is usually described by a vector field strength function and a scalar potential function. By analogy with the physical field, a vector field strength function and a scalar potential function can also be defined in a data field. Data field theory builds on the field theory of physics: it abstracts the relations among data in a data space as interactions among material particles and formalizes them as a field-theoretic description. The theory expresses the interaction among different data through a potential function, thereby reflecting the distribution characteristics of the data, and clusters and partitions the data set according to the equipotential line structure of the data field.

And when the triple target similarity is calculated, adding semantic distance to the label. In the embodiment, when the semantic distance is added to the label, a concept, namely a potential function, is introduced.

Suppose F is the data field generated by the data in D, and the function f_X(Y) is its potential function, where X ∈ D and Y ∈ Ω. f_X(Y) denotes the potential value of the data element X at Y and must satisfy the following conditions: (1) f_X(Y) is a continuous, smooth, bounded function; (2) f_X(Y) is isotropic; (3) f_X(Y) is a decreasing function of the distance ||X-Y||: when ||X-Y|| = 0, f_X(Y) takes its maximum value, and when ||X-Y|| → ∞, f_X(Y) → 0. In this embodiment, one commonly used potential function is given as an example:

pseudo nuclear force field potential function:

where m ≥ 0 represents the influence strength of X on Y, and can be understood as the mass of X.

Called impact factors, which determine the impact range of the element. When in use

When the value of the potential function increases.
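For illustration only, the Python sketch below uses the form in which the pseudo-nuclear potential is commonly written in the data field literature, m·exp(-(d/σ)^k). The formula itself is not reproduced in the text above, so this form, the parameter names `sigma` (impact factor) and `k`, and their default values are all assumptions.

```python
import math


def pseudo_nuclear_potential(distance: float, mass: float = 1.0,
                             sigma: float = 1.0, k: int = 2) -> float:
    """Potential contributed by a data element at the given distance.

    Assumed form: mass * exp(-(distance / sigma) ** k).
    mass  : influence strength m >= 0 (the 'mass' of the element)
    sigma : assumed name of the impact factor, sets the influence range
    k     : assumed distance exponent
    """
    if mass < 0 or sigma <= 0:
        raise ValueError("mass must be >= 0 and sigma must be > 0")
    return mass * math.exp(-((distance / sigma) ** k))
```

Under this assumed form the function meets the three conditions listed earlier: it is smooth and bounded, depends only on the distance, equals m at distance 0 and tends to 0 as the distance grows.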

In the present invention, assume that the label set of a certain triple b_i is T, with T = (t_1, t_2, t_3, ..., t_n), where n > 0 denotes the number of labels and t_i is the circle-center label. It should be noted that, as common general knowledge, all entry classification labels mentioned in the present invention belong to the concept set of the encyclopedia open classification system. In this embodiment, the shortest path length between two labels is d = |t_i - t_j|. Based on data field theory, the field strength function for the interaction between the circle-center label t_i of triple b_i and another label t_j is expressed as follows:

where S_b_i represents the initial similarity of the triple b_i, which is associated with the subclass concepts, belonging to different major classes, that correspond to its labels. The 11 top-level large classes in the Baidu Encyclopedia and Interactive Encyclopedia classification trees are: character, sports, life, culture, science, economy, history, society, geography, nature and art.

A semantic distance is added to every entry label of each knowledge triple. A label that names the same subclass concept as the one to which triple b_i belongs is the circle-center label, with distance value 0. The semantic distance of any other label is its distance to the circle-center label within the concept set. We stipulate that if the current label belongs to the current top-level large ontology concept set, it must be a parent or child concept of the circle-center label; if it is the direct parent or child concept of the circle-center label, the semantic distance is 1. Because the maximum depth of the Baidu Encyclopedia classification tree is 2, it follows by the same reasoning that the maximum semantic distance between labels is 6.

Take a triple under the subclass "river", (Suzhou River / Chinese name / Suzhou River), as an example: the Suzhou River is a river, but the entry also carries the labels song, movie and drama. Geography is the direct parent of river and is also a top-level large class, so the semantic distance between geography and river is 1. Song and scenario are subclasses of leisure, which is a subclass of the top-level large class life, so the path between river and song can be written as river - geography - root - life - leisure - song (scenario), and the semantic distance is therefore 5.
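The path counting in this example can be sketched in a few lines of Python. The `PARENT` table below encodes only the small fragment of the classification tree named in the example; the table layout and the function names are illustrative assumptions.

```python
from typing import Dict, List

# Child -> parent links for the fragment of the tree used in the example.
PARENT: Dict[str, str] = {
    "geography": "root", "life": "root",
    "river": "geography", "leisure": "life",
    "song": "leisure", "scenario": "leisure",
}


def path_to_root(label: str) -> List[str]:
    """Labels on the way from the given label up to the root, inclusive."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def semantic_distance(a: str, b: str) -> int:
    """Number of edges on the shortest path between two labels."""
    steps_from_b = {node: steps for steps, node in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in steps_from_b:  # lowest common ancestor reached
            return steps_from_a + steps_from_b[node]
    raise ValueError("labels are not in the same tree")
```

Here `semantic_distance("river", "geography")` gives 1 and `semantic_distance("river", "song")` gives 5, matching the distances stated in the example.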

The invention penalizes labels that do not belong to the current main ontology concept. The aim is to weaken the initial similarity of triples carrying labels with longer semantic distances, so that a second ranking can be carried out on the target similarity and the triples ranked toward the bottom can be removed from the knowledge base. The field strength is therefore calculated from the labels in each triple, and the resulting target similarity is a correction of the initial similarity. The optimized piecewise function, formula (6), is as follows:

where Sign is the piecewise function we constructed. It indicates whether the current label t_j and the circle-center label t_i are in the same major class: if they are, the label exerts a positive force on the triple b_i; if they are not, the label exerts a counteracting force on the triple b_i.

Finally, the formula for calculating the target similarity of Baidu Encyclopedia triples, formula (7), is given:

The specific algorithm is as follows:

Algorithm Choose(O, Map_B, T)

Input: the Baidu Encyclopedia ontology O, the triple initial similarity set Map_B and the label set T

Output: the triple target similarity set Map_B' after data field processing

Sorting both the original document and the data-field-processed document in descending order of similarity shows that the method is effective at pushing down the rank of triples labeled N, that is, incorrect triples, which makes it more useful for denoising and optimizing the knowledge base.

And S16, carrying out knowledge denoising according to the similarity of the Infobox knowledge triple target.

After the initial similarity has been processed by the data field, the triples are sorted in descending order of target similarity. For each sub-concept document, following the idea of the golden section point, the number of triples labeled N among the last 41% of triples is counted and compared with the number of N-labeled triples among the last 41% when ranked by the initial similarity without data field processing, and the results are analyzed and compared. Specifically, the triples are ranked by the calculated target similarity, and the lowest-ranked triples in a predetermined proportion are removed. For example, the triples ranked in the last 41% are removed.

The Chinese network encyclopedia knowledge denoising method provided by the embodiment of the invention fuses edit distance with the synonym forest method and constructs a knowledge triple data field from entry labels in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large-scale repetition and ambiguity in Chinese open encyclopedia knowledge bases, and solves the prior-art problems of semantic repetition, synonymy and improper classification of knowledge triples when a network-encyclopedia-oriented Chinese knowledge base is constructed.
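The ranking and removal step can be sketched as follows in Python. The 41% proportion and the N label come from the text; the rounding rule (floor), the function names and the dictionary-based inputs are assumptions made for illustration.

```python
import math
from typing import Dict, Hashable, List, Tuple


def split_by_rank(scores: Dict[Hashable, float],
                  tail_ratio: float = 0.41) -> Tuple[List, List]:
    """Sort items by similarity in descending order and split off the tail.

    Returns (kept, removed); removed holds the lowest-ranked share.
    Rounding down the tail size is an assumption.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    tail_size = math.floor(len(ranked) * tail_ratio)
    cut = len(ranked) - tail_size
    return ranked[:cut], ranked[cut:]


def count_label(items: List, labels: Dict[Hashable, str],
                label: str = "N") -> int:
    """Count how many of the given items carry the given manual label."""
    return sum(1 for item in items if labels.get(item) == label)
```

Running `split_by_rank` once on the initial similarities and once on the target similarities, then applying `count_label` to each removed tail, gives the two N counts that the text compares.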

In order to further explain the technical scheme, the invention also provides an embodiment of a Chinese network encyclopedia knowledge denoising system.

Fig. 2 is a schematic structural diagram of a network encyclopedia knowledge denoising system in an embodiment of the present invention. Referring to fig. 2, the system for denoising chinese network encyclopedia knowledge in the embodiment of the present invention includes: a collection module 21, an acquisition module 22, a calculation module 23 and a denoising module 24.

The collection module 21 is configured to collect raw data from the open encyclopedia resource of the chinese network;

an acquiring module 22, including a first acquiring unit 221 and a second acquiring unit 222; the first obtaining unit 221 is configured to perform crawling and parsing on an Infobox knowledge triple on a vocabulary entry web page for a vocabulary entry to which a preset concept belongs based on original data; the second obtaining unit 222 is configured to crawl entry tags of the Infobox knowledge triples included in the preset subclasses, and add the tags to the Infobox knowledge triples;

the calculation module 23 includes a first calculation unit 231 and a second calculation unit 232; the first calculating unit 231 is used for calculating initial similarity of the Infobox knowledge triples; the second calculating unit 232 is configured to add a semantic distance to the Infobox knowledge triple tag, and obtain an Infobox knowledge triple target similarity according to the initial similarity and a preset method through a data field;

the knowledge denoising module 24 is configured to perform knowledge denoising according to the similarity of the target of the Infobox knowledge triplet.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

The Chinese network encyclopedia knowledge denoising system provided by the embodiment of the invention fuses edit distance with the synonym forest method and constructs a knowledge triple data field from entry labels in order to denoise massive numbers of knowledge triples. This reduces, as far as possible, the large-scale repetition and ambiguity in Chinese open encyclopedia knowledge bases, and solves the prior-art problems of semantic repetition, synonymy and improper classification of knowledge triples when a network-encyclopedia-oriented Chinese knowledge base is constructed.

In order to further explain the technical scheme, the invention also provides an embodiment of a knowledge base.

The knowledge base provided by the embodiment is denoised by applying the Chinese network encyclopedia knowledge denoising method.

The knowledge base of this embodiment is constructed by fusing the edit distance with the synonym forest method. A knowledge triple data field is constructed from entry labels to denoise massive numbers of knowledge triples, which reduces, as far as possible, the large amount of repetition and ambiguity in Chinese open encyclopedia knowledge bases and addresses the prior-art problems of semantic repetition, synonymy and improper classification of knowledge triples when constructing a Chinese knowledge base.

In order to verify the effect of the construction of the Chinese knowledge base, the invention also provides a verification embodiment.

Data sets are extracted from Chinese online open encyclopedias such as Baidu Encyclopedia and Interactive Encyclopedia, and two experiments are designed:

(1) Compare the number of triples labeled N among the last 41% of each subclass's data after the first stage and after the second stage.

(2) Compare the P values of each subclass after deleting the last 41%, for the two stages and for the completely unprocessed data.

The effect of each processing stage is observed through these two comparisons, and an evaluation and conclusion are drawn. All programs of this embodiment are implemented in Java in a stable system environment.

This embodiment uses Chinese online open encyclopedia knowledge bases as the experimental data source. The crawler toolkit HTMLParser is used to crawl and parse the Infobox structured information contained in the open classification pages and entry pages of Baidu Encyclopedia and Interactive Encyclopedia, and the information is organized as Chinese triples to form an initial large-scale Chinese open-domain knowledge base. The crawled data is then divided among the 11 top-level major classes of the ontology concept set; each top-level major class contains subclass ontology concepts, and each subclass ontology concept contains its corresponding triples. Table 1 shows the information of the encyclopedia knowledge base.

TABLE 1 Chinese encyclopedia knowledge base information

Table 1 lists the 11 top-level major classes, the subclasses of each major class, the number of entry triples in each subclass, and the numbers of triples labeled Y1 and N. Y1 and N are manual labels: a triple is labeled Y1 if it matches the attributes of its subclass, and N otherwise.

After the edit distance similarity and the synonym forest similarity are calculated, each triple is assigned an initial similarity. The manual Y1 or N label of each triple is then merged into one file with the labels and subclass concepts of the related concepts extracted from the Chinese online open encyclopedias, forming the input data set for the improved data field algorithm of the invention in the next stage.

The triples in the merged subclass concept file are sorted in descending order of initial similarity.

And further optimizing and correcting the initial similarity based on the data field to finally obtain the target similarity.

Evaluation of experimental data:

The first measure is the precision P of the selected triples:

$$P = \frac{\text{number of exported triples labeled Y1}}{\text{total number of exported triples}} \times 100\%$$

The experiment here aims at denoising, which requires the triples retained in a document to be accurate; the precision computed over the triples labeled Y1 therefore adequately reflects the effect of the system.
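The precision measure above can be illustrated with a short sketch. The function name `precision` and the sample label list are hypothetical and are not taken from the embodiment; only the labels Y1 and N and the ratio itself come from the text.

```python
def precision(labels):
    """Return P as a percentage: the share of exported triples labeled 'Y1'."""
    if not labels:
        raise ValueError("no exported triples")
    y1_count = sum(1 for label in labels if label == "Y1")
    return 100.0 * y1_count / len(labels)


# Hypothetical manual labels for four exported triples.
exported_labels = ["Y1", "Y1", "N", "Y1"]
p_value = precision(exported_labels)
```

Three of the four sample triples are labeled Y1, so the sketch yields P = 75%.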

Several subclasses under the top-level classes of the Chinese network encyclopedia concept set were selected. Comparison showed that P better reflects the effect before and after the algorithm, so P is used as the evaluation target, as shown in Table 2.

TABLE 2. Chinese network encyclopedia ontology mapping evaluation statistical table

Results of the experiment

(1) Experiment 1:

After data field processing of the triple labels, the resulting similarities and the completely unprocessed similarities are each sorted in descending order. The labels of the last 41% of the triples in each subclass concept document are then taken, and the change in the number of N labels is compared. FIG. 3 is a bar chart comparing the number of N labels in the last 41% of each subclass in the verification embodiment of the Chinese knowledge base construction method provided by the invention. FIG. 4 is a line chart of the same comparison.
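The counting step of Experiment 1 can be sketched as follows. The helper name `count_n_in_tail`, the sample label lists and the choice to round the tail size down are assumptions made for illustration; the text only states that the last 41% of the ranked triples is inspected.

```python
import math


def count_n_in_tail(ranked_labels, tail_fraction=0.41):
    """Count 'N' labels among the last tail_fraction of a ranked list.

    ranked_labels must already be sorted by similarity in descending order.
    The tail size is rounded down, which is an assumption of this sketch.
    """
    tail_size = math.floor(len(ranked_labels) * tail_fraction)
    tail = ranked_labels[len(ranked_labels) - tail_size:]
    return sum(1 for label in tail if label == "N")


# Hypothetical rankings of the same ten triples before and after processing.
before = ["Y1", "N", "Y1", "Y1", "N", "Y1", "N", "Y1", "N", "Y1"]
after = ["Y1", "Y1", "Y1", "Y1", "Y1", "Y1", "N", "N", "N", "N"]
change = count_n_in_tail(after) - count_n_in_tail(before)
```

A positive `change` means more noisy triples have been pushed into the portion that will be deleted, which is the effect the experiment looks for.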

TABLE 3 Change in the number of N labels in the last 41% of each subclass before and after data field processing

Referring to Table 3, the number of N labels in the last 41% of most subclass documents increases after data field processing, and the effect is significant; the largest increase is 347. The number of N labels decreases in three subclass documents (oceans, banks and buildings), but the decrease is small and can be ignored given that the total number of N labels increases by 761.

Referring to FIG. 3 and FIG. 4, the polyline for the data after data field processing lies generally above the polyline for the data before processing. It can therefore be concluded that the number of N labels in the last 41% of the subclass documents increases overall after data field processing; that is, data field processing improves accuracy to a certain extent when constructing a knowledge base from Chinese online open encyclopedias.

(2) Experiment 2

Precision is an important index for evaluating information retrieval. Starting from the original, completely unprocessed subclass file data, this experiment defines two stages: the first stage is the data sorted in descending order of initial similarity; the second stage is the data sorted in descending order of similarity after data-field-based processing. For the original unprocessed data and for each of the two stages, the last 41% of the triples in each subclass document are deleted, and the P value of the remaining triples is computed for stage-by-stage comparison.

The experimental results are shown in the figures. FIG. 5 is a bar chart comparing the P values after deleting the last 41% in the two stages of the verification embodiment of the Chinese knowledge base construction method provided by the invention. FIG. 6 is a line chart of the same comparison.

Referring to FIG. 5 and FIG. 6, most P values increase from stage to stage. However, for some subclass documents, such as events, languages, calligraphy and painting, the P value in the first stage is lower than that of the original data, which indicates that the first stage alone raises the P value only slightly and is not very effective. Table 4 lists the P values of the three cases (original data, first stage and second stage) for each subclass document, with the last column giving the percentage improvement of the data field stage over the original data. The data show that the P value after data field processing is, in most cases, markedly higher than the P value without data field processing; only the subclass document buildings shows a slight decrease, which is negligible compared with the overall improvement. The method based on data field processing can therefore be judged to improve the accuracy of knowledge base construction.

TABLE 4 P values after deleting the last 41% in the two stages

Conclusion of the experiment

The above experimental results show that after the triples are processed with the improved data field, the initial similarity is optimized and corrected. To a certain extent, this effectively avoids the ambiguity in Chinese online open encyclopedias and the improper classification of knowledge triples, and ultimately improves the accuracy of Chinese knowledge base construction while retaining a large number of related ontology elements and preserving concept recall.

The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A Chinese network encyclopedia knowledge denoising method is characterized by comprising the following steps:

based on the raw data, crawling and parsing the Infobox knowledge triples on the entry web pages of the entries to which a preset concept belongs, comprising: crawling and parsing, with a crawler tool, the Infobox structured information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia;

each encyclopedic top-level large class contains a subclass ontology concept, and the subclass ontology concept contains the corresponding triple;

each knowledge triple is < S, P, O >, wherein S represents a subject of the triple, P represents a predicate, and O represents an object;

calculating the initial similarity of the Infobox knowledge triple, comprising the following steps: calculating a first initial similarity of the triples based on the edit distance; calculating a second initial similarity of the triples based on the synonym forest; performing complementary fusion on the first initial similarity and the second initial similarity according to a preset mode to obtain initial similarity;

the first initial similarity includes the similarity between the subjects, the similarity between the predicates, and the similarity between the objects; the similarity between the subjects is: the edit distance similarity between the subjects of triples $b_i$ and $h_j$, given by equation (1):

wherein $|Step(b_{is}, h_{js})|$ is the number of editing operation steps required to make $b_{is}$ and $h_{js}$ equal to each other, and $len(b_{is})$ and $len(h_{js})$ are the lengths, in characters, of the words $b_{is}$ and $h_{js}$; any triple in a Baidu Encyclopedia subclass document $B$ is $b_i = \langle b_{is}, b_{ip}, b_{io} \rangle$, wherein $b_{is}$ is the subject, $b_{ip}$ the predicate and $b_{io}$ the object contained in triple $b_i$; any triple in the Interactive Encyclopedia major-class document $H$ corresponding to document $B$ is $h_j = \langle h_{js}, h_{jp}, h_{jo} \rangle$, wherein $h_{js}$ is the subject, $h_{jp}$ the predicate and $h_{jo}$ the object contained in triple $h_j$;
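The quantities defined above can be computed as in the following sketch. `edit_steps` is a standard Levenshtein distance and corresponds to the number of editing steps. The normalization used in `edit_similarity` (one minus the step count divided by the longer length) is an assumption made only for illustration, because equation (1) itself is not reproduced in this text; both function names are hypothetical.

```python
def edit_steps(a, b):
    """Levenshtein distance: insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - steps / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_steps(a, b) / longest


# Hypothetical subjects taken from two encyclopedias.
similarity = edit_similarity("北京大学", "北大")
```

With this assumed normalization, identical subjects score 1.0 and the sample pair, which differs by two deletions over four characters, scores 0.5.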

the second initial similarity includes: a second similarity between the subjects, a second similarity between the predicates, and a second similarity between the objects; the second similarity between the subjects is: the second similarity between the subjects of triples $b_i$ and $h_j$, given by equation (2):

wherein $L_n$ is the number of layers represented by the $n$-th layer, and $|L|$ is the number of elements in the set $L$, which is always equal to 5 in the system; $N_T$ is the total number of nodes on the $n$-th level branch of the words $b_{is}$ and $h_{js}$, and $D$ is the coding distance between the words $b_{is}$ and $h_{js}$;
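The five layers mentioned above can be illustrated by locating the layer at which two synonym forest codes first diverge. This sketch assumes the code layout of the widely used extended synonym forest (one uppercase letter, one lowercase letter, two digits, one uppercase letter, two digits, then a marker character); the text does not state which edition is used, the names `split_levels` and `first_divergent_level` are hypothetical, and equation (2) itself is not reconstructed here.

```python
# Character ranges of the five layers in an assumed 8-character code such as "Aa01A01=".
LEVEL_SLICES = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]


def split_levels(code):
    """Split a code into its five layer segments."""
    return [code[start:end] for start, end in LEVEL_SLICES]


def first_divergent_level(code_a, code_b):
    """Return the 1-based layer where the codes first differ, or None if all five match."""
    pairs = zip(split_levels(code_a), split_levels(code_b))
    for level, (part_a, part_b) in enumerate(pairs, start=1):
        if part_a != part_b:
            return level
    return None


# Hypothetical codes for two words.
level = first_divergent_level("Aa01A01=", "Aa01B02=")
```

The deeper the layer at which two codes diverge, the longer the branch the two words share in the synonym forest tree.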

the initial similarity of any triple $b_i$ in the Baidu Encyclopedia subclass document $B$ is calculated by equation (3):

wherein 0.3, 0.5 and 0.2 are the weight coefficients of the subject similarity, the predicate similarity and the object similarity, respectively, in the calculation of the initial similarity of the whole triple $b_i$;
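The weighting described above can be sketched as a weighted sum of the three component similarities. The weights 0.3, 0.5 and 0.2 come from the text; treating equation (3) as a plain weighted sum, and the names `WEIGHTS` and `initial_similarity`, are assumptions, since the equation and the fusion of the two component measures are not reproduced here.

```python
# Weights stated in the text for subject, predicate and object similarity.
WEIGHTS = {"subject": 0.3, "predicate": 0.5, "object": 0.2}


def initial_similarity(subject_sim, predicate_sim, object_sim):
    """Assumed form: weighted sum of the three component similarities."""
    return (WEIGHTS["subject"] * subject_sim
            + WEIGHTS["predicate"] * predicate_sim
            + WEIGHTS["object"] * object_sim)


# Hypothetical component similarities for one triple pair.
score = initial_similarity(subject_sim=1.0, predicate_sim=0.5, object_sim=0.0)
```

Because the predicate carries the largest weight, two triples with matching predicates score higher than two triples that only share an object.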

the adding of the semantic distance to the Infobox knowledge triple label comprises the following steps: performing semantic distance calculation by traversing the Chinese encyclopedia classification tree;
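One way to obtain a semantic distance by traversing a classification tree is a breadth-first search for the shortest path between two labels. The sketch below illustrates that idea only; the function name `shortest_path_length` and the tiny tree are hypothetical, and the claim does not state which traversal the method actually uses.

```python
from collections import deque


def shortest_path_length(tree_edges, start, goal):
    """Number of edges on the shortest path between two labels, or None if unreachable."""
    if start == goal:
        return 0
    adjacency = {}
    for parent, child in tree_edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


# Hypothetical fragment of a classification tree as (parent, child) pairs.
edges = [("root", "science"), ("root", "culture"),
         ("science", "physics"), ("science", "chemistry"),
         ("culture", "painting")]
distance = shortest_path_length(edges, "physics", "painting")
```

Labels under the same parent are two edges apart, while labels in different major classes must pass through the root and are therefore farther apart.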

further comprising: introducing a pseudo-nuclear force field potential function with improved tag semantic distance;

let $F$ be the data field generated by the data in $D$, and let the function $f_X(Y)$ be a potential function, where $X \in D$ and $Y \in \Omega$; it gives the potential value of the data element $X$ at $Y$, and $f_X(Y)$ satisfies the following conditions: (1) $f_X(Y)$ is a continuous, smooth, bounded function; (2) $f_X(Y)$ is isotropic; (3) $f_X(Y)$ is a decreasing function of the distance $\|X - Y\|$: when $\|X - Y\| = 0$, $f_X(Y)$ takes its maximum value; when $\|X - Y\| \to \infty$, $f_X(Y) \to 0$;

Pseudo nuclear force field potential function:

wherein $m \geq 0$ represents the influence strength of $X$ on $Y$ and can be understood as the mass of $X$;

called influence factor, determines the influence range of the element; when in use

When the potential function value is increased, the potential function value is increased;
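The three conditions on a potential function can be checked numerically for a candidate function. The sketch below assumes the exponential form $m \cdot e^{-(d/\sigma)^k}$ that is commonly cited for pseudo nuclear force fields in the data field literature; this form, the names `potential`, `mass`, `sigma` and `k`, and the default values are assumptions, because the formula used by the method is not reproduced in this text.

```python
import math


def potential(distance, mass=1.0, sigma=1.0, k=2):
    """Assumed potential: mass * exp(-(distance / sigma) ** k).

    mass >= 0 plays the role of m; sigma > 0 is a hypothetical influence factor.
    """
    if mass < 0 or sigma <= 0:
        raise ValueError("mass must be >= 0 and sigma must be > 0")
    if distance < 0:
        raise ValueError("distance must be >= 0")
    return mass * math.exp(-((distance / sigma) ** k))


# The potential is largest at distance 0 and decays as the distance grows.
samples = [potential(d, mass=2.0, sigma=1.5) for d in (0.0, 1.0, 2.0, 4.0)]
```

The function depends only on the distance, which gives isotropy; it is bounded by `mass` and decreases toward 0 as the distance increases.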

if the label set of a triple $b_i$ is $T$, then $T = (t_1, t_2, t_3, \ldots, t_n)$, where $n > 0$ denotes the number of labels and $t_i$ is the circle-center label;

the shortest path length between two labels is $d = |t_i - t_j|$; based on data field theory, the field strength function of the interaction between the circle-center label $t_i$ of triple $b_i$ and another label $t_j$ is expressed as:

and correcting the initial similarity, wherein the optimized piecewise function formula (6) is as follows:

wherein

is a piecewise function; its sign indicates whether the current label $t_j$ and the circle-center label $t_i$ are in the same major class: if they are, the label exerts a positive force on triple $b_i$; if not, the label exerts a reverse force on triple $b_i$;

the Baidu Encyclopedia triple target similarity is calculated by formula (7):

and denoising the knowledge according to the target similarity of the Infobox knowledge triples, comprising: sorting the original documents and the documents processed by the improved data field algorithm in descending order of similarity, acquiring a preset number of original data, and performing knowledge denoising;

further comprising: screening all triples of the subclass concepts in a mode of labeling Y1 or N according to semantic relations, wherein Y1 and N are results of manual labeling, the principle is to see whether the triples are matched with the related subclass attributes, if so, labeling Y1, and otherwise, labeling N;

after the initial similarity is subjected to data field processing, sorting in a descending order according to the target similarity; and according to the calculated target similarity, carrying out similarity ranking and removing the triples ranked at the last 41%.
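The final ranking and removal step can be sketched as follows. The function name `denoise`, the sample triples and the choice to round the number of removed triples down are assumptions; the 41% cut-off and the descending sort by target similarity come from the claim.

```python
import math


def denoise(scored_triples, drop_fraction=0.41):
    """Sort (triple, target_similarity) pairs in descending order and drop the tail.

    The number of removed triples is rounded down, which is an assumption of this sketch.
    """
    ranked = sorted(scored_triples, key=lambda item: item[1], reverse=True)
    drop_count = math.floor(len(ranked) * drop_fraction)
    return ranked[:len(ranked) - drop_count]


# Hypothetical triples with target similarities 0.0, 0.1, ..., 0.9.
scored = [(("S%d" % i, "P", "O"), i / 10) for i in range(10)]
kept = denoise(scored)
```

For ten triples the sketch removes the four lowest-ranked ones and keeps the six with the highest target similarity.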

2. A Chinese network encyclopedia knowledge denoising system, characterized by comprising: a collection module, an acquisition module, a calculation module and a knowledge denoising module;

the acquisition module comprises a first acquisition unit and a second acquisition unit; the first acquisition unit is used for crawling and parsing, based on the raw data, the Infobox knowledge triples on the entry web pages of the entries to which a preset concept belongs, comprising: crawling and parsing, with a crawler tool, the Infobox structured information contained in the open classification pages and entry pages of Interactive Encyclopedia and Baidu Encyclopedia; each encyclopedia top-level major class contains subclass ontology concepts, and each subclass ontology concept contains the corresponding triples; each knowledge triple is < S, P, O >, wherein S represents the subject of the triple, P represents the predicate, and O represents the object;

the second acquisition unit is used for crawling the entry labels of the Infobox knowledge triples contained in a preset subclass, and adding the labels to the Infobox knowledge triples;

the calculation module comprises a first calculation unit and a second calculation unit; the first calculation unit is used for calculating the initial similarity of the Infobox knowledge triples, comprising: calculating a first initial similarity of the triples based on the edit distance; calculating a second initial similarity of the triples based on the synonym forest; performing complementary fusion on the first initial similarity and the second initial similarity according to a preset mode to obtain the initial similarity;

wherein $|Step(b_{is}, h_{js})|$ is the number of editing operation steps required to make $b_{is}$ and $h_{js}$ equal to each other, and $len(b_{is})$ and $len(h_{js})$ are the lengths, in characters, of the words $b_{is}$ and $h_{js}$; any triple in a Baidu Encyclopedia subclass document $B$ is $b_i = \langle b_{is}, b_{ip}, b_{io} \rangle$, wherein $b_{is}$ is the subject, $b_{ip}$ the predicate and $b_{io}$ the object contained in triple $b_i$; any triple in the Interactive Encyclopedia major-class document $H$ corresponding to document $B$ is $h_j = \langle h_{js}, h_{jp}, h_{jo} \rangle$, wherein $h_{js}$ is the subject, $h_{jp}$ the predicate and $h_{jo}$ the object contained in triple $h_j$;

the second initial similarity includes: a second similarity between the subjects, a second similarity between the predicates, and a second similarity between the objects; the second similarity between the subjects is: the second similarity between the subjects of triples $b_i$ and $h_j$, given by equation (2):

the second calculation unit is used for adding semantic distances to the Infobox knowledge triple labels and obtaining, through a data field, the target similarity of the Infobox knowledge triples from the initial similarity according to a preset method; the adding of the semantic distance to the Infobox knowledge triple labels comprises: performing semantic distance calculation by traversing the Chinese encyclopedia classification tree;

further comprising: an introduction module; the introduction module is used for introducing a pseudo-nuclear force field potential function with improved tag semantic distance;

Pseudo nuclear force field potential function:

further comprising: a correction module;

the correction module is used for correcting the initial similarity, and the optimized piecewise function formula (6) is as follows:

wherein

the first calculation unit is further configured to calculate the Baidu Encyclopedia triple target similarity by formula (7):

the knowledge denoising module is used for denoising the knowledge according to the target similarity of the Infobox knowledge triples, comprising: sorting the original documents and the documents processed by the improved data field algorithm in descending order of similarity, acquiring a preset number of original data, and performing knowledge denoising;

the knowledge denoising module is further used for screening all triples of the subclass concepts in a form of labeling Y1 or N according to the semantic relationship, wherein Y1 and N are results of manual labeling, the principle is to see whether the triples are matched with the related subclass attributes, if so, Y1 is labeled, otherwise, N is labeled; after the initial similarity is subjected to data field processing, sorting in a descending order according to the target similarity; and according to the calculated target similarity, carrying out similarity ranking and removing the triples ranked at the last 41%.

3. A knowledge base, wherein the construction of the knowledge base applies the Chinese network encyclopedia knowledge denoising method of claim 1.