CN109508420B

CN109508420B - Method and device for cleaning attributes of knowledge graph

Info

Publication number: CN109508420B
Application number: CN201811415629.7A
Authority: CN
Inventors: 岳聪
Original assignee: Beijing Yufanzhi Information Technology Co ltd
Current assignee: Beijing Yufanzhi Information Technology Co ltd
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2021-12-07
Anticipated expiration: 2038-11-26
Also published as: CN109508420A

Abstract

Embodiments of the invention disclose a method and a device for cleaning attributes of a knowledge graph, relating to the technical field of knowledge graphs. The main technical scheme is as follows: when a knowledge graph is received, a triple is extracted from the knowledge graph by parsing it, wherein the triple is an element combination consisting of an entity element, an attribute element associated with the entity element, and an attribute value element corresponding to the attribute element; a query question corresponding to the triple is constructed using a preset template; the query question is searched on the Internet through a search engine to obtain corresponding search results; and if the attribute value element contained in the triple is determined to be wrong according to the search results, the attribute value element is deleted. Embodiments of the invention are mainly applied to automatically checking and cleaning the attribute values of a knowledge graph.

Description

Method and device for cleaning attributes of knowledge graph

Technical Field

The embodiment of the invention relates to the technical field of knowledge graphs, in particular to a method and a device for cleaning the attributes of a knowledge graph.

Background

A knowledge graph (KG), also called a scientific knowledge graph and known in library and information science as knowledge domain visualization or knowledge domain mapping, is a series of graphs that display the development process and structural relationships of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the relationships among knowledge resources and knowledge carriers.

Currently, a knowledge graph can be constructed directly from structured external data sources with high credibility, such as Baidu Baike (Baidu Encyclopedia). Because such external data sources are easy to obtain, constructing the knowledge graph is simple and convenient. However, these external data sources also contain manually edited information, so editing errors cannot be avoided. The existing countermeasure is to check the constructed knowledge graph manually, which consumes labor and makes checking inefficient.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for cleaning attributes of a knowledge graph. The main aim is to replace manual operation with automatic checking, optimize the process of checking knowledge graph attributes, improve checking efficiency, and directly and automatically clean erroneous attribute values.

In order to achieve the above purpose, the embodiments of the present invention mainly provide the following technical solutions:

in a first aspect, an embodiment of the present invention provides a method for cleaning knowledge graph attributes, where the method includes:

when a knowledge graph is received, extracting a triple from the knowledge graph through analysis of the knowledge graph, wherein the triple is an element combination consisting of an entity element, an attribute element associated with the entity element and an attribute value element corresponding to the attribute element;

constructing a query question corresponding to the triple by using a preset template;

searching the query problem on the Internet through a search engine to obtain a corresponding search result;

and if the attribute value elements contained in the triples are determined to be wrong according to the search result, deleting the attribute value elements.

Optionally, the determining, according to the search result, that the attribute value element included in the triple is incorrect includes:

counting the total number of the search results corresponding to the query question;

crawling data information corresponding to the search result;

judging whether entity elements, attribute elements and attribute value elements contained in the triples exist in the data information corresponding to the search result;

if so, marking the search result as a valid search result;

calculating a percentage of the number of valid search results to the total number;

determining whether the attribute value elements contained by the triplets are erroneous according to the percentage of the number of the valid search results to the total number.

Optionally, if the query question corresponding to the triple is searched through one search engine, the determining whether the attribute value element contained in the triple is wrong according to the percentage of the number of valid search results to the total number includes:

acquiring the percentage of the number of the effective search results corresponding to the search engine to the total number;

judging whether the percentage of the number of the effective search results to the total number is smaller than a first preset threshold value or not;

and if so, determining that the attribute value element contained in the triple is wrong.

Optionally, if the query questions corresponding to the triplets are respectively searched by a plurality of search engines, determining whether the attribute value elements included in the triplets are incorrect according to the percentage of the number of the valid search results to the total number includes:

acquiring the percentage of the number of the effective search results corresponding to the plurality of search engines respectively to the total number;

according to the weights distributed to the plurality of search engines in advance, performing weighting processing on the percentages corresponding to the plurality of search engines to obtain the percentages after weighting processing;

judging whether the weighted percentage is smaller than a first preset threshold value;

and if so, determining that the attribute value element contained in the triple is wrong.

Optionally, before calculating the percentage of the number of valid search results to the total number, the method further includes:

if the attribute elements and the attribute value elements contained in the triples are found in the data information corresponding to the search results and the entity elements contained in the triples are not found, extracting the nouns contained in the data information corresponding to the search results;

judging whether the noun is an alias of an entity element contained in the triple;

and if so, marking the search result as a valid search result.

Optionally, after the search result is marked as a valid search result, the method further includes:

acquiring answer information contained in the data information corresponding to the search result and like (upvote) information corresponding to the answer information;

counting the total number of likes corresponding to the search result according to the answer information and the like information corresponding to the answer information;

and if the total number of likes is smaller than a second preset threshold value, deleting the search result from the valid search results.

Optionally, the method further includes:

judging whether the entity element, attribute element and attribute value element contained in the triple exist in the title information corresponding to a valid search result;

if so, marking the title corresponding to the valid search result as a candidate query question;

determining the order of the candidate query questions according to the order in which the valid search results are displayed on the web page;

and selecting, in that order, a preset number of the candidate query questions as newly added query questions corresponding to the triple.
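
The title-based selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `new_queries_from_titles`, the plain substring test for the three triple elements, and the sample titles are all assumptions made for the example.

```python
from typing import List, Tuple


def new_queries_from_titles(titles_in_page_order: List[str],
                            triple: Tuple[str, str, str],
                            limit: int) -> List[str]:
    """Pick up to `limit` titles that mention all three triple elements.

    The input list is assumed to already follow the order in which the
    valid search results were displayed on the web page, so slicing keeps
    the highest-ranked candidates.
    """
    candidates = [
        title for title in titles_in_page_order
        if all(element in title for element in triple)
    ]
    return candidates[:limit]


# Hypothetical titles of valid search results, in display order.
titles = [
    "Why is the population of China 1.3 billion?",
    "China travel guide",
    "Is the population of China really 1.3 billion",
    "China population 1.3 billion explained",
]
new_queries = new_queries_from_titles(
    titles, ("China", "population", "1.3 billion"), limit=2
)
```

With `limit=2`, only the first two matching titles are kept, even though the last title also contains all three elements.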

Optionally, the searching the query question on the internet through a search engine further includes:

and searching the query questions corresponding to the triples on the question-answer community of the Internet through a search engine.

In a second aspect, an embodiment of the present invention further provides a cleaning apparatus for knowledge-graph attributes, where the apparatus includes:

the extracting unit is used for extracting a triple from the knowledge graph through analysis of the knowledge graph when the knowledge graph is received, wherein the triple is an element combination consisting of an entity element, an attribute element associated with the entity element and an attribute value element corresponding to the attribute element;

the construction unit is used for constructing the query question corresponding to the triple extracted by the extracting unit by using a preset template;

the searching unit is used for searching the query question constructed by the construction unit on the Internet through a search engine to obtain a corresponding search result;

the determining unit is used for determining whether the attribute value elements contained in the triples are wrong or not according to the search results searched by the searching unit;

and the deleting unit is used for deleting the attribute value element when the attribute value element contained in the triple is determined to be wrong according to the determining unit.

Optionally, the determining unit includes:

the statistic module is used for counting the total number of the search results corresponding to the query question;

the crawling module is used for crawling data information corresponding to the search result;

the judging module is used for judging whether entity elements, attribute elements and attribute value elements contained in the triples exist in data information corresponding to the search results crawled by the crawling module;

the marking module is used for marking the search result as an effective search result when the judging module judges that the entity elements, the attribute elements and the attribute value elements contained in the triples exist in the data information corresponding to the search result;

the calculating module is used for calculating the percentage of the number of the effective search results marked by the marking module to the total number counted by the counting module;

a determining module, configured to determine whether the attribute value elements included in the triple are incorrect according to the percentage of the number of valid search results calculated by the calculating module to the total number.

Optionally, the determining module includes:

the obtaining submodule is used for obtaining the percentage of the number of the effective search results corresponding to the search engine to the total number;

the judging submodule is used for judging whether the percentage of the number of the effective search results acquired by the acquiring submodule to the total number is smaller than a first preset threshold value or not;

and the determining submodule is used for determining that the attribute value elements contained in the triples are wrong when the judging submodule judges that the percentage is smaller than a first preset threshold value.

Optionally, the determining module includes:

the obtaining submodule is further configured to obtain, for each of the plurality of search engines, the percentage of the number of valid search results to the total number;

the processing submodule is used for performing weighting processing on the percentages corresponding to the plurality of search engines acquired by the acquisition submodule according to weights distributed to the plurality of search engines in advance to obtain the percentages after the weighting processing;

the judging submodule is used for judging whether the weighted percentage obtained by the processing submodule is smaller than a first preset threshold value or not;

the determining submodule is configured to determine that the attribute value element included in the triple is incorrect when the judging submodule judges that the weighted percentage is smaller than the first preset threshold.

Optionally, the determining unit further includes:

the extracting module is used for, before the percentage of the number of valid search results to the total number is calculated, extracting the nouns contained in the data information corresponding to a search result if the attribute element and the attribute value element contained in the triple are found in that data information but the entity element contained in the triple is not;

the judging module is further configured to judge whether the noun extracted by the extracting module is an alias of the entity element included in the triple;

the marking module is further configured to mark the search result as a valid search result when the judging module judges that the noun is an alias of the entity element included in the triple.

Optionally, the determining unit further includes:

the acquisition module is used for acquiring, after the search result is marked as a valid search result, the answer information contained in the data information corresponding to the search result and the like (upvote) information corresponding to the answer information;

the statistic module is further configured to count the total number of likes corresponding to the search result according to the answer information and the like information corresponding to the answer information;

and the deleting module is used for deleting the search result from the valid search results if the total number of likes counted by the statistic module is smaller than a second preset threshold value.

Optionally, the apparatus further comprises:

the judging unit is used for judging whether the entity element, attribute element and attribute value element contained in the triple exist in the title information corresponding to a valid search result;

the marking unit is used for marking the title corresponding to the valid search result as a candidate query question when the judging unit judges that the title information corresponding to the valid search result contains the entity element, the attribute element and the attribute value element contained in the triple;

the determining unit is further configured to determine the order of the candidate query questions according to the order in which the valid search results are displayed on the web page;

the determining unit is further configured to select, in that order, a preset number of the candidate query questions as newly added query questions corresponding to the triple.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

at least one processor;

and at least one memory and a bus connected to the processor; wherein,

the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method of cleaning of knowledge-graph attributes as described above.

In a fourth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for cleaning knowledge-graph attributes as described above.

Through the above, the technical solutions provided by the embodiments of the present invention have at least the following advantages:

The embodiments of the invention provide a method and a device for cleaning knowledge graph attributes. A query question is constructed for a triple of a knowledge graph (namely, entity, attribute and attribute value) using a preset template, the query question is searched on the Internet through a search engine to obtain search results, and the data information of the search results that contain the triple is then used to judge automatically whether the attribute value contained in the triple is wrong, so that obviously wrong attribute values can be cleaned automatically. This solves the problem in the prior art that checking knowledge graph attributes for errors manually consumes labor and is inefficient. In the embodiments of the invention, manual operation is replaced by automatic checking, the process of checking knowledge graph attributes is optimized, checking efficiency is improved, and erroneous attribute values can be cleaned directly and automatically.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the embodiments are easier to understand, specific embodiments of the present invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow diagram illustrating a method for cleaning knowledge-graph attributes provided by embodiments of the present invention;

FIG. 2 is a flow diagram illustrating another method for cleaning knowledge-graph attributes provided by embodiments of the present invention;

FIG. 3 is a block diagram illustrating the components of a cleaning apparatus for knowledge-graph attributes provided in accordance with an embodiment of the present invention;

FIG. 4 is a block diagram illustrating the components of another cleaning apparatus for knowledge-graph attributes provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of an electronic device for cleaning knowledge-graph attributes according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the embodiments to those skilled in the art.

An embodiment of the invention provides a method for cleaning knowledge graph attributes, as shown in FIG. 1. The method constructs a query question for a triple composed of an entity element, an attribute element and an attribute value element, searches the query question on the Internet through a search engine, and uses the data information of the corresponding search results to judge whether the attribute value element contained in the triple is wrong. The specific steps are as follows:

101. When a knowledge graph is received, extract triples from the knowledge graph by parsing the knowledge graph.

The triple is an element combination composed of an entity element, an attribute element associated with the entity element, and an attribute value element corresponding to the attribute element, for example "China-population-1.3 billion".

By parsing a knowledge graph, triples of multiple relationship types can be obtained. In the embodiment of the present invention, the knowledge graph is parsed and "entity-attribute-attribute value" triples are obtained from it, so that the attribute values of the attributes contained in the triples can be checked.
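
As an illustration of step 101, the following minimal Python sketch extracts triples from an already parsed knowledge graph. The nested-dictionary input layout and the names `Triple` and `extract_triples` are assumptions made for this example; the embodiment does not prescribe a storage format.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """One (entity, attribute, attribute value) combination."""
    entity: str
    attribute: str
    value: str


def extract_triples(graph: Dict[str, Dict[str, str]]) -> List[Triple]:
    """Flatten a graph given as {entity: {attribute: value}} into triples.

    The input layout is a hypothetical one chosen for illustration.
    """
    return [
        Triple(entity, attribute, value)
        for entity, attributes in graph.items()
        for attribute, value in attributes.items()
    ]


graph = {"China": {"population": "1.3 billion", "currency": "dollar"}}
triples = extract_triples(graph)
```

Each resulting `Triple` can then be checked independently in the later steps.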

102. Construct the query question corresponding to the triple by using a preset template.

The preset template is a template matched to the "entity-attribute-attribute value" triple. It is a sentence-pattern template constructed in advance from the logical relationship among the entity, attribute and attribute value in the triple, such as "why XXX of XXX is XXX" or "XXX of XXX is XXX". The entity, attribute and attribute value contained in the triple are then filled into the sentence-pattern template to construct the query question. For example, for the triple "China-population-1.3 billion", the corresponding query question may be "why is China's population 1.3 billion" or "China's population is actually 1.3 billion", and the like.
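
Filling a template in step 102 might look like the sketch below. The two English template strings are loose renderings of the patterns "why XXX of XXX is XXX" and "XXX of XXX is XXX"; their exact wording, and the names `TEMPLATES` and `build_queries`, are assumptions for illustration.

```python
from typing import List, Sequence

# Hypothetical sentence-pattern templates modeled on the two patterns
# mentioned in the text; real templates would be tuned per language.
TEMPLATES = (
    "why is the {attribute} of {entity} {value}",
    "the {attribute} of {entity} is {value}",
)


def build_queries(entity: str, attribute: str, value: str,
                  templates: Sequence[str] = TEMPLATES) -> List[str]:
    """Fill every template with the three elements of a triple."""
    return [
        template.format(entity=entity, attribute=attribute, value=value)
        for template in templates
    ]


queries = build_queries("China", "population", "1.3 billion")
```

Using several templates per triple gives more than one query question for the same triple, which is one way to broaden the search results that are later evaluated.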

In the embodiment of the present invention, the purpose of constructing the query question corresponding to the triple with a preset template is to obtain a sentence that contains the entity, attribute and attribute value of the triple, and to use that sentence to search through a search engine for corresponding search results on the Internet. The search engine may be Baidu Search, 360 Search, Google Search, and so on.

It should be noted that, in the search box of a search engine, entering only the entity, attribute and attribute value of a triple as keywords does not return the same search results as entering a logically structured sentence that contains them. Because the entered sentence expresses a logical relationship, the search results returned by the search engine generally reflect that relationship as well. For example, when "Zhang San's height is actually 180" is entered, the search results obtained are usually data information discussing Zhang San's height. Entering a sentence pattern in the search box therefore yields better search results than entering keywords alone.

103. Search the query question on the Internet through a search engine to obtain corresponding search results.

The search engine may be Baidu Search, 360 Search, Google Search, and so on.

In the embodiment of the invention, a plurality of search results can be obtained by entering the query question corresponding to the triple in the search box of the search engine, and each search result usually comprises a title, a summary of the question, answers from Internet users, and the like.

104. If the attribute value element contained in the triple is determined to be wrong according to the search results, delete the attribute value element.

Examples of a triple whose attribute value element is wrong include the triple "China-currency-dollar" and the triple "Liu Dehua-height-175", where 175 is wrong.

In the embodiment of the invention, since most publicly available knowledge graphs are constructed directly from structured external data sources with high credibility, data information published on the network can be used to verify whether the attribute values of a knowledge graph are wrong. After the search results corresponding to the query question are obtained, whether the attribute value contained in the triple is wrong is determined by statistical analysis of the search results. If it is wrong, the attribute value is deleted directly, so that wrong attribute values contained in triples are cleaned quickly. This way of checking the attribute values of triples finds obviously wrong attribute values simply and conveniently, so that errors are checked automatically and cleaning is completed automatically.
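
Steps 101 to 104 can be strung together as in the sketch below. It is a simplified illustration under several assumptions: the graph is a nested dictionary, the search engine is replaced by a caller-supplied function (`fake_search` returns canned text and is not a real search API), and the verdict function here merely checks whether any result mentions the value, which is cruder than the percentage test the embodiment describes.

```python
from typing import Callable, Dict, List

Graph = Dict[str, Dict[str, str]]


def clean_graph(graph: Graph,
                search: Callable[[str], List[str]],
                is_wrong: Callable[[str, str, str, List[str]], bool]) -> Graph:
    """Return a copy of the graph without attribute values judged wrong.

    `search` maps a query question to the text of its search results and
    `is_wrong` is the verdict function; both are supplied by the caller.
    """
    cleaned: Graph = {}
    for entity, attributes in graph.items():
        cleaned[entity] = {}
        for attribute, value in attributes.items():
            # Step 102: build the query question from a preset template.
            query = f"the {attribute} of {entity} is {value}"
            # Step 103: obtain the search results for the query question.
            results = search(query)
            # Step 104: keep the value only if it is not judged wrong.
            if not is_wrong(entity, attribute, value, results):
                cleaned[entity][attribute] = value
    return cleaned


def fake_search(query: str) -> List[str]:
    """Stand-in for a real search engine: canned results for illustration."""
    return ["The currency of China is the renminbi (yuan)."]


def no_result_mentions_value(entity: str, attribute: str, value: str,
                             results: List[str]) -> bool:
    """Simplified verdict: wrong if no result text contains the value."""
    return not any(value in text for text in results)


graph = {"China": {"currency": "dollar"}}
cleaned = clean_graph(graph, fake_search, no_result_mentions_value)
```

Passing the search and verdict functions in as parameters keeps the cleaning loop independent of any particular search engine or decision rule. In this sketch, deleting the attribute value is modeled by leaving the attribute out of the cleaned copy.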

The embodiment of the invention provides a method for cleaning knowledge graph attributes. A query question is constructed for a triple of a knowledge graph (namely, entity, attribute and attribute value) using a preset template, the query question is searched on the Internet through a search engine to obtain search results, and the data information of the search results that contain the triple is then used to judge automatically whether the attribute value contained in the triple is wrong, so that obviously wrong attribute values can be cleaned automatically. This solves the problem in the prior art that checking knowledge graph attributes for errors manually consumes labor and is inefficient. In the embodiment of the invention, manual operation is replaced by automatic checking, the process of checking knowledge graph attributes is optimized, checking efficiency is improved, and erroneous attribute values can be cleaned directly and automatically.

To explain the above embodiment in more detail, an embodiment of the present invention further provides another method for cleaning attributes of a knowledge graph, as shown in FIG. 2. The method searches the query questions corresponding to the triples in question-and-answer communities on the Internet using different search engines. Searching within question-and-answer communities yields a higher proportion of valid search results and reduces the number of search results obtained while still serving the purpose of the search. Whether the attribute values of the triples are wrong is then checked according to the search results fed back by the question-and-answer communities, which ultimately improves the accuracy and efficiency of the check. The specific steps are as follows:

201. When a knowledge graph is received, extract triples from the knowledge graph by parsing the knowledge graph.

The triple is an element combination consisting of an entity element, an attribute element associated with the entity element and an attribute value element corresponding to the attribute element.

For a description of this step, refer to step 101; it is not repeated here.

202. Construct the query question corresponding to the triple by using a preset template.

For a description of this step, refer to step 102; it is not repeated here.

203. Search the query question in a question-and-answer community on the Internet through a search engine to obtain corresponding search results.

The question-and-answer community refers to an interactive question-and-answer platform such as Baidu Zhidao (Baidu Knows), 360 Wenda (360 Q&A) and web knowledge.

In the embodiment of the invention, the query question can be searched on one or more question-and-answer communities through one search engine, or on one or more question-and-answer communities through a plurality of different search engines, depending on the requirements of the specific application scenario.

204. If the attribute value element contained in the triple is determined to be wrong according to the search results, delete the attribute value element.

The specific steps of determining whether the attribute value elements included in the triples are incorrect according to the search result may be as follows:

First, the total number of search results corresponding to the query question on a question-and-answer community is counted. Second, the data information corresponding to each search result is crawled, and it is judged whether the entity element, attribute element and attribute value element contained in the triple exist in that data information; if so, the search result is marked as a valid search result. Finally, the percentage of the number of valid search results to the total number is calculated, and if the percentage is judged to be smaller than a first preset threshold, the attribute value element contained in the triple is determined to be wrong.
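
The three sub-steps above can be sketched as follows. The substring test for the triple elements, the default threshold of 50 percent, and the treatment of an empty result list as 0 percent are assumptions for this example; the text does not fix a value for the first preset threshold or say how an empty result set is handled.

```python
from typing import List, Tuple


def is_valid_result(text: str, triple: Tuple[str, str, str]) -> bool:
    """A result is valid if its text contains all three triple elements."""
    return all(element in text for element in triple)


def valid_percentage(results: List[str], triple: Tuple[str, str, str]) -> float:
    """Percentage of valid search results among all search results."""
    if not results:
        return 0.0  # assumption: no results counts as zero support
    valid = sum(is_valid_result(text, triple) for text in results)
    return 100.0 * valid / len(results)


def value_is_wrong(results: List[str], triple: Tuple[str, str, str],
                   threshold: float = 50.0) -> bool:
    """Flag the attribute value when the percentage is below the threshold."""
    return valid_percentage(results, triple) < threshold


triple = ("China", "currency", "dollar")
# Hypothetical crawled text of four search results.
results = [
    "The currency of China is the renminbi.",
    "China uses the yuan as its currency.",
    "Some say the currency of China is the dollar.",
    "Hong Kong dollar and renminbi exchange rates.",
]
percentage = valid_percentage(results, triple)
wrong = value_is_wrong(results, triple)
```

Only the third result contains all three elements, so one of four results is valid and the value "dollar" falls below the assumed threshold.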

It should be noted that the above steps determine whether the attribute value element contained in the triple is wrong from the search results fed back by a single question-and-answer community only. As a supplement, in the embodiment of the present invention the query question corresponding to the triple may be searched separately in multiple question-and-answer communities to obtain richer and more diverse search results, which ultimately improves the accuracy of determining from the search results whether the attribute value of the triple is wrong. The supplementary steps are as follows:

If the query question corresponding to the triple is searched separately on multiple question-and-answer communities, the percentage of the number of valid search results to the total number is obtained for each community. These percentages are then weighted according to the weights assigned to the communities in advance, and if the weighted percentage is judged to be smaller than the first preset threshold, the attribute value element contained in the triple is determined to be wrong.
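
One plausible reading of the weighting step is a normalized weighted average, sketched below. The text only says the percentages are weighted, so the normalization by the sum of the weights, the community names, the weights, and the 50 percent threshold are all assumptions.

```python
from typing import Dict


def weighted_percentage(percentages: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of the per-community valid-result percentages."""
    total_weight = sum(weights[name] for name in percentages)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(percentages[name] * weights[name]
                       for name in percentages)
    return weighted_sum / total_weight


# Hypothetical communities, percentages, and pre-assigned weights.
percentages = {"community_a": 20.0, "community_b": 60.0}
weights = {"community_a": 3.0, "community_b": 1.0}

combined = weighted_percentage(percentages, weights)
is_wrong = combined < 50.0  # assumed first preset threshold
```

Here the more heavily weighted community pulls the combined figure down to 30 percent, so the value is flagged even though the other community alone would not have flagged it.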

It should be noted that, in the embodiment of the present invention, a weight is set for each question-and-answer community according to, for example, its visit volume and number of search results, in short, according to its popularity and credibility. The weights measure how much the percentages obtained from different question-and-answer communities are trusted, and a comprehensive check result is then produced by weighting, which amounts to synthesizing the data information fed back by several different communities and so improves the accuracy of the check. Similarly, weights corresponding to different search engines may also be preset alongside the weights for different question-and-answer communities, so that a comprehensive check result is obtained from both sets of weights, further improving the accuracy of the check.

Further, when identifying valid search results for the query question, it should be considered that the entity of a triple may have aliases, so that the triple composed of the entity, attribute and attribute value is equivalent to the triple composed of the entity alias, attribute and attribute value. For example, the triple "China-population-1.3 billion" is equivalent to the triple "People's Republic of China-population-1.3 billion". In the embodiment of the present invention, it should therefore also be checked whether the attribute and attribute value of the triple exist in the data information corresponding to a search result while the entity does not. If so, it is further judged whether a noun contained in that data information is an alias of the entity element contained in the triple, and if it is, the search result is marked as a valid search result.
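
A simplified version of the alias check is sketched below. Note the substitution: the embodiment extracts nouns from the data information and tests each for being an alias, whereas this sketch skips noun extraction and simply looks for known aliases as substrings. The alias table `ALIASES` is hypothetical; the text does not say where aliases come from.

```python
from typing import Dict, Set, Tuple

# Hypothetical alias table for illustration only.
ALIASES: Dict[str, Set[str]] = {
    "China": {"People's Republic of China", "PRC"},
}


def is_valid_with_aliases(text: str, triple: Tuple[str, str, str],
                          aliases: Dict[str, Set[str]] = ALIASES) -> bool:
    """Valid if attribute and value appear together with the entity or an alias."""
    entity, attribute, value = triple
    if attribute not in text or value not in text:
        return False
    if entity in text:
        return True
    # Entity itself is missing: fall back to its known aliases.
    return any(alias in text for alias in aliases.get(entity, set()))


triple = ("China", "population", "1.3 billion")
via_alias = is_valid_with_aliases("The population of the PRC is 1.3 billion.", triple)
no_match = is_valid_with_aliases("The population of India is 1.3 billion.", triple)
```

The first text never mentions "China" but is still counted as valid through the alias "PRC"; the second text names a different entity and is rejected.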

Further, in the embodiment of the present invention, the obtained effective search results may be filtered by the following method to reduce their number. The reduction does not affect the final check result as to whether the attribute value of the triple is wrong, but it reduces the number of search results to be compared and thus improves the efficiency of obtaining the check result. The specific method may include: acquiring the answer information contained in the data information corresponding to a search result and the approval information corresponding to the answer information; counting the total approval number corresponding to the search result according to the answer information and its approval information; and deleting the search result from the effective search results if the total approval number is smaller than a second preset threshold. It should be noted that this is equivalent to measuring the attention a search result receives by its number of answers and the number of approvals (likes) those answers received, so that only search results that are both effective and widely noticed are used to check whether the attribute value of the triple is wrong. The technical effect of the embodiment of the present invention is still achieved, while unnecessary checking cost is reduced.
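The approval-based filter can be sketched as below. The `Answer` and `SearchResult` record types and their field names are hypothetical; only the rule itself (drop a result whose total approval number is below the second preset threshold) comes from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    text: str
    approvals: int  # number of likes this answer received


@dataclass
class SearchResult:
    title: str
    answers: List[Answer] = field(default_factory=list)


def total_approvals(result: SearchResult) -> int:
    """Sum the approvals over every answer attached to one search result."""
    return sum(answer.approvals for answer in result.answers)


def filter_by_approvals(effective: List[SearchResult], second_threshold: int) -> List[SearchResult]:
    """Keep only results whose total approval number reaches the threshold."""
    return [r for r in effective if total_approvals(r) >= second_threshold]


results = [
    SearchResult("popular", [Answer("a", 40), Answer("b", 15)]),
    SearchResult("ignored", [Answer("c", 2)]),
]
print([r.title for r in filter_by_approvals(results, second_threshold=10)])  # ['popular']
```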

In the embodiment of the present invention, if the check finds that the attribute value contained in a triple is wrong, the attribute value may be deleted directly to achieve automatic cleaning. Alternatively, the wrong attribute value may be marked to indicate clearly that it is wrong, so that users are not misled while it awaits manual correction.

205. Judging whether the entity element, attribute element, and attribute value element contained in the triple exist in the title information corresponding to the effective search result.

206. If so, marking the title corresponding to the effective search result as a candidate query question.

For steps 205-206 above, in the embodiment of the present invention, in addition to constructing the query question by using the preset template, the title of an effective search result corresponding to the query question may be selected as a new query question, so as to increase the diversity of the query questions corresponding to the triple. Therefore, it is first checked whether the entity, attribute, and attribute value contained in the triple exist in the title information of the effective search result; if so, the title can be used as a candidate query question.

207. Determining the order of the candidate query questions according to the order in which the effective search results are displayed on the web page.

208. Selecting, according to the order of the candidate query questions, a preset number of them as the newly added query questions corresponding to the triple.

For steps 207-208 above, because the number of candidate query questions obtained may be large, a preset number of them is preferentially selected as the newly added query questions. The selection follows the ranking of the effective search results in a question-and-answer community; that is, the preferential screening performed in the embodiment of the present invention relies on the community's own intelligent ranking. The purpose of adding query questions for the triple is as follows: when the same triple of another knowledge graph is checked later, the search can also be performed with the newly added query questions, which increases the diversity of the search results; analyzing more diverse search results to check whether the attribute value contained in the triple is wrong ultimately improves the accuracy of the check.
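Steps 205-208 can be sketched as one small function. The sketch assumes the titles of the effective search results are already available as a list in the order the web page displayed them, and it uses plain substring matching; the function name and the sample titles are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_new_query_questions(
    titles_in_display_order: Sequence[str],
    triple: Tuple[str, str, str],
    preset_number: int,
) -> List[str]:
    """Pick the first `preset_number` titles that mention all three triple elements."""
    # Steps 205-206: a title qualifies only if entity, attribute and value all appear.
    candidates = [
        title for title in titles_in_display_order
        if all(element in title for element in triple)
    ]
    # Steps 207-208: the candidates inherit the page's display order; take the top ones.
    return candidates[:preset_number]


titles = [
    "Is the population of China 1.3 billion?",
    "Population statistics by country",
    "Why is the population of China about 1.3 billion?",
    "China population reaches 1.3 billion",
]
print(select_new_query_questions(titles, ("China", "population", "1.3 billion"), 2))
```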

Further, as an implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the present invention provides a cleaning apparatus for knowledge graph attributes. The apparatus embodiment corresponds to the method embodiment described above. For ease of reading, the details of the method embodiment are not repeated one by one in the apparatus embodiment, but it should be clear that the apparatus in this embodiment can implement all the contents of the method embodiment. Specifically, as shown in fig. 3, the apparatus includes:

the extracting unit 31 is configured to, when a knowledge graph is received, extract a triplet from the knowledge graph through parsing of the knowledge graph, where the triplet is an element combination composed of an entity element, an attribute element associated with the entity element, and an attribute value element corresponding to the attribute element;

a constructing unit 32, configured to construct, by using a preset template, the query question corresponding to the triplet extracted by the extracting unit 31;

the searching unit 33 is used for searching the query question constructed by the constructing unit 32 on the internet through a search engine to obtain a corresponding search result;

a determining unit 34, configured to determine whether the attribute value element included in the triple is incorrect according to the search result searched by the searching unit 33;

a deleting unit 35, configured to delete the attribute value element when it is determined, according to the determining unit 34, that the attribute value element included in the triple is incorrect.
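How units 31 to 35 fit together can be sketched as a small pipeline. This is an illustration under assumptions, not the patented implementation: the knowledge graph is assumed to be a nested dictionary, the sentence template is invented, and the search and decision steps are passed in as callables (`search`, `is_wrong`) so that no particular search engine API is implied. A deleted attribute value is represented as `None`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]

# Hypothetical template; the description only states that a preset sentence template is used.
TEMPLATE = "Is the {attribute} of {entity} {value}?"


def extract_triples(graph: Dict[str, Dict[str, str]]) -> List[Triple]:
    """Extracting unit 31: flatten {entity: {attribute: value}} into triples."""
    return [(e, a, v) for e, attrs in graph.items() for a, v in attrs.items()]


def build_query(triple: Triple, template: str = TEMPLATE) -> str:
    """Constructing unit 32: fill the template with the triple's elements."""
    entity, attribute, value = triple
    return template.format(entity=entity, attribute=attribute, value=value)


def clean(
    graph: Dict[str, Dict[str, str]],
    search: Callable[[str], List[str]],
    is_wrong: Callable[[Triple, List[str]], bool],
) -> List[Triple]:
    """Run every triple through search and decision; blank out wrong values."""
    cleaned: List[Triple] = []
    for triple in extract_triples(graph):
        results = search(build_query(triple))           # searching unit 33
        if is_wrong(triple, results):                   # determining unit 34
            entity, attribute, _ = triple
            cleaned.append((entity, attribute, None))   # deleting unit 35
        else:
            cleaned.append(triple)
    return cleaned


print(build_query(("China", "capital", "Beijing")))  # Is the capital of China Beijing?
```

Injecting `search` and `is_wrong` keeps the sketch testable offline; marking a wrong value instead of deleting it would only change the branch that appends `None`.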

Further, as shown in fig. 4, the determining unit 34 includes:

a counting module 341, configured to count the total number of search results corresponding to the query question;

the crawling module 342 is configured to crawl data information corresponding to the search result;

a judging module 343, configured to judge whether the entity element, attribute element, and attribute value element included in the triple exist in the data information corresponding to the search result crawled by the crawling module 342;

a marking module 344, configured to mark the search result as an effective search result when the judging module 343 judges that the entity element, the attribute element, and the attribute value element included in the triple exist in the data information corresponding to the search result;

a calculating module 345, configured to calculate the percentage of the number of valid search results marked by the marking module 344 to the total number counted by the counting module 341;

a determining module 346, configured to determine whether the attribute value element included in the triple is incorrect according to the percentage of the number of valid search results to the total number calculated by the calculating module 345.

Further, as shown in fig. 4, if one search engine is used to search for the query question corresponding to the triple, the determining module 346 further includes:

an obtaining sub-module 3461, configured to obtain a percentage of the number of valid search results corresponding to the search engine to the total number;

a judging sub-module 3462, configured to judge whether the percentage of the number of valid search results to the total number obtained by the obtaining sub-module 3461 is smaller than a first preset threshold;

a determining sub-module 3463, configured to determine that the attribute value element included in the triple is incorrect when the judging sub-module 3462 judges that the percentage is smaller than the first preset threshold.
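For the single-search-engine case, the work of modules 341 to 346 reduces to counting, matching, and one comparison, as sketched below. The sketch assumes the crawled data information of each search result is already available as a string and uses plain substring matching. Treating an empty result list as a percentage of zero is an assumption; the description does not cover that case.

```python
from typing import Sequence, Tuple

Triple = Tuple[str, str, str]


def is_valid(page_text: str, triple: Triple) -> bool:
    """Judging/marking: valid if entity, attribute and value all occur in the text."""
    return all(element in page_text for element in triple)


def valid_percentage(pages: Sequence[str], triple: Triple) -> float:
    """Counting and calculating: share of valid results among all results."""
    total = len(pages)
    if total == 0:
        return 0.0  # assumption: no results at all is treated as no support
    valid = sum(1 for text in pages if is_valid(text, triple))
    return valid / total


def attribute_value_is_wrong(pages: Sequence[str], triple: Triple, first_threshold: float) -> bool:
    """Determining: the value is judged wrong when support falls below the threshold."""
    return valid_percentage(pages, triple) < first_threshold


pages = [
    "The population of China is 1.3 billion.",
    "China travel guide",
    "Population of India",
    "Largest countries by area",
]
triple = ("China", "population", "1.3 billion")
print(valid_percentage(pages, triple))               # 0.25
print(attribute_value_is_wrong(pages, triple, 0.5))  # True
```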

Further, as shown in fig. 4, if a plurality of search engines respectively search for the query question corresponding to the triple, the determining module 346 further includes:

the obtaining sub-module 3461 is further configured to obtain the percentage of the number of valid search results corresponding to the plurality of search engines to the total number;

a processing submodule 3464, configured to perform weighting processing on percentages, which correspond to the plurality of search engines, acquired by the acquisition submodule 3461 according to weights pre-assigned to the plurality of search engines, so as to obtain weighted percentages;

the judging sub-module 3462 is further configured to judge whether the weighted percentage obtained by the processing sub-module 3464 is smaller than a first preset threshold;

the determining sub-module 3463 is further configured to determine that the attribute value element included in the triple is incorrect when the judging sub-module 3462 judges that the weighted percentage is smaller than the first preset threshold.

Further, as shown in fig. 4, the determining unit 34 further includes:

an extracting module 347, configured to, before the percentage of the number of valid search results to the total number is calculated, extract the nouns contained in the data information corresponding to the search result if the attribute element and the attribute value element included in the triple are found in that data information but the entity element included in the triple is not;

the judging module 343 is further configured to judge whether the noun extracted by the extracting module 347 is an alias of the entity element included in the triple;

the marking module 344 is further configured to mark the search result as a valid search result when the judging module 343 judges that the noun is an alias of the entity element included in the triple.

Further, as shown in fig. 4, the determining unit 34 further includes:

an obtaining module 348, configured to obtain answer information included in data information corresponding to the search result and approval information corresponding to the answer information after the search result is marked as a valid search result;

the counting module 341 is further configured to count the total approval number corresponding to the search result according to the answer information and the approval information corresponding to the answer information;

a deleting module 349, configured to delete the search result from the valid search results if the total approval number counted by the counting module 341 is smaller than a second preset threshold.

Further, as shown in fig. 4, the apparatus further includes:

a judging unit 36, configured to judge whether the entity element, attribute element, and attribute value element included in the triple exist in the title information corresponding to the effective search result;

a marking unit 37, configured to mark the title corresponding to the effective search result as a candidate query question when the judging unit 36 judges that the title information corresponding to the effective search result includes the entity element, the attribute element, and the attribute value element included in the triple;

the determining unit 34 is further configured to determine the order of the candidate query questions according to the order in which the effective search results are displayed on the web page;

the determining unit 34 is further configured to select, according to the order of the candidate query questions, a preset number of them as the newly added query questions corresponding to the triple.

Since the cleaning device for the attribute of the knowledge graph described in this embodiment is a device that can perform the cleaning method for the attribute of the knowledge graph in the embodiment of the present invention, based on the cleaning method for the attribute of the knowledge graph described in the embodiment of the present invention, a person skilled in the art can understand the specific implementation manner of the cleaning device for the attribute of the knowledge graph in this embodiment and various variations thereof, and therefore, how to implement the cleaning method for the attribute of the knowledge graph by the cleaning device for the attribute of the knowledge graph in the embodiment of the present invention is not described in detail here. The scope of the present application is intended to cover any apparatus used by those skilled in the art to implement the method for cleaning knowledge-graph attributes in the embodiments of the present invention.

An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including: at least one processor 41; and at least one memory 42 and a bus 43 connected to the processor 41; wherein

the processor 41 and the memory 42 communicate with each other through the bus 43;

the processor 41 is configured to call program instructions in the memory 42 to perform the steps in the above-described method embodiments.

Embodiments of the present invention also provide a non-transitory computer-readable storage medium, which stores computer instructions, where the computer instructions cause the computer to execute the methods provided by the above method embodiments.

In summary, the embodiments of the present invention provide a method and an apparatus for cleaning the attributes of a knowledge graph. The embodiment of the present invention constructs a query question for a triple (namely, an entity, an attribute, and an attribute value) of the knowledge graph by using a preset template, searches the query question on the Internet through a search engine to obtain search results, and then automatically judges whether the attribute value contained in the triple is wrong by using the data information of the search results that contain the triple, so that obviously wrong attribute values can be cleaned automatically. This solves the problem in the prior art that manually checking knowledge graph attributes for errors consumes labor and is inefficient. By replacing manual operation with automatic checking, the embodiment of the present invention optimizes the checking process for knowledge graph attributes, improves checking efficiency, and allows wrong attribute values to be cleaned directly and automatically. In addition, after an attribute of one knowledge graph has been checked, query questions corresponding to the triple can be added according to the search results of that check. When the same triple of another knowledge graph is checked later, the search can also be performed with the newly added query questions, which increases the diversity of the search results; analyzing more diverse search results to check whether the attribute value contained in the triple is wrong ultimately improves the accuracy of the check.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for cleaning knowledge-graph attributes, the method comprising:

constructing a query question corresponding to the triple by using a preset template, wherein the query question is a sentence pattern with logic and containing entity elements, attribute elements and attribute values of the triple, and is used for searching by using the query question to obtain a search result with the logic sentence pattern, and the preset template is a sentence pattern template constructed in advance based on the logic relation among the entities, the attributes and the attribute values in the triple, and is used for adding the entities, the attributes and the attribute values contained in the triple to the sentence pattern template to directly construct the query question;

if the attribute value elements contained in the triples are determined to be wrong according to the search result, deleting the attribute value elements;

wherein the determining that the attribute value element included in the triple is incorrect according to the search result includes: counting the total number of the search results corresponding to the query question; crawling data information corresponding to the search result; judging whether entity elements, attribute elements and attribute value elements contained in the triples exist in the data information corresponding to the search result; if so, marking the search result as a valid search result; calculating a percentage of the number of valid search results to the total number; determining whether the attribute value elements contained in the triples are erroneous according to the percentage of the number of the valid search results to the total number;

judging whether the entity elements, attribute elements and attribute value elements contained in the triples exist in the title information corresponding to the effective search results; if so, marking the title corresponding to the effective search result as a candidate query question; determining the order of the candidate query questions according to the order in which the effective search results are displayed on the web page; and selecting, according to the order of the candidate query questions, a preset number of them as the newly added query questions corresponding to the triples.

2. The method of claim 1, wherein if a search engine searches for the query question corresponding to the triple, the determining whether the attribute value element included in the triple is incorrect according to the percentage of the number of the valid search results to the total number comprises:

3. The method of claim 1, wherein if the query question corresponding to the triple is searched by a plurality of search engines respectively, the determining whether the attribute value element included in the triple is incorrect according to the percentage of the number of the valid search results to the total number comprises:

4. The method of claim 1, wherein prior to calculating the percentage of the number of valid search results to the total number, the method further comprises:

and if so, marking the search result as a valid search result.

5. The method of claim 1, wherein after marking the search result as a valid search result, the method further comprises:

6. The method of any of claims 1-5, wherein the searching the query question over the internet via a search engine, further comprises:

7. A cleaning device for knowledge-graph attributes, the device comprising:

the construction unit is used for constructing the query question corresponding to the triple extracted by the extraction unit by using a preset template, wherein the query question is a sentence pattern with logic and containing entity elements, attribute elements and attribute values of the triple, and is used for searching by using the query question to obtain a search result with the logic sentence pattern, and the preset template is a sentence pattern template constructed in advance based on the logic relation among the entities, the attributes and the attribute values in the triple, and is used for adding the entities, the attributes and the attribute values contained in the triple to the sentence pattern template to directly construct the query question;

a deleting unit, configured to delete the attribute value element when the determining unit determines that the attribute value element included in the triple is incorrect;

wherein the determining unit includes: a counting module, used for counting the total number of the search results corresponding to the query question; a crawling module, used for crawling data information corresponding to the search result; a judging module, used for judging whether entity elements, attribute elements and attribute value elements contained in the triples exist in the data information corresponding to the search results crawled by the crawling module; a marking module, used for marking the search result as an effective search result when the judging module judges that the entity elements, the attribute elements and the attribute value elements contained in the triples exist in the data information corresponding to the search result; a calculating module, used for calculating the percentage of the number of the effective search results marked by the marking module to the total number counted by the counting module; a determining module, configured to determine whether the attribute value elements included in the triple are incorrect according to the percentage of the number of valid search results calculated by the calculating module to the total number;

8. The apparatus of claim 7, wherein the determining module comprises:

9. The apparatus of claim 8, wherein the determining module comprises:

the obtaining submodule is further configured to obtain the percentage of the number of valid search results corresponding to each of the plurality of search engines to the total number;

the judgment submodule is also used for judging whether the weighted percentage obtained by the processing submodule is smaller than a first preset threshold value;

the determining submodule is further configured to determine that the attribute value element included in the triple is incorrect when the determining submodule determines that the percentage after the weighting processing is smaller than a first preset threshold.

10. The apparatus of claim 7, wherein the determining unit further comprises:

11. The apparatus of claim 7, wherein the determining unit further comprises:

12. An electronic device, comprising:

at least one processor;

the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method of cleaning of knowledge-graph attributes of any one of claims 1 to 6.

13. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of cleaning a knowledge-graph attribute of any one of claims 1 to 6.