CN110347798B - Knowledge graph auxiliary understanding system based on natural language generation technology

Info

Publication number: CN110347798B
Application number: CN201910629843.0A
Authority: CN
Inventors: 李劲松; 吕可伟; 尚勇; 周天舒
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2019-07-12
Filing date: 2019-07-12
Publication date: 2021-06-01
Anticipated expiration: 2039-07-12
Also published as: JP7064262B2; WO2020233261A1; JP2022510031A; CN110347798A

Abstract

The invention discloses a knowledge graph auxiliary understanding system based on natural language generation technology, comprising a knowledge graph selection module, a knowledge graph translation module and a result display module. The system uses natural language generation to convert a knowledge graph into natural language text, so that a domain expert who is unfamiliar with the source code and editing software of the knowledge graph can still gain an accurate, deep and comprehensive understanding of a knowledge graph in their field before using it. Each short sentence is linked to the corresponding source code of the knowledge graph, so redundant or incorrect information can be corrected promptly once it is found, and the approach is broadly applicable. The invention further uses visualization to speed up the domain expert's understanding of the knowledge graph.

Description

Knowledge graph auxiliary understanding system based on natural language generation technology

Technical Field

The invention relates to the technical field of knowledge graphs, in particular to a knowledge graph auxiliary understanding system based on a natural language generation technology.

Background

A knowledge graph is a semantic knowledge base that usually represents each knowledge point as a subject-predicate-object triple. Compared with an ontology, which places strict requirements on logic and semantics, a knowledge graph emphasizes weak semantics and weak logic. This has made knowledge graphs popular in both academia and industry, and large internet companies, including Google, have begun to study them to improve search quality. According to reports from 2014, Google's knowledge graph had by then gathered more than 1.6 billion facts, of which 271 million were considered to have a probability of over 90% of being true. In May 2016, the knowledge graph answered about one third of the 100 billion Google searches made that month.

Natural language generation is one of the major branches of natural language processing. Unlike natural language understanding, it focuses on how a computer expresses a given meaning or idea as natural language text. Knowledge graphs, especially those for a specific field, must be highly accurate in practical applications; for example, the quality of a medical knowledge graph directly affects the accuracy of the whole system built on it. However, knowledge graphs are written in the same languages as ontologies, mainly RDF (Resource Description Framework) and OWL (Web Ontology Language), and the software used is mainly Protégé, developed by Stanford University. These languages and tools are highly specialized, and people outside the field find it difficult to understand them without long-term study and training. In addition, the knowledge points stored in OWL and RDF are unordered, and knowledge points about the same content are stored in different parts of a file, which makes it even harder for domain experts to read the source code of a knowledge graph directly. Knowledge graphs are mostly built by computer professionals, whereas their users are scholars and experts in the field that the knowledge graph describes. Because of this mismatch, domain experts cannot understand the content of a knowledge graph and can only improve it through use, rather than reviewing and improving its content directly beforehand. This indirectly leads to unstable knowledge graph quality and to widespread redundant development of knowledge graphs covering the same content. In 2017, researchers randomly sampled 200 biomedical ontologies from the US National Center for Biomedical Ontology (NCBO) and found that the design documents of only 17 of them recorded a formal evaluation by experts.

Knowledge graphs in many fields need to be understood deeply and comprehensively by domain experts before use, so as to ensure the accuracy of what they represent in actual use. However, the languages and software associated with knowledge graphs are highly specialized and knowledge points on the same topic are scattered, so domain experts find it difficult to master and understand them in a short time. At present, most software that assists the understanding of knowledge graphs uses search to present the associations between individual knowledge nodes visually, so the knowledge presented is local and does not cover the knowledge graph as a whole. Moreover, these methods reveal problems in a knowledge graph only during use, without allowing the knowledge graph to be fully understood and evaluated before it is used.

Disclosure of Invention

In view of the insufficient quality control of knowledge graphs and the difficulty domain experts have in understanding the knowledge graphs of their own fields, the invention aims to provide a knowledge graph auxiliary understanding system based on natural language generation technology.

The invention is realized by the following technical scheme: a knowledge graph auxiliary understanding system based on natural language generation technology comprises a knowledge graph selection module, a knowledge graph translation module and a result display module;

the knowledge graph selection module is used for acquiring a target knowledge graph that conforms to the RDF or OWL grammar specification;

the knowledge graph translation module first extracts the triples of the target knowledge graph and performs string segmentation on them to obtain three dynamic arrays, namely a subject array, a predicate array and an object array, whose elements correspond one to one. The subjects, predicates and objects are then assembled into complete short sentences with the SimpleNLG tool inside nested loops. For one-to-many and many-to-many relations among subjects, predicates and objects, special characters are added to the predicate array and the object array as markers, so that each predicate can be matched to its subject and each object to its subject and predicate; the nested loops test for these special characters to determine the correspondence among subject, predicate and object, and the matched subjects, predicates and objects are assembled into complete long sentences with the SimpleNLG tool. Triples belonging to annotations do not form sentences of their own but serve as annotation information that supplements other sentences. Once the target knowledge graph has been translated into short and long sentences, the sentences are further normalized and stored in a local database (for example, a MySQL database), and the class-subclass and class-instance relations are selected from the subject, predicate and object arrays and assembled into a file in JSON format.

The result display module retrieves the translated content of the target knowledge graph (that is, the short and long sentences) from the local database and displays it together with the source file of the target knowledge graph (in RDF or OWL). It also obtains the JSON (JavaScript Object Notation) file, draws a tree diagram with a visualization tool (for example, D3), and thereby visually displays the hierarchy of classes and subclasses, and of classes and instances, in the knowledge graph.

Further, the method for acquiring the target knowledge graph by the knowledge graph selection module comprises two ways:

the first way: knowledge graphs that conform to the RDF (Resource Description Framework) or OWL (Web Ontology Language) grammar specification are crawled from an open-source knowledge graph repository (when the system is applied to the biomedical field, the National Center for Biomedical Ontology (NCBO) can be chosen as the repository), the crawled knowledge graphs are translated by the knowledge graph translation module, and the translation results are stored in the local database; when a user searches the system for a knowledge graph on a given topic, the similarity between the entered name and the English name of each knowledge graph is calculated, and the knowledge graphs are sorted in descending order of similarity to give the candidate target knowledge graphs;

the second way: the user uploads a knowledge graph that conforms to the RDF or OWL grammar specification, which is used as the target knowledge graph.

Further, in the first way of obtaining the target knowledge graph, the similarity measure is the Jaccard similarity coefficient, which is commonly used to compare the similarity and difference between finite sample sets; the larger the Jaccard coefficient, the more similar the samples.

Let the concept set of the name entered by the user be $C_1$ and the concept set of the English name of the knowledge graph be $C_2$. The Jaccard similarity coefficient $J(C_1, C_2)$ between the two is:

$$J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$$

If $C_1$ and $C_2$ are identical, then $J(C_1, C_2) = 1$. The search results are sorted by similarity and the N most similar results are presented, where N is defined by the user.
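As an illustration only, the ranking step can be sketched in Python. The sketch assumes that a concept set is obtained by lower-casing a name and splitting it into word tokens, which the text does not specify; the function names `concept_set`, `jaccard` and `rank_graphs` are hypothetical.

```python
import re
from typing import List, Set, Tuple


def concept_set(name: str) -> Set[str]:
    """Split a name into a set of lower-cased word tokens (assumed tokenisation)."""
    return {token for token in re.split(r"[^0-9a-z]+", name.lower()) if token}


def jaccard(c1: Set[str], c2: Set[str]) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|; two empty sets are scored 0.0 here."""
    union = c1 | c2
    if not union:
        return 0.0
    return len(c1 & c2) / len(union)


def rank_graphs(query: str, graph_names: List[str], n: int = 15) -> List[Tuple[str, float]]:
    """Return the n graph names most similar to the query, highest similarity first."""
    query_set = concept_set(query)
    scored = [(name, jaccard(query_set, concept_set(name))) for name in graph_names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

For example, `rank_graphs("kidney disease", ["Plant Ontology", "Kidney Disease"], n=1)` ranks "Kidney Disease" first, because its concept set is identical to that of the query.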

Further, the knowledge graph translation module extracts the triples of the target knowledge graph as follows: SPARQL (SPARQL Protocol and RDF Query Language) is used to extract the subject, predicate and object of every knowledge point in the target knowledge graph (classes, instances, object properties, data properties, annotations and so on), and these are encoded as Resource Description Framework triples (RDF triples).

Further, the knowledge graph translation module generates the short sentences of the target knowledge graph as follows: first, string segmentation is applied to the extracted triples to obtain the names of the subjects, predicates and objects, and three dynamic arrays are built from them. Because subjects, predicates and objects are in one-to-one correspondence in short sentence generation, each subject can be assembled directly with its corresponding predicate and object into a short sentence using SimpleNLG inside nested loops.
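The segmentation and short sentence steps can be sketched as follows. This is a minimal Python illustration, not the SimpleNLG-based implementation: plain string joining stands in for SimpleNLG, a single pass over the parallel arrays replaces the nested loops, and the rule that a name is the part of a URI after the last '#' or '/' is an assumption. All function names and the example URIs are hypothetical.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]


def local_name(uri: str) -> str:
    """Keep the part of a URI after the last '#' or '/' (assumed segmentation rule)."""
    return re.split(r"[#/]", uri.rstrip("#/"))[-1]


def to_words(name: str) -> str:
    """Turn 'hasSymptom' or 'Chronic_Kidney_Disease' into lower-case words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name.replace("_", " "))
    return spaced.lower()


def split_triples(triples: List[Triple]) -> Tuple[List[str], List[str], List[str]]:
    """Build three parallel arrays: index i of each array belongs to triple i."""
    subjects, predicates, objects = [], [], []
    for s, p, o in triples:
        subjects.append(to_words(local_name(s)))
        predicates.append(to_words(local_name(p)))
        objects.append(to_words(local_name(o)))
    return subjects, predicates, objects


def short_sentences(subjects: List[str], predicates: List[str], objects: List[str]) -> List[str]:
    """Assemble one short sentence per triple by simple string joining."""
    return [f"{s} {p} {o}." for s, p, o in zip(subjects, predicates, objects)]
```

A triple such as (`http://example.org/ckd#ChronicKidneyDisease`, `http://example.org/ckd#hasSymptom`, `http://example.org/ckd#Fatigue`) would yield the short sentence "chronic kidney disease has symptom fatigue."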

Further, the knowledge graph translation module generates the long sentences of the target knowledge graph as follows: first, string segmentation is applied to the extracted triples to obtain the names of the subjects, predicates and objects, and three dynamic arrays are built from them. In long sentence generation, one subject can correspond to several predicates and each predicate can correspond to several objects. Therefore, in the predicate array, the predicates belonging to different subjects are marked with a special identifier, and in the object array, the objects belonging to the different predicates of different subjects are marked with another special identifier, so that the correspondence among subjects, predicates and objects is preserved. Nested loops then test for the special identifiers, and the matched subjects, predicates and objects are assembled using SimpleNLG. Each predicate of a subject forms a sentence, all sentences of the same subject form a paragraph, and different objects are joined by conjunctions (and/or).
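One possible marker encoding is sketched below in Python. The text does not give the actual special identifiers or their placement, so the markers `#` and `|`, the rule that each marker closes a group, and the function names are all assumptions; string joining again stands in for SimpleNLG.

```python
from typing import List

SUBJECT_BREAK = "#"    # hypothetical marker: following predicates belong to the next subject
PREDICATE_BREAK = "|"  # hypothetical marker: following objects belong to the next predicate


def split_on(items: List[str], marker: str) -> List[List[str]]:
    """Split a flat array into groups at every occurrence of the marker."""
    groups: List[List[str]] = [[]]
    for item in items:
        if item == marker:
            groups.append([])
        else:
            groups[-1].append(item)
    return groups


def long_sentences(subjects: List[str], predicates: List[str], objects: List[str]) -> List[str]:
    """Build one paragraph per subject, one sentence per predicate, objects joined by 'and'."""
    predicate_groups = split_on(predicates, SUBJECT_BREAK)
    object_groups = iter(split_on(objects, PREDICATE_BREAK))
    paragraphs = []
    for subject, subject_predicates in zip(subjects, predicate_groups):
        sentences = []
        for predicate in subject_predicates:
            # The next object group belongs to this subject-predicate pair.
            joined = " and ".join(next(object_groups))
            sentences.append(f"{subject} {predicate} {joined}.")
        paragraphs.append(" ".join(sentences))
    return paragraphs
```

With subjects `["chronic kidney disease", "dialysis"]`, predicates `["has symptom", "is treated by", "#", "is a"]` and objects `["fatigue", "nausea", "|", "dialysis", "|", "treatment"]`, the sketch produces two paragraphs, the first containing two sentences about chronic kidney disease.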

Further, the knowledge graph translation module adds annotation information to the sentences of the target knowledge graph as follows: the predicate array is first traversed, and whenever a predicate is "comment" (meaning that the object is an annotation of the subject), the corresponding subject and object are extracted into a new dynamic array, the annotation array, in which elements with odd subscripts store subjects and elements with even subscripts store the corresponding annotations. The subject, predicate and object arrays are then traversed in nested loops. For each subject and object, the system checks whether it appears in the annotation array and, if so, appends its annotation in brackets after it. The predicate is then checked: if it is not "comment", the sentence is assembled; otherwise it is not.
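The annotation step can be sketched in Python as follows. The flat annotation array alternates subject and annotation as described above; because Python subscripts start at 0, the subjects sit at subscripts 0, 2, 4 and so on in this sketch. The function names are hypothetical and string joining stands in for SimpleNLG.

```python
from typing import List


def build_annotation_array(subjects: List[str], predicates: List[str], objects: List[str]) -> List[str]:
    """Collect subject/annotation pairs from 'comment' triples into one flat array."""
    annotations: List[str] = []
    for s, p, o in zip(subjects, predicates, objects):
        if p == "comment":
            annotations.extend([s, o])  # subject first, its annotation right after it
    return annotations


def annotate(term: str, annotations: List[str]) -> str:
    """Append the annotation in brackets if the term has one, else return it unchanged."""
    for i in range(0, len(annotations), 2):
        if annotations[i] == term:
            return f"{term} ({annotations[i + 1]})"
    return term


def annotated_sentences(subjects: List[str], predicates: List[str], objects: List[str]) -> List[str]:
    """Assemble sentences for non-comment triples, adding annotations in brackets."""
    annotations = build_annotation_array(subjects, predicates, objects)
    sentences = []
    for s, p, o in zip(subjects, predicates, objects):
        if p == "comment":
            continue  # annotation triples never form a sentence of their own
        sentences.append(f"{annotate(s, annotations)} {p} {annotate(o, annotations)}.")
    return sentences
```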

Further, the knowledge graph translation module inserts the short and long sentences of the target knowledge graph into the database as follows: the JDBC (Java Database Connectivity) API is used to connect to the database. A database and a data table for storing translation results are first created, with the table name, table fields and primary key defined. The English name of the knowledge graph is then matched against the names already stored in the database: if a translation result for the knowledge graph already exists in the database, no insertion is performed; otherwise, the generated short sentence array and long sentence array are added to the data table.
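The check-then-insert logic can be sketched as below. Note the substitution: this Python sketch uses the standard-library `sqlite3` module in place of JDBC and MySQL so that it runs without a database server. The table name `translation`, its columns and the function name are hypothetical.

```python
import sqlite3
from typing import List


def store_translation(conn: sqlite3.Connection, graph_name: str,
                      short_sentences: List[str], long_sentences: List[str]) -> bool:
    """Insert the sentences for graph_name unless that graph is already stored.

    Returns True if rows were inserted and False if the graph was already present.
    """
    # Create the table on first use; 'kind' is either 'short' or 'long'.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS translation ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " graph_name TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " sentence TEXT NOT NULL)"
    )
    exists = conn.execute(
        "SELECT 1 FROM translation WHERE graph_name = ? LIMIT 1", (graph_name,)
    ).fetchone()
    if exists:
        return False
    rows = [(graph_name, "short", s) for s in short_sentences]
    rows += [(graph_name, "long", s) for s in long_sentences]
    conn.executemany(
        "INSERT INTO translation (graph_name, kind, sentence) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return True
```

Calling the function twice with the same graph name inserts rows only the first time, which mirrors the rule that an existing translation result is not inserted again.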

Further, the result display module displays the translated content and the source file as follows: after a target knowledge graph is selected in the web interface, all translated content for that knowledge graph is retrieved from the database using Ajax and displayed in the interface, and the source file of the target knowledge graph is read from the local server and displayed alongside it.

Further, the result display module performs the visual display as follows: after a target knowledge graph is selected in the web interface, the corresponding JSON file is fetched from the back end using Ajax and a tree diagram is drawn; in the tree diagram, each node represents a subject or an object and is connected by lines to its associated nodes.
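The JSON file consumed by the tree diagram can be sketched as follows. The text does not give the file layout, so this Python sketch assumes the nested `name`/`children` shape commonly fed to D3 hierarchy layouts, and assumes that the relevant predicates have already been reduced to the strings `subClassOf` and `type`; the function names are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_tree(triples: List[Tuple[str, str, str]], root: str) -> Dict:
    """Nest subclass and instance relations under root as name/children nodes."""
    children: Dict[str, List[str]] = {}
    for subject, predicate, obj in triples:
        if predicate in ("subClassOf", "type"):
            # The object is the parent class; the subject is its subclass or instance.
            children.setdefault(obj, []).append(subject)

    def node(name: str, ancestors: frozenset) -> Dict:
        path = ancestors | {name}
        result: Dict = {"name": name}
        # Skip any child already on the current path so that cycles cannot recurse forever.
        kids = [node(child, path) for child in children.get(name, []) if child not in path]
        if kids:
            result["children"] = kids
        return result

    return node(root, frozenset())


def tree_to_json(triples: List[Tuple[str, str, str]], root: str) -> str:
    """Serialise the hierarchy as a JSON string for the front end."""
    return json.dumps(build_tree(triples, root), indent=2)
```

Triples with other predicates are ignored, so only the class-subclass and class-instance hierarchy reaches the tree diagram.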

The beneficial effects of the invention are as follows: natural language generation is used to convert a knowledge graph into natural language text, so that a domain expert who is unfamiliar with the source code and editing software of the knowledge graph can still gain an accurate, deep and comprehensive understanding of a knowledge graph in their field before using it. Each short sentence is linked to the corresponding source code of the knowledge graph, so redundant or incorrect information can be corrected promptly once it is found, and the approach is broadly applicable. The invention further uses visualization to speed up the domain expert's understanding of the knowledge graph.

Drawings

FIG. 1 is a block diagram of a knowledge-graph aided understanding system based on natural language generation technology according to the present invention;

FIG. 2 is a flow chart of an implementation of the knowledge-graph aided understanding system based on natural language generation technology according to the present invention;

FIG. 3 is a flow diagram of natural language generation by the knowledge-graph translation module of the present invention;

FIG. 4 is a schematic diagram of a portion of source code for a knowledge-graph;

FIG. 5 is a diagram of short sentences generated using natural language generation technology;

FIG. 6 is a diagram of long sentences generated using natural language generation technology;

FIG. 7 is a tree diagram of classes and subclasses.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples.

As shown in fig. 1 and 2, the knowledge graph aided understanding system based on the natural language generation technology provided by the invention comprises a knowledge graph selection module, a knowledge graph translation module and a result display module;

First, knowledge graph selection module

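The knowledge graph selection module is used for acquiring a target knowledge graph that conforms to the RDF or OWL grammar specification. The target knowledge graph can be obtained in two ways: it can be selected from knowledge graphs that have been crawled from an open-source knowledge graph repository and translated in advance, with candidates ranked by the similarity between the name entered by the user and the English name of each knowledge graph, or it can be uploaded by the user.

The similarity measure can be the Jaccard similarity coefficient, which is commonly used to compare the similarity and difference between finite sample sets; the larger the Jaccard coefficient, the more similar the samples. Let the concept set of the name entered by the user be $C_1$ and the concept set of the English name of the knowledge graph be $C_2$; the coefficient is then

$$J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$$

If $C_1$ and $C_2$ are identical, then $J(C_1, C_2) = 1$. The search results are sorted by similarity and the N most similar results are presented, where N is user-defined and can be set to 15.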

Second, knowledge graph translation module

As shown in FIG. 3, the specific flow is as follows. The triples of the target knowledge graph are extracted and string segmentation is performed on them to obtain three dynamic arrays, namely a subject array, a predicate array and an object array, whose elements correspond one to one. The subjects, predicates and objects are then assembled into complete short sentences with the SimpleNLG tool inside nested loops. For one-to-many and many-to-many relations among subjects, predicates and objects, special characters are added to the predicate array and the object array as markers, so that each predicate can be matched to its subject and each object to its subject and predicate; the nested loops test for these special characters to determine the correspondence among subject, predicate and object, and the matched subjects, predicates and objects are assembled into complete long sentences with the SimpleNLG tool. Triples belonging to annotations do not form sentences of their own but serve as annotation information that supplements other sentences. The target knowledge graph is thus translated into short and long sentences. The generated sentences need further normalization, such as capitalizing the first letter of each sentence and adding hyperlinks to some names. The normalized sentences are inserted into a local database, and the class-subclass and class-instance relations are selected from the subject, predicate and object arrays and assembled into a file in JSON format. The local database can be a MySQL database. MySQL is a popular open-source relational database management system that stores data in separate tables rather than in one large store, which improves speed.
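The normalization step can be sketched in Python as below. The text names only capitalization and hyperlinks as examples, so the remaining details are assumptions: the sketch also adds a final full stop, renders hyperlinks as HTML anchors for display on a web page, and matches names case-insensitively. The function name and the example URL are hypothetical.

```python
import re
from typing import Dict


def normalize_sentence(sentence: str, links: Dict[str, str]) -> str:
    """Capitalise the first letter, ensure a final full stop, and link known names."""
    text = sentence.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if not text.endswith((".", "!", "?")):
        text += "."
    if links:
        # One alternation over all names, longest first, so each match is linked once.
        names = sorted(links, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(name) for name in names), re.IGNORECASE)
        urls = {name.lower(): url for name, url in links.items()}
        text = pattern.sub(
            lambda m: f'<a href="{urls[m.group(0).lower()]}">{m.group(0)}</a>', text
        )
    return text
```

Replacing all names in a single pass keeps a name that happens to occur inside an already inserted link from being wrapped a second time.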

The triples of the target knowledge graph are extracted as follows: SPARQL (SPARQL Protocol and RDF Query Language) is used to extract the subject, predicate and object of every knowledge point in the target knowledge graph (classes, instances, object properties, data properties, annotations and so on), and these are encoded as Resource Description Framework triples (RDF triples).

The short sentences of the target knowledge graph are generated as follows: first, string segmentation is applied to the extracted triples to obtain the names of the subjects, predicates and objects, and three dynamic arrays are built from them. Because subjects, predicates and objects are in one-to-one correspondence in short sentence generation, each subject can be assembled directly with its corresponding predicate and object into a short sentence using SimpleNLG inside nested loops.

The long sentences of the target knowledge graph are generated as follows: first, string segmentation is applied to the extracted triples to obtain the names of the subjects, predicates and objects, and three dynamic arrays are built from them. In long sentence generation, one subject can correspond to several predicates and each predicate can correspond to several objects. Therefore, in the predicate array, the predicates belonging to different subjects are marked with a special identifier, and in the object array, the objects belonging to the different predicates of different subjects are marked with another special identifier, so that the correspondence among subjects, predicates and objects is preserved. Nested loops then test for the special identifiers, and the matched subjects, predicates and objects are assembled using SimpleNLG. Each predicate of a subject forms a sentence, all sentences of the same subject form a paragraph, and different objects are joined by conjunctions (and/or).

Annotation information is added to the sentences of the target knowledge graph as follows: the predicate array is first traversed, and whenever a predicate is "comment" (meaning that the object is an annotation of the subject), the corresponding subject and object are extracted into a new dynamic array, the annotation array, in which elements with odd subscripts store subjects and elements with even subscripts store the corresponding annotations. The subject, predicate and object arrays are then traversed in nested loops. For each subject and object, the system checks whether it appears in the annotation array and, if so, appends its annotation in brackets after it. The predicate is then checked: if it is not "comment", the sentence is assembled; otherwise it is not.

The short and long sentences of the target knowledge graph are inserted into the database as follows: the JDBC (Java Database Connectivity) API is used to connect Java to the database. A database and a data table for storing translation results are first created, with the table name, table fields and primary key defined. The English name of the knowledge graph is then matched against the names already stored in the database: if a translation result for the knowledge graph already exists in the database, no insertion is performed; otherwise, the generated short sentence array and long sentence array are added to the data table.

Third, result display module

The result display consists of three parts. When a target knowledge graph is selected on the web page or uploaded to the website, the file or the parameters are submitted to the back end via Ajax. Once the file reaches the back end, its source code is displayed on the web page and natural language generation is carried out automatically; the generated result is inserted into the database, and the relevant content is then read from the database and displayed on the web page. Meanwhile, the system selects the class-subclass and class-instance relations from the subject, predicate and object arrays, assembles them into a JSON file, sends the file to the front end, and uses the visualization tool D3 to draw a tree diagram showing the main hierarchy of the knowledge graph. Taking a chronic kidney disease knowledge graph published by the US National Center for Biomedical Ontology as an example, the results are shown in FIGS. 4-7; FIG. 7 shows part of the tree diagram.

With the system of the invention, after a target knowledge graph is uploaded to the website or a knowledge graph in the library is selected on the website, the system automatically queries the relevant content of the knowledge graph, segments the strings, translates the RDF triples into short and long sentences, normalizes the sentence patterns, and finally presents the generated text to the domain expert, with each sentence linked to the corresponding source code of the knowledge graph. The system also presents the important class-subclass and class-instance relations in the knowledge graph as a tree diagram, helping experts quickly understand the content of the knowledge graph and carry out quality control in a short time.

The above are merely examples of the present invention and are not intended to limit its scope. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention without creative effort falls within the scope of protection of the present invention.

Claims

1. A knowledge graph auxiliary understanding system based on natural language generation technology is characterized by comprising a knowledge graph selection module, a knowledge graph translation module and a result display module;

the knowledge graph translation module: firstly, extracting the triples of a target knowledge graph, and performing string segmentation on the extracted triples to obtain three dynamic arrays, namely a subject array, a predicate array and an object array, which have one-to-one correspondence, and then assembling the subject, the predicate and the object by using the SimpleNLG tool through nested loops to form a complete short sentence; meanwhile, for one-to-many and many-to-many relations among subjects, predicates and objects, adding special characters in the predicate array and the object array for identification so as to determine that the predicate corresponds to a subject and the object corresponds to a subject and a predicate, then judging the special characters in a nested loop so as to determine the correspondence of the subject, the predicate and the object, and assembling the corresponding subject, predicate and object by using the SimpleNLG tool to form a complete long sentence; the triples corresponding to annotations are not formed into sentences independently, but are used as annotation information for supplementing other sentences; then, after the target knowledge graph has been translated into short sentences and long sentences, storing the sentences into a local database after further normalization, and selecting the class-subclass and class-instance relations from the three dynamic arrays of subjects, predicates and objects to assemble them into a JSON format file;

the result display module retrieves the translation content of the target knowledge graph from the local database, displays the translation content together with the source file of the target knowledge graph, obtains the JSON format file at the same time, draws a tree diagram through a visualization tool, and visually displays the hierarchical structure of classes and subclasses and of classes and instances in the knowledge graph.

2. The system of claim 1, wherein the knowledge-graph selection module obtains the target knowledge-graph in two ways:

the first way is as follows: crawling a knowledge graph which accords with RDF or OWL grammar specifications from an open source knowledge graph database, translating the crawled knowledge graph through a knowledge graph translation module, and storing a translation result into a local database; when the system is used for searching the knowledge graph of a certain theme, the input name and the English name of the knowledge graph are subjected to similarity calculation, and the input name and the English name are sorted from large to small according to the similarity to obtain a target knowledge graph to be selected;

3. The system of claim 2, wherein in a first approach to obtaining a target knowledge graph, the similarity determination coefficient is a Jaccard similarity coefficient;

let the concept set of the name entered by the user be $C_1$ and the concept set of the English name of the knowledge graph be $C_2$; the Jaccard similarity coefficient $J(C_1, C_2)$ between the two is:

$$J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$$

if $C_1$ and $C_2$ are identical, then $J(C_1, C_2) = 1$; and the search results are sorted according to the similarity.

4. The system of claim 1, wherein the steps of extracting the triples of the target knowledge-graph in the knowledge-graph translation module are as follows: and extracting subjects, predicates and objects corresponding to all knowledge points in the target knowledge graph by using the SPARQL, and encoding the subjects, predicates and objects into triples of a resource description framework, wherein the knowledge points comprise classes, instances, object attributes, data attributes and annotations.

5. The system of claim 1, wherein the steps of generating the short sentences of the target knowledge graph in the knowledge graph translation module are as follows: firstly, carrying out string segmentation on the obtained triples to obtain the names of the subjects, predicates and objects, and constructing three dynamic arrays; in short sentence generation, since the subjects, predicates and objects are in one-to-one correspondence, each subject can be directly assembled with its corresponding predicate and object into a short sentence using SimpleNLG through nested loops.

6. The system of claim 1, wherein the generation of the long sentences of the target knowledge graph in the knowledge graph translation module specifically comprises the following steps: firstly, carrying out string segmentation on the obtained triples to obtain the names of the subjects, predicates and objects, and constructing three dynamic arrays; in long sentence generation, considering that one subject can correspond to a plurality of predicates, and each predicate can correspond to a plurality of objects, in the predicate array, predicates corresponding to different subjects are marked by a special identifier; in the object array, objects of different predicates corresponding to different subjects are marked by another special identifier to preserve the correspondence of the subjects, predicates and objects, the special identifiers are judged by a nested loop, and the corresponding subjects, predicates and objects are assembled using SimpleNLG; wherein each predicate of the same subject forms a sentence, all sentences of the same subject form a paragraph, and different objects are connected by conjunctions.

7. The system of claim 1, wherein the steps of annotating the sentences supplemented by the target knowledge graph in the knowledge graph translation module are as follows: firstly, circulating a predicate array, if the predicate is "comment", namely, the predicate represents that the object is a comment of the subject, extracting the corresponding subject and the object to form a new dynamic array-comment array, wherein the odd subscript array elements store the subject, and the even subscript array elements store the object; and then, carrying out nested loop of the subject array, the predicate array and the object array, judging whether the subject and the object are in the annotation array, if so, adding brackets behind the subject or the object, and if the subject or the object exists, annotating the subject or the object in the brackets, then judging the predicate, and if the predicate is not "comment", assembling, otherwise, not assembling.

8. The system of claim 1, wherein the steps of inserting the short sentences and long sentences of the target knowledge graph into the database in the knowledge graph translation module are as follows: the JDBC API is used to connect to the database; firstly, a database and a data table for storing translation results are created, and the table name, table fields and primary key are defined; then the English name of the knowledge graph is matched with the names stored in the database; if the translation result of the knowledge graph exists in the local database, no insertion is performed, and if it does not exist in the local database, the generated short sentence array and long sentence array are added to the data table.

9. The system of claim 1, wherein the specific steps of displaying the translation content and the source file in the result display module are as follows: after a target knowledge graph is selected in the web interface, all translation content corresponding to the knowledge graph is retrieved from the database using Ajax and displayed in the interface, and the source file of the target knowledge graph is read from the local server and displayed in the interface together with it.

10. The system of claim 1, wherein the visual display of the result display module comprises the following steps: after a target knowledge graph is selected in the web interface, the corresponding JSON format file is obtained from the back end using Ajax, and a tree diagram is drawn; in the tree diagram, each node represents a subject or an object, and each node is connected with its associated nodes through connecting lines.