CN112015915A

CN112015915A - Knowledge base question-answering system and device based on question generation

Info

Publication number: CN112015915A
Application number: CN202010902568.8A
Authority: CN
Inventors: 车万翔; 乔振浩; 赵妍妍; 刘挺
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2020-12-01

Abstract

A knowledge base question-answering system and device based on question generation, relating to automatic question-answering systems. The invention addresses the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate a dedicated data set, which makes annotation costly, labor-intensive, and time-consuming. In the system, the template database stores the templates. The triple expansion module reads in each triple, parses it, selects all templates for its relation from the template database, and replaces the corresponding symbols in each template with the triple's entities to generate sentences. The full-text retrieval module segments the user's query text into words, converts the segmented query into Lucene's internal Query object, and retrieves a group of sentences related to the user's query as a candidate set. The semantic matching module ranks the candidate set with a semantic matching network based on the pre-trained model BERT and returns the triple with the highest score to the user as the answer. The invention is mainly used for automatic question answering.

Description

Knowledge base question-answering system and device based on question generation

Technical Field

The invention belongs to the technical field of computers, and particularly relates to an automatic question answering system.

Background

With the development of science and technology, automatic question-answering robots, systems, and voice assistants have advanced rapidly. Knowledge base question-answering systems based on knowledge graphs can answer factual questions directly and so meet people's need to acquire knowledge quickly, and they are receiving increasing attention from academia and industry.

A knowledge graph is data organized as triples, such as <yaoming> <nationality> <china>, where yaoming and china are entity 1 and entity 2 and nationality is the relation between the two entities. The input to such a question-answering system is a text query q; the system finds the triple or set of triples in the knowledge base most relevant to the query and returns the corresponding entity from those triples. Current mainstream approaches include methods based on relation classification, methods based on retrieval, and methods based on semantic parsing. A relation-classification method, for example, first predicts an entity and a relation from the question and then finds the answer entity from the two. What these methods have in common is that the prediction model must be trained on questions paired with their corresponding logical-form data. Compared with constructing a knowledge graph, annotating such dedicated data is more costly, because annotators must have certain professional knowledge, including domain knowledge and knowledge of the query language. The high cost of building the data set limits the application of knowledge-graph-based question-answering systems: when training data is lacking, the knowledge graph cannot be used effectively to build a question-answering system.
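The triple notation described above can be sketched in code. The following is a minimal illustration, assuming triples arrive as text with three angle-bracketed fields; the `Triple` type and `parse_triple` helper are hypothetical names introduced here, not part of the patent.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge graph fact: entity 1, relation, entity 2."""
    entity1: str
    relation: str
    entity2: str


# Three angle-bracketed fields; spaces inside the brackets are tolerated.
_TRIPLE_RE = re.compile(r"<\s*(.*?)\s*>\s*<\s*(.*?)\s*>\s*<\s*(.*?)\s*>")


def parse_triple(text: str) -> Triple:
    """Parse text such as '<yaoming> <nationality> <china>' into a Triple."""
    match = _TRIPLE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a triple: {text!r}")
    return Triple(*match.groups())


triple = parse_triple("< yaoming > < nationality > < china >")
print(triple.entity1, "|", triple.relation, "|", triple.entity2)
```

A named tuple keeps the three parts addressable by role, which matches how the later steps select templates by relation and substitute the two entities.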

Disclosure of Invention

The invention aims to solve the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate a dedicated data set, which makes annotation costly, labor-intensive, and time-consuming.

A knowledge base question-answering system based on question generation, comprising:

a template database: stores the templates; a template is a json-format text file, written for the usage scenario of the knowledge base, that is used to expand the semantic information of triples;

a triple expansion module: reads in the triples and parses each into the form entity 1, relation, entity 2; then selects all templates for that relation from the template database; and replaces the corresponding symbols in each template with entity 1 and entity 2 to generate sentences;

a full-text retrieval module: first segments the user's query text with a word segmentation tool, then converts the segmented query into Lucene's internal Query object using the QueryParser class; finally retrieves a group of sentences relevant to the user's query as a candidate set through the IndexSearcher interface provided by Lucene;

a semantic matching module: ranks the candidate set using a semantic matching network based on the pre-trained model BERT; the inputs to the semantic matching model are the user's query text and the retrieved candidate set; the output vector corresponding to the classification label is passed through a SoftMax layer to obtain a semantic matching score; after scores have been computed for all candidate texts, they are sorted from high to low, and the triple corresponding to the highest score is returned to the user as the answer.

Further, the full-text retrieval module performs text queries against a full-text index. When the full-text index is built, word segmentation is first performed with a word segmentation tool, and the segmented words are sent to the Lucene index to create an inverted index. To create the inverted index, a Lucene Document object is created first, each Document corresponding to one sentence generated by template expansion; a Field object is then added to the Lucene Document object; the parameters used to create the Field instance are the segmented sentence and the triple used to generate that sentence; after all processing is finished, the created index is stored.

Further, the created index is stored using an IndexWriter.

Further, the user's query text is segmented with a word segmentation tool, which is the same tool used when building the full-text index.

Further, in the json-format text file, each json key corresponds to a relation in a triple; in the json value, "alias" gives the names of the triple relations that share the template, and "templates" holds the specific templates.

A knowledge base question-answering device based on question generation, used for storing and/or running the knowledge base question-answering system based on question generation according to one of claims 1 to 6.

Advantageous effects:

The invention needs no manually annotated training data, which saves annotation cost, shortens development time, and greatly reduces workload. By writing templates and combining them with a text retrieval method and a pre-trained semantic matching model, a usable knowledge base question-answering system can be developed quickly on a given knowledge graph. This greatly reduces development cost while achieving question-answering performance comparable to that of mainstream methods, and greatly facilitates the application of knowledge base question-answering systems.

Drawings

FIG. 1 is an exemplary diagram of a knowledge base question-and-answer process based on question generation.

Detailed Description

The first embodiment is as follows. This embodiment is described in detail with reference to FIG. 1.

An important feature of the invention is that the question-answering system can be built without annotated question-answering data. The invention expands triples into question sentences with complete semantic information by means of written templates and builds a full-text index over the expanded questions. When a user submits a query, the system first performs full-text retrieval to obtain a candidate set, then ranks the candidate set with the pre-trained semantic matching module, and returns the triple with the highest score to the user as the answer. The overall process is described in detail below.

The knowledge base question-answering system based on question generation according to this embodiment includes:

S1. Template writing and triple expansion:

A template is a json-format text file, written by the developer for the usage scenario of the knowledge base, that is used to expand the semantic information of triples. An example and its description are as follows:

In the json-format text, the json key "clinical manifestation" corresponds to a relation in a triple. In the json value, "alias" gives the names of the triple relations that share the template; all triples containing those relations are expanded through this template entry. "templates" holds the specific templates.
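The template file layout described here can be illustrated with a small sketch. The json content below is hypothetical: only the key "clinical manifestation" and the field names "alias" and "templates" come from the text, while the alias being a list, the alias value, and the template wording are assumptions made for illustration. The `build_template_lookup` helper is likewise a hypothetical name.

```python
import json
from typing import Dict, List

# Hypothetical template file content; the real file is not reproduced here.
TEMPLATE_JSON = """
{
  "clinical manifestation": {
    "alias": ["common symptom"],
    "templates": [
      "what symptoms does $1 have",
      "what disease might $2 be"
    ]
  }
}
"""


def build_template_lookup(raw: str) -> Dict[str, List[str]]:
    """Map every relation name, including aliases, to its shared templates."""
    lookup: Dict[str, List[str]] = {}
    for relation, entry in json.loads(raw).items():
        # The key itself and each alias all point at the same template list.
        for name in [relation, *entry.get("alias", [])]:
            lookup[name] = entry["templates"]
    return lookup


lookup = build_template_lookup(TEMPLATE_JSON)
print(sorted(lookup))
```

Resolving aliases once at load time means the expansion step can look up templates by whatever relation name a triple carries.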

In the templates, $1 and $2 represent entity 1 and entity 2, respectively. During expansion, the program reads in the triples one by one and parses each into the form entity 1, relation, entity 2. All templates for that relation are then selected from the template database. Finally, the symbols $1 and $2 are replaced with the corresponding entities; the text that follows $1 and $2 in a template extends entity 1 and entity 2, which generates the sentences. For example, the triple <bronchial asthma> <common symptom> <dry cough> is expanded into three sentences with complete semantic information: "what symptoms does bronchial asthma have", "what disease might a dry cough be", and "what does bronchial asthma have". The expanded sentences are stored in text form.
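The substitution step can be sketched as follows. This is a minimal illustration under stated assumptions: the two templates are hypothetical wording chosen to resemble the example sentences, and `expand_triple` is a hypothetical helper name. Each generated sentence is kept together with its source triple, since later steps need the triple to produce an answer.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical templates; the wording is illustrative only.
TEMPLATES: Dict[str, List[str]] = {
    "common symptom": [
        "what symptoms does $1 have",
        "what disease might $2 be",
    ],
}


def expand_triple(triple: Triple,
                  templates: Dict[str, List[str]]) -> List[Tuple[str, Triple]]:
    """Fill every template of the triple's relation with its two entities."""
    entity1, relation, entity2 = triple
    entities = {"1": entity1, "2": entity2}
    generated = []
    for template in templates.get(relation, []):
        # Single pass over the template, so inserted entity text is not re-scanned.
        sentence = re.sub(r"\$([12])", lambda m: entities[m.group(1)], template)
        generated.append((sentence, triple))
    return generated


asthma = ("bronchial asthma", "common symptom", "dry cough")
sentences = expand_triple(asthma, TEMPLATES)
for sentence, _ in sentences:
    print(sentence)
```

A relation with no templates simply yields no sentences, so triples outside the covered relations are skipped rather than raising an error.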

S2. Full-text retrieval module: the invention implements full-text retrieval with Lucene. Lucene is an open-source library suite for full-text retrieval supported and provided by the Apache Software Foundation. Lucene offers a simple and effective set of interfaces with which developers can quickly build full-text retrieval applications.

S2.1 When the full-text index is built, word segmentation is performed first, using the LTP Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology. For example, "what disease might a dry cough be" is segmented as "dry cough/may/be/what/disease/(particle)". The segmented words are sent to the Lucene index to create an inverted index. To create the inverted index, a Lucene Document object is created first; each Document corresponds to one sentence generated by template expansion. Field objects are then added to the Lucene Document object. The parameters for creating the Field instances are the segmented sentence and the triple used to generate that sentence. If only the segmented sentence were passed in, the triple could not be located during later retrieval. After all processing is complete, the created index is stored to disk using IndexWriter.
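The indexing idea can be illustrated without Lucene. The sketch below is a pure-Python stand-in, not Lucene's API: `TinyIndex` is a hypothetical class that mimics a document with two fields (the segmented sentence and its source triple) plus an inverted index from term to document ids, and `segment` is a whitespace splitter standing in for the LTP segmenter.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def segment(sentence: str) -> List[str]:
    """Whitespace split, standing in for a real word segmentation tool."""
    return sentence.split()


class TinyIndex:
    """In-memory inverted index; each document has a sentence and a triple field."""

    def __init__(self) -> None:
        self.documents: List[Dict[str, object]] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def add_document(self, tokens: List[str], triple: Triple) -> int:
        doc_id = len(self.documents)
        # Storing the triple with the sentence lets a hit be traced to its answer.
        self.documents.append({"sentence": tokens, "triple": triple})
        for token in tokens:
            self.postings[token].add(doc_id)
        return doc_id


asthma = ("bronchial asthma", "common symptom", "dry cough")
index = TinyIndex()
index.add_document(segment("what symptoms does bronchial asthma have"), asthma)
index.add_document(segment("what disease might dry cough be"), asthma)
print(sorted(index.postings["what"]))
```

The second field is the point of the design: a search hit on the sentence field alone would return text, whereas the stored triple is what the system ultimately hands back.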

S2.2 When a user submits a query, the query text is first segmented into terms with the same segmentation tool. The segmented query is then converted into Lucene's internal Query object using the QueryParser class. Finally, a group of sentences relevant to the user's query is retrieved as the candidate set through the IndexSearcher interface provided by Lucene.
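Candidate retrieval can be sketched in the same pure-Python style. This is an assumption-laden stand-in rather than Lucene: candidates are ranked by how many distinct query terms they share with the query, which is far simpler than Lucene's own relevance scoring, and `segment`, `build_postings`, and `retrieve` are hypothetical helper names. The query and the index use the same segmenter, as the text requires.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Doc = Tuple[List[str], Triple]


def segment(text: str) -> List[str]:
    """Whitespace split, standing in for the shared word segmentation tool."""
    return text.lower().split()


def build_postings(docs: List[Doc]) -> Dict[str, Set[int]]:
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, (tokens, _) in enumerate(docs):
        for token in tokens:
            postings[token].add(doc_id)
    return postings


def retrieve(query: str, docs: List[Doc],
             postings: Dict[str, Set[int]], top_k: int = 2) -> List[Doc]:
    """Return up to top_k documents sharing the most distinct terms with the query."""
    hits: Counter = Counter()
    for term in set(segment(query)):
        for doc_id in postings.get(term, ()):
            hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [docs[doc_id] for doc_id, _ in ranked[:top_k]]


asthma = ("bronchial asthma", "common symptom", "dry cough")
flu = ("influenza", "common symptom", "fever")
docs = [
    (segment("what symptoms does bronchial asthma have"), asthma),
    (segment("what disease might dry cough be"), asthma),
    (segment("what disease might fever be"), flu),
]
candidates = retrieve("what disease might dry cough indicate", docs,
                      build_postings(docs))
for tokens, triple in candidates:
    print(" ".join(tokens), "->", triple)
```

Using one segmenter for both indexing and querying matters because an inverted index can only match terms that were tokenized identically on both sides.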

S3. Semantic matching module: in addition to retrieval based on the full-text index, and in order to further improve question-answering performance and provide deep semantic matching capability, the invention ranks the candidate set with a semantic matching network based on the pre-trained model BERT. BERT is a deep language model proposed by Google that achieved the best performance on 11 natural language processing tasks when it was first released. BERT uses a multi-layer stacked Transformer network; the model's input is a sentence or a pair of sentences, and its output is a vector representation for each token of the input text, with a vector dimension of 768. In the semantic matching model adopted by the invention, the inputs are the user's query text and the candidate set retrieved in S2.2. For example, if the user asks "what disease might a dry cough be" and one result in the candidate set is "what symptoms does bronchial asthma have", the model input is "[CLS] what disease might a dry cough be [SEP] what symptoms does bronchial asthma have [SEP]", where the leading [CLS] label is the classification label; the output vector corresponding to this label is passed through the SoftMax layer to obtain the semantic matching score. After scores have been computed for all candidate texts through the semantic matching model, they are sorted from high to low, and the triple corresponding to the highest score is returned to the user as the answer.
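The ranking step can be sketched as follows, with the neural network replaced by a stub. Everything named here is hypothetical: `toy_logits` is a word-overlap stand-in for the BERT-based classifier, and treating index 1 of a two-class output as the "match" class is an assumption. Only the pair layout, the softmax, and the sort from high to low reflect the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def format_pair(query: str, candidate: str) -> str:
    """Lay out a sentence pair in the [CLS] ... [SEP] ... [SEP] form."""
    return f"[CLS] {query} [SEP] {candidate} [SEP]"


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(pair_text: str) -> List[float]:
    """Stub classifier: the 'match' logit is the word overlap of the two sentences."""
    body = pair_text[len("[CLS] "):-len(" [SEP]")]
    left, right = body.split(" [SEP] ")
    overlap = len(set(left.split()) & set(right.split()))
    return [0.0, float(overlap)]


def rank(query: str, candidates: List[Tuple[str, Triple]],
         classify: Callable[[str], Sequence[float]]) -> List[Tuple[float, str, Triple]]:
    """Score every candidate and sort from the highest score to the lowest."""
    scored = [(softmax(classify(format_pair(query, sentence)))[1], sentence, triple)
              for sentence, triple in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


asthma = ("bronchial asthma", "common symptom", "dry cough")
flu = ("influenza", "common symptom", "fever")
candidates = [
    ("what symptoms does bronchial asthma have", asthma),
    ("what disease might dry cough be", asthma),
    ("what disease might fever be", flu),
]
ranking = rank("what disease might dry cough be", candidates, toy_logits)
best_score, best_sentence, best_triple = ranking[0]
print(best_sentence, "->", best_triple)
```

In a real system the `classify` argument would wrap the fine-tuned matching network; keeping it as a parameter lets the ranking logic be exercised without loading any model.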

The candidate set retrieved by the full-text retrieval module contains the generated sentences together with the triples used to generate them. The semantic matching module finds the semantically most relevant sentence (the one with the highest matching score) in the candidate set and takes the triple corresponding to that sentence as the answer. Both modules perform "retrieval"; they differ only in the "features" they use. The reason for this design is that semantic matching is effective but computationally expensive, so the invention first screens once with full-text retrieval to reduce the amount of computation and then applies semantic matching to obtain good results.
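The cost argument behind the two-stage design can be made concrete with a small experiment. The sketch below is illustrative only: the corpus is synthetic, `CountingScorer` uses Jaccard word overlap as a stand-in for the expensive semantic matching model, and `keyword_filter` stands in for full-text retrieval. It counts how many times the expensive scorer runs with and without the cheap pre-filter.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Doc = Tuple[str, Triple]


class CountingScorer:
    """Stand-in for the expensive matcher; records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, query: str, sentence: str) -> float:
        self.calls += 1
        q, s = set(query.split()), set(sentence.split())
        return len(q & s) / len(q | s)  # Jaccard overlap


def keyword_filter(query: str, docs: List[Doc], top_k: int) -> List[Doc]:
    """Cheap first stage: keep the top_k documents by shared-term count."""
    terms = set(query.split())
    overlaps = [(len(terms & set(sentence.split())), i)
                for i, (sentence, _) in enumerate(docs)]
    matched = sorted((o for o in overlaps if o[0] > 0), key=lambda o: (-o[0], o[1]))
    return [docs[i] for _, i in matched[:top_k]]


def answer(query: str, docs: List[Doc], scorer: CountingScorer,
           top_k: Optional[int] = None) -> Triple:
    """Score all documents, or only the pre-filtered ones when top_k is given."""
    candidates = docs if top_k is None else keyword_filter(query, docs, top_k)
    best = max(candidates, key=lambda doc: scorer(query, doc[0]))
    return best[1]


# Synthetic corpus: two generated sentences for each of 200 made-up triples.
docs: List[Doc] = []
for i in range(200):
    triple = (f"disease{i}", "common symptom", f"symptom{i}")
    docs.append((f"what symptoms does disease{i} have", triple))
    docs.append((f"what disease might symptom{i} be", triple))

query = "what disease might symptom7 be"
full_scorer, fast_scorer = CountingScorer(), CountingScorer()
slow_answer = answer(query, docs, full_scorer)
fast_answer = answer(query, docs, fast_scorer, top_k=5)
print(full_scorer.calls, fast_scorer.calls, slow_answer == fast_answer)
```

On this toy corpus both paths return the same triple, while the filtered path invokes the scorer only for the shortlisted candidates; whether the shortlist always contains the best match in practice depends on the quality of the first-stage retrieval.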

The second embodiment is as follows:

The knowledge base question-answering device based on question generation of this embodiment is used for storing and/or running the knowledge base question-answering system based on question generation.

The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims

1. A knowledge base question-answering system based on question generation, comprising:

2. The knowledge base question-answering system based on question generation according to claim 1, wherein the full-text retrieval module performs text queries against a full-text index; when the full-text index is built, word segmentation is first performed with a word segmentation tool, and the segmented words are sent to the Lucene index to create an inverted index; when the inverted index is created, a Lucene Document object is created first, the Document object corresponding to a sentence generated by template expansion; a Field object is then added to the Lucene Document object; the parameters for creating the Field instance are the segmented sentence and the triple used to generate the sentence; and after all processing is finished, the created index is stored.

3. The knowledge base question-answering system based on question generation according to claim 2, wherein the created index is stored using an IndexWriter.

4. The knowledge base question-answering system based on question generation according to claim 1, 2 or 3, wherein the user's query text is segmented with a word segmentation tool that is the same as the word segmentation tool used when building the full-text index.

5. The knowledge base question-answering system based on question generation according to claim 4, wherein the word segmentation tool is the LTP Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology.

6. The knowledge base question-answering system based on question generation according to claim 4, wherein in the json-format text file each json key corresponds to a relation in a triple, and in the json value, "alias" gives the names of the triple relations sharing the template and "templates" holds the specific templates.

7. A knowledge base question-answering device based on question generation, characterized in that the device is used for storing and/or running the knowledge base question-answering system based on question generation according to one of claims 1 to 6.