CN112015915A - Question-answering system and device based on knowledge base generated by questions - Google Patents

Info

Publication number
CN112015915A
Authority
CN
China
Prior art keywords
question
text
query
template
triple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010902568.8A
Other languages
Chinese (zh)
Inventor
车万翔
乔振浩
赵妍妍
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-09-01
Filing date
2020-09-01
Publication date
2020-12-01
Application filed by Harbin Institute of Technology
Priority to CN202010902568.8A
Publication of CN112015915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A knowledge base question-answering system and device based on question generation, relating to automatic question-answering systems. The invention addresses the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate dedicated data sets, which makes annotation costly, labor-intensive, and time-consuming. The template database of the system stores the templates. The triple expansion module reads in each triple, parses it, selects all templates under its relation from the template database, and replaces the corresponding symbols in each template with the entities of the triple to generate sentences. The full-text retrieval module segments the text of the user's query into words, converts the segmented query into Lucene's internal Query object, and retrieves a group of sentences related to the user's query as a candidate set. The semantic matching module ranks the candidate set with a semantic matching network based on the pre-trained model BERT and returns the triple corresponding to the highest score to the user as the answer. The invention is mainly used for automatic question answering.

Description

Question-answering system and device based on knowledge base generated by questions
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an automatic question answering system.
Background
With the development of science and technology, automatic answering robots, systems, and voice assistants have advanced rapidly. Knowledge base question-answering systems built on knowledge graphs can answer factual questions directly and so meet people's need to acquire knowledge quickly; they are receiving increasing attention from academia and industry.
A knowledge graph is data organized as triples, such as <Yao Ming> <nationality> <China>, where Yao Ming and China are entity 1 and entity 2 and nationality is the relation between the two entities. The input to such a question-answering system is a text query q; the system finds the triple or set of triples in the knowledge base most relevant to the query and returns the corresponding entities in those triples. The current mainstream methods are based on relation classification, on retrieval, or on semantic parsing. A relation-classification method, for example, first predicts an entity and a relation from the question and then finds the answer entity from the two. What these methods have in common is that the prediction model must be trained on questions paired with their corresponding logical-expression data. Compared with constructing the knowledge graph itself, annotating such dedicated data is more expensive, and annotators must have certain professional knowledge, including domain knowledge and query-language knowledge. The high cost of building the data set limits the application of knowledge-graph-based question-answering systems: when training data is lacking, the knowledge graph cannot be used effectively to build a question-answering system.
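The triple format and the relation-classification lookup described above can be sketched as follows. This is a minimal illustration only: the `Triple` type, the `KB` list, and the `answer` function are hypothetical names, and the prediction step (obtaining the entity and relation from the question) is assumed to have already happened.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One knowledge-graph fact: <entity 1> <relation> <entity 2>."""
    entity1: str
    relation: str
    entity2: str


# Tiny illustrative knowledge base built from the examples in the text.
KB = [
    Triple("Yao Ming", "nationality", "China"),
    Triple("bronchial asthma", "common symptom", "dry cough"),
]


def answer(kb: List[Triple], entity: str, relation: str) -> List[str]:
    """Return every entity 2 linked to `entity` by `relation`.

    In a relation-classification method, `entity` and `relation` would be
    predicted from the question by a trained model (not shown here).
    """
    return [t.entity2 for t in kb if t.entity1 == entity and t.relation == relation]


print(answer(KB, "Yao Ming", "nationality"))
```

The lookup itself is trivial; the expensive part in the mainstream methods is the annotated data needed to train the model that predicts the entity and relation.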
Disclosure of Invention
The invention aims to solve the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate dedicated data sets, which results in high annotation cost, heavy workload, and long time consumption.
A knowledge base question-answering system based on question generation, comprising:
a template database, which stores the templates; a template is a json-format text file, written for the usage scenario of the knowledge base, that is used to expand the semantic information of triples;
a triple expansion module, which reads in the triples and parses each into the form entity 1, relation, entity 2; then selects all templates under the relation from the template database; and replaces the corresponding symbols in each template with entity 1 and entity 2 to generate sentences;
a full-text retrieval module, which first segments the text of the user's query with a word segmentation tool, then converts the segmented query into Lucene's internal Query object using the QueryParser class, and finally retrieves a group of sentences relevant to the user's query as a candidate set through the IndexSearcher interface provided by Lucene;
a semantic matching module, which ranks the candidate set with a semantic matching network based on the pre-trained model BERT; the inputs to the semantic matching model are the text of the user's query and the retrieved candidate set; the output vector corresponding to the classification label is passed through a SoftMax layer to obtain a semantic matching score; after scores have been computed for all candidate texts, the candidates are ranked from high to low and the triple corresponding to the highest score is returned to the user as the answer.
Further, the full-text retrieval module performs text queries against a full-text index. When the full-text index is constructed, a word segmentation tool first segments the text, and the segmented words are sent to the Lucene index to create an inverted index. When creating the inverted index, a Lucene document object, Document, is created first; each Document corresponds to one sentence generated by template expansion. A Field object is then added to the Lucene document object; the parameters used to create the Field instance are the segmented sentence and the triple used to generate it. After all processing is finished, the created index is stored.
Further, the created index is stored using an IndexWriter.
Further, the word segmentation tool used to segment the text of the user's query is the same tool used when constructing the full-text index.
Further, in the json-format text file, each json key corresponds to a relation in a triple; in the json value, "alias" gives the names of the triple relations that share the template, and "templates" holds the specific templates.
A knowledge base question-answering device based on question generation, used to store and/or run the knowledge base question-answering system based on question generation according to one of claims 1 to 6.
Advantageous effects:
The invention requires no manually annotated training data, which saves annotation cost, shortens development time, and greatly reduces workload. By writing templates and combining them with text retrieval and a pre-trained semantic matching model, the invention can quickly produce a usable knowledge base question-answering system on a given knowledge graph. This greatly reduces development cost while achieving question-answering quality comparable to that of the mainstream methods, which makes knowledge base question-answering systems much easier to deploy.
Drawings
FIG. 1 is an exemplary diagram of a knowledge base question-and-answer process based on question generation.
Detailed Description
The first embodiment is as follows: this embodiment is described in detail with reference to FIG. 1.
An important feature of the invention is that a question-answering system can be built without annotated question-answer data. By writing templates, the invention expands triples into question sentences with complete semantic information and builds a full-text index over the expanded questions. When a user submits a query, the system first performs full-text retrieval to obtain a candidate set, then ranks the candidate set with a pre-trained semantic matching module and returns the triple with the highest score to the user as the answer. The overall flow is described in detail below.
The knowledge base question-answering system based on question generation according to this embodiment includes:
s1. Template writing and triple expansion:
A template is a json-format text file, written by a developer for the usage scenario of the knowledge base, that is used to expand the semantic information of triples. An example and its description follow:
[Figure BDA0002660263930000031: example template in json format; image not reproduced]
In the json-format text, the json key "clinical manifestation" corresponds to a relation in a triple. In the json value, "alias" gives the names of the triple relations that share the template; all triples containing such relations are expanded through this template entry. "templates" holds the specific templates.
In the templates, $1 and $2 represent entity 1 and entity 2, respectively. During expansion, the program reads in the triples in sequence and parses each into the form entity 1, relation, entity 2. All templates under the relation are then selected from the template database. Finally, the symbols $1 and $2 are replaced with the corresponding entities to generate sentences; the template text that follows $1 and $2 is what extends entity 1 and entity 2 into a full sentence. For example, the triple <bronchial asthma> <common symptom> <dry cough> is expanded into three sentences with complete semantic information: "What symptoms does bronchial asthma have", "What disease might dry cough be", and "What does bronchial asthma have". The expanded sentences are stored in text form.
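The expansion step can be sketched as below. Because the patent's template figure is not reproduced in this text, the JSON content here is a hypothetical reconstruction: only the key names "alias" and "templates" and the symbols $1 and $2 come from the description, while the assumption that "alias" and "templates" are lists of strings, the English template wording, and the function names `load_templates` and `expand` are illustrative.

```python
import json

# Hypothetical template file content (the original example is an image).
TEMPLATE_JSON = """
{
  "clinical manifestation": {
    "alias": ["common symptom"],
    "templates": [
      "What symptoms does $1 have",
      "What disease might $2 be"
    ]
  }
}
"""


def load_templates(text):
    """Map every relation name (json key or alias) to its template list."""
    by_relation = {}
    for key, value in json.loads(text).items():
        for name in [key, *value.get("alias", [])]:
            by_relation[name] = value["templates"]
    return by_relation


def expand(triple, by_relation):
    """Generate (sentence, triple) pairs for one (entity1, relation, entity2)."""
    entity1, relation, entity2 = triple
    generated = []
    for template in by_relation.get(relation, []):
        sentence = template.replace("$1", entity1).replace("$2", entity2)
        # Keep the source triple next to the sentence so that it can be
        # recovered after retrieval.
        generated.append((sentence, triple))
    return generated


relations = load_templates(TEMPLATE_JSON)
for sentence, _ in expand(("bronchial asthma", "common symptom", "dry cough"), relations):
    print(sentence)
```

Pairing each generated sentence with its triple at this stage mirrors the later requirement that the index store both, since the sentence alone cannot identify the answer.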
s2. Full-text retrieval module: the invention implements full-text retrieval with Lucene. Lucene is an open-source library suite for full-text retrieval supported and provided by the Apache Software Foundation. Lucene offers a simple and effective set of interfaces with which developers can quickly build full-text retrieval applications.
s2.1 When the full-text index is constructed, word segmentation is performed first, using the LTP Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology. For example, "What disease might dry cough be." is segmented as "dry cough/might/be/what/disease/.". The segmented words are then sent to the Lucene index to create an inverted index. When creating the inverted index, a Lucene document object, Document, is created first; each Document corresponds to one sentence generated by template expansion. A Field object is then added to the Lucene document object. The parameters used to create the Field instance are the segmented sentence and the triple used to generate it. If only the segmented sentence were passed in, the triple could not be located during subsequent retrieval. After all processing is complete, the created index is stored to disk using IndexWriter.
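The idea behind the index in s2.1 can be illustrated with a pure-Python stand-in. This sketch does not use Lucene's Document, Field, or IndexWriter classes; `build_index` is a hypothetical function, and the hand-written token lists stand in for the output of the word segmentation tool.

```python
from collections import defaultdict


def build_index(documents):
    """Build a toy inverted index.

    documents: list of (tokens, triple) pairs, one per generated sentence.
    Returns (postings, stored), where postings maps a term to the ids of
    the documents containing it, and stored keeps the tokens and the source
    triple for every document id.
    """
    postings = defaultdict(set)
    stored = []
    for doc_id, (tokens, triple) in enumerate(documents):
        # Storing the triple alongside the sentence is what allows the
        # answer to be located after retrieval.
        stored.append((tokens, triple))
        for term in tokens:
            postings[term].add(doc_id)
    return postings, stored


triple = ("bronchial asthma", "common symptom", "dry cough")
docs = [
    (["dry cough", "might", "be", "what", "disease"], triple),
    (["bronchial asthma", "have", "what", "symptom"], triple),
]
postings, stored = build_index(docs)
print(sorted(postings["what"]))
```

A real deployment would persist the index to disk, as the text describes; this in-memory version only shows the term-to-document mapping and the stored triple.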
s2.2 When a user submits a query, the query text is first segmented into terms with the same word segmentation tool. The segmented query is then converted into Lucene's internal Query object using the QueryParser class. Finally, a group of sentences relevant to the user's query is retrieved as the candidate set through the IndexSearcher interface provided by Lucene.
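Candidate retrieval can be sketched in the same toy setting. This is not Lucene's QueryParser or IndexSearcher: the hypothetical `retrieve` function replaces Lucene's relevance scoring with a raw count of shared terms, and the small `postings` and `stored` structures are written by hand for illustration.

```python
from collections import Counter


def retrieve(query_tokens, postings, stored, top_k=10):
    """Return up to top_k stored (sentence, triple) entries sharing terms with the query."""
    hits = Counter()
    for term in set(query_tokens):
        for doc_id in postings.get(term, ()):
            hits[doc_id] += 1
    # More shared terms first; ties broken by document id for determinism.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [stored[doc_id] for doc_id, _ in ranked[:top_k]]


triple = ("bronchial asthma", "common symptom", "dry cough")
stored = [
    ("What disease might dry cough be", triple),
    ("What symptoms does bronchial asthma have", triple),
]
postings = {
    "dry cough": {0},
    "disease": {0},
    "what": {0, 1},
    "symptom": {1},
    "bronchial asthma": {1},
}

# The query must be segmented with the same tool used at indexing time,
# otherwise its terms would not line up with the indexed terms.
candidates = retrieve(["dry cough", "be", "what", "disease"], postings, stored)
print([sentence for sentence, _ in candidates])
```

Each candidate carries its triple, so the later ranking stage can return an answer without a second lookup.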
s3. Semantic matching module: beyond full-text retrieval, to further improve question-answering quality and give the method deep semantic matching capability, the invention ranks the candidate set with a semantic matching network based on the pre-trained model BERT. BERT is a deep language model proposed by Google; when first released, it achieved the best performance on 11 natural language processing tasks. BERT uses a multi-layer stacked Transformer network; the model input is a sentence or a pair of sentences, and the output is a vector representation for each word in the input text, with vector dimension 768. In the semantic matching model adopted by the invention, the inputs are the text of the user's query and the candidate set retrieved in s2.2. For example, if the user asks "What disease might dry cough be" and one result in the candidate set is "What symptoms does bronchial asthma have", the model input is "[CLS] What disease might dry cough be [SEP] What symptoms does bronchial asthma have [SEP]", where the leading [CLS] label is the classification label; the output vector corresponding to this label is passed through the SoftMax layer to obtain the semantic matching score. After scores have been computed for all candidate texts through the semantic matching model, the candidates are ranked from high to low and the triple corresponding to the highest score is returned to the user as the answer.
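The ranking step can be sketched without loading a real model. Everything model-related here is an assumption: `toy_classifier` is a word-overlap stand-in for a fine-tuned BERT classification head, the two-class logit layout [no match, match] is assumed rather than stated in the text, and the second candidate triple is invented for illustration. In practice a BERT tokenizer would insert the special tokens itself; the string built by `build_input` only shows the layout.

```python
import math


def build_input(query, candidate):
    """Lay out a sentence pair in the [CLS] A [SEP] B [SEP] format."""
    return f"[CLS] {query} [SEP] {candidate} [SEP]"


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_classifier(text):
    """Stand-in for BERT: returns [no_match, match] logits from word overlap."""
    parts = text.replace("[CLS]", "").split("[SEP]")
    first, second = set(parts[0].split()), set(parts[1].split())
    return [0.0, float(len(first & second))]


def rank(query, candidates, classifier):
    """Score every (sentence, triple) candidate and sort from high to low."""
    scored = []
    for sentence, triple in candidates:
        match_probability = softmax(classifier(build_input(query, sentence)))[1]
        scored.append((match_probability, sentence, triple))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


candidates = [
    ("What disease might fever be", ("influenza", "common symptom", "fever")),
    ("What disease might dry cough be", ("bronchial asthma", "common symptom", "dry cough")),
]
best_score, best_sentence, best_triple = rank(
    "What disease might dry cough be", candidates, toy_classifier
)[0]
print(best_triple)
```

Swapping `toy_classifier` for a real model would leave `rank` unchanged, since it depends only on a function from input text to logits.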
The candidate set retrieved by the full-text retrieval module contains the generated sentences together with the triples used to generate them. The semantic matching module finds the semantically most relevant sentence (the one with the highest matching score) in the candidate set and takes the triple corresponding to that sentence as the answer. Both modules perform "search"; they differ only in the "features" they use. The design is motivated by the fact that semantic matching is effective but computationally expensive, so the invention first filters once with full-text retrieval to reduce the amount of computation and then applies semantic matching to obtain good results.
The second embodiment is as follows:
The knowledge base question-answering device based on question generation according to this embodiment is used to store and/or run the knowledge base question-answering system based on question generation.
The invention is capable of other embodiments, and its details can be modified in various obvious respects, all without departing from its spirit and scope.

Claims (7)

1. A knowledge base question-answering system based on question generation, comprising:
a template database, which stores the templates; a template is a json-format text file, written for the usage scenario of the knowledge base, that is used to expand the semantic information of triples;
a triple expansion module, which reads in the triples and parses each into the form entity 1, relation, entity 2; then selects all templates under the relation from the template database; and replaces the corresponding symbols in each template with entity 1 and entity 2 to generate sentences;
a full-text retrieval module, which first segments the text of the user's query with a word segmentation tool, then converts the segmented query into Lucene's internal Query object using the QueryParser class, and finally retrieves a group of sentences relevant to the user's query as a candidate set through the IndexSearcher interface provided by Lucene;
a semantic matching module, which ranks the candidate set with a semantic matching network based on the pre-trained model BERT; the inputs to the semantic matching model are the text of the user's query and the retrieved candidate set; the output vector corresponding to the classification label is passed through a SoftMax layer to obtain a semantic matching score; after scores have been computed for all candidate texts, the candidates are ranked from high to low and the triple corresponding to the highest score is returned to the user as the answer.
2. The knowledge base question-answering system based on question generation according to claim 1, wherein the full-text retrieval module performs text queries against a full-text index; when the full-text index is constructed, a word segmentation tool first segments the text, and the segmented words are sent to the Lucene index to create an inverted index; when creating the inverted index, a Lucene document object, Document, is created first, each Document corresponding to one sentence generated by template expansion; a Field object is then added to the Lucene document object; the parameters used to create the Field instance are the segmented sentence and the triple used to generate it; and after all processing is finished, the created index is stored.
3. The knowledge base question-answering system based on question generation according to claim 2, wherein the created index is stored using an IndexWriter.
4. The knowledge base question-answering system based on question generation according to claim 1, 2 or 3, wherein the word segmentation tool used to segment the text of the user's query is the same tool used when constructing the full-text index.
5. The knowledge base question-answering system based on question generation according to claim 4, wherein the word segmentation tool is the LTP Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology.
6. The knowledge base question-answering system based on question generation according to claim 4, wherein in the json-format text file each json key corresponds to a relation in a triple, and in the json value, "alias" gives the names of the triple relations that share the template and "templates" holds the specific templates.
7. A knowledge base question-answering device based on question generation, characterized in that the device is used to store and/or run the knowledge base question-answering system based on question generation according to one of claims 1 to 6.
CN202010902568.8A 2020-09-01 2020-09-01 Question-answering system and device based on knowledge base generated by questions Pending CN112015915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010902568.8A CN112015915A (en) 2020-09-01 2020-09-01 Question-answering system and device based on knowledge base generated by questions

Publications (1)

Publication Number Publication Date
CN112015915A true CN112015915A (en) 2020-12-01

Family

ID=73516499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010902568.8A Pending CN112015915A (en) 2020-09-01 2020-09-01 Question-answering system and device based on knowledge base generated by questions

Country Status (1)

Country Link
CN (1) CN112015915A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Mapping knowledge domain questioning and answering system and method based on template matching technique
WO2017076263A1 (en) * 2015-11-03 2017-05-11 中兴通讯股份有限公司 Method and device for integrating knowledge bases, knowledge base management system and storage medium
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAI JIN 等: "ComQA: Question Answering Over Knowledge Base via Semantic Matching", IEEE ACCESS, vol. 7, 23 May 2019 (2019-05-23), pages 75235 - 75246, XP011731050, DOI: 10.1109/ACCESS.2019.2918675 *
车万翔 等: "基于问题生成的知识图谱问答方法", 智能计算机与应用, vol. 10, no. 5, 1 May 2020 (2020-05-01), pages 1 - 5 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722452A (en) * 2021-07-16 2021-11-30 上海通办信息服务有限公司 Semantic-based quick knowledge hit method and device in question-answering system
CN113722452B (en) * 2021-07-16 2024-01-19 上海通办信息服务有限公司 Semantic-based rapid knowledge hit method and device in question-answering system
CN114003708A (en) * 2021-11-05 2022-02-01 中国平安人寿保险股份有限公司 Automatic question answering method and device based on artificial intelligence, storage medium and server
CN117556054A (en) * 2023-11-14 2024-02-13 哈尔滨工业大学 Knowledge graph construction method and management system based on large language model
CN117271611A (en) * 2023-11-21 2023-12-22 中国电子科技集团公司第十五研究所 Information retrieval method, device and equipment based on large model
CN117271611B (en) * 2023-11-21 2024-02-13 中国电子科技集团公司第十五研究所 Information retrieval method, device and equipment based on large model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination