CN112015915A - Question-answering system and device based on knowledge base generated by questions - Google Patents
- Publication number: CN112015915A
- Application number: CN202010902568.8A
- Authority: CN (China)
- Prior art keywords: question, text, query, template, triple
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/186—Templates
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
Abstract
A knowledge base question-answering system and device based on question generation, relating to automatic question-answering systems. The invention addresses the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate a dedicated data set, which makes annotation costly, laborious, and time-consuming. In the system, the template database stores the templates; the triple expansion module reads in a triple, parses it, selects all templates for its relation from the template library, and replaces the placeholder symbols in each template with the corresponding entities of the triple to generate sentences; the full-text retrieval module segments the user's query text into words, converts the segmented query into Lucene's internal Query object, and retrieves a set of sentences related to the user's query as a candidate set; the semantic matching module ranks the candidate set with a semantic matching network based on the pre-trained model BERT and returns the triple corresponding to the highest score to the user as the answer. The invention is mainly used for automatic question answering.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an automatic question answering system.
Background
With the development of science and technology, automatic answering robots, question-answering systems, and voice assistants have advanced rapidly. Knowledge base question-answering systems built on knowledge graphs can answer factual questions directly and so meet people's need to acquire knowledge quickly; they are receiving growing attention from both academia and industry.
A knowledge graph is data organized as triples, such as <Yao Ming> <nationality> <China>, where Yao Ming and China are entity 1 and entity 2 and nationality is the relation between the two entities. The input to such a question-answering system is a text query q; the system finds the triple or set of triples in the knowledge base most relevant to the query and returns the corresponding entity from those triples. The current mainstream methods are based on relation classification, on retrieval, or on semantic parsing. A relation-classification method, for example, first predicts an entity and a relation from the question and then finds the answer entity from the two. What these methods have in common is that the prediction model must be trained on questions paired with their corresponding logical-form data. Compared with building the knowledge graph itself, annotating such dedicated data is more expensive, because annotators need both domain knowledge and knowledge of the query language. The high cost of building the data set limits the application of knowledge-graph-based question-answering systems: without training data, a knowledge graph cannot be used effectively to build a question-answering system.
Disclosure of Invention
The invention aims to solve the problem that knowledge-graph-based question-answering methods require people with professional knowledge to annotate a dedicated data set, which makes annotation costly, laborious, and time-consuming.
A knowledge base question-answering system based on question generation, comprising:
a template database, which stores the templates; a template is a JSON-format text file, written for the usage scenario of the knowledge base, that expands the semantic information of triples;
a triple expansion module, which reads in the triples and parses each one into the form entity 1, relation, entity 2; then selects all templates for that relation from the template library; and replaces the placeholder symbols in each template with entity 1 and entity 2 of the triple to generate sentences;
a full-text retrieval module, which first segments the user's query text with a word segmentation tool, then converts the segmented query into Lucene's internal Query object using the QueryParser class, and finally retrieves a set of sentences related to the user's query as a candidate set through the IndexSearcher interface provided by Lucene;
a semantic matching module, which ranks the candidate set with a semantic matching network based on the pre-trained model BERT; the inputs to the semantic matching model are the user's query text and the retrieved candidate set; the output vector corresponding to the classification token is passed through a SoftMax layer to obtain the semantic matching score; after scores have been computed for all candidate texts, the candidates are ranked from highest to lowest score, and the triple corresponding to the highest score is returned to the user as the answer.
Further, the full-text retrieval module performs the text query against a full-text index. When the full-text index is built, the text is first segmented with a word segmentation tool, and the segmented words are sent to the Lucene indexer to create an inverted index. To create the inverted index, a Lucene Document object is created first, corresponding to one sentence generated by template expansion; a Field object is then added to the Lucene Document object, the parameters for creating the Field instance being the segmented sentence and the triple used to generate that sentence; after all sentences have been processed, the created index is stored.
Further, the created index is stored using an IndexWriter.
Further, the word segmentation tool used to segment the user's query text is the same tool used when the full-text index is built.
Further, in the JSON-format text file, a JSON key corresponds to a relation in a triple; in the JSON value field, "alias" gives the names of the triple relations that share the template, and "templates" holds the specific templates.
A question-generation-based knowledge base question-answering device, which stores and/or runs the question-generation-based knowledge base question-answering system according to one of claims 1 to 6.
Advantageous effects:
The invention needs no manually annotated training data, which saves annotation cost, shortens development time, and greatly reduces the workload. By writing templates and combining them with text retrieval and a pre-trained semantic matching model, a usable knowledge base question-answering system can be developed quickly on a given knowledge graph. This greatly lowers development cost while achieving question-answering quality comparable to that of the mainstream methods, and makes knowledge base question-answering systems much easier to apply.
Drawings
FIG. 1 is an exemplary diagram of a knowledge base question-and-answer process based on question generation.
Detailed Description
Embodiment 1: this embodiment is described with reference to FIG. 1.
An important feature of the invention is that the question-answering system can be built without annotated question-answer data. By writing templates, the invention expands triples into question sentences that carry complete semantic information and builds a full-text index over the expanded questions. When a user submits a query, the system first performs full-text retrieval to obtain a candidate set, then ranks the candidate set with a pre-trained semantic matching module, and returns the highest-scoring triple to the user as the answer. The overall process is described in detail below.
The question-generation-based knowledge base question-answering system of this embodiment comprises:
S1. Template writing and triple expansion:
A template is a JSON-format text file, written by the developer for the usage scenario of the knowledge base, that expands the semantic information of triples. An example and its description are as follows:
In the JSON text, the key "clinical manifestation" corresponds to a relation in a triple. In the value field, "alias" gives the names of the triple relations that share the template; every triple containing one of these relations is expanded with this template entry. "templates" holds the specific templates.
In the templates, $1 and $2 stand for entity 1 and entity 2, respectively. During expansion, the program reads in the triples one by one and parses each into the form entity 1, relation, entity 2. All templates for that relation are then selected from the template library. Finally, the symbols $1 and $2 are replaced with the corresponding entities, and the text following $1 and $2 is the extension applied to entity 1 and entity 2, which yields the generated sentences. For example, the triple <bronchial asthma> <common symptom> <dry cough> is expanded into three semantically complete sentences: "what symptoms does bronchial asthma have", "what disease might a dry cough be", and "what does bronchial asthma have". The expanded sentences are stored as text.
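The expansion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the template file contents, the use of a list for "alias", the template wording, and the function names `load_templates` and `expand` are all hypothetical and are not taken from the patent.

```python
import json

# Hypothetical template file: keys, aliases, and template wording are assumptions.
TEMPLATE_JSON = """
{
  "clinical manifestation": {
    "alias": ["common symptom"],
    "templates": [
      "what symptoms does $1 have",
      "what disease might $2 be"
    ]
  }
}
"""


def load_templates(text):
    """Map each relation name (the JSON key and every alias) to its templates."""
    by_relation = {}
    for key, value in json.loads(text).items():
        for name in [key, *value.get("alias", [])]:
            by_relation[name] = value["templates"]
    return by_relation


def expand(triple, by_relation):
    """Return (sentence, triple) pairs generated for one triple."""
    entity1, relation, entity2 = triple
    return [
        (template.replace("$1", entity1).replace("$2", entity2), triple)
        for template in by_relation.get(relation, [])
    ]


templates = load_templates(TEMPLATE_JSON)
triple = ("bronchial asthma", "common symptom", "dry cough")
sentences = expand(triple, templates)
```

Each generated sentence is kept together with its source triple, since the later retrieval step needs the triple to produce an answer.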
S2. Full-text retrieval module: the invention implements full-text retrieval with Lucene. Lucene is an open-source library suite for full-text retrieval, supported and provided by the Apache Software Foundation. It offers a simple and effective set of interfaces with which developers can quickly build full-text retrieval applications.
S2.1 When the full-text index is built, the text is first segmented into words using LTP, the Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. For example, "what disease might a cough be" is segmented as "cough/may/be/what/disease/。". The segmented words are sent to the Lucene indexer to create an inverted index. To create the inverted index, a Lucene Document object is created first, corresponding to one sentence generated by template expansion. A Field object is then added to the Lucene Document object. The parameters for creating the Field instance are the segmented sentence and the triple used to generate that sentence. If only the segmented sentence were passed in, the triple could not be located during later retrieval. After all sentences have been processed, the created index is written to disk using IndexWriter.
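To make the role of the inverted index and the stored triple concrete, here is a minimal pure-Python stand-in. It does not use or reproduce the Lucene API; `build_index`, the dictionary layout, and the sample token lists are assumptions made for illustration, and the tokens are written in English in place of real LTP output.

```python
from collections import defaultdict


def build_index(expanded):
    """Build a toy inverted index.

    expanded: list of (tokens, triple) pairs, one per generated sentence.
    Returns (postings, docs): postings maps a term to the ids of the
    documents that contain it; docs keeps the stored fields per document.
    """
    postings = defaultdict(set)
    docs = []
    for doc_id, (tokens, triple) in enumerate(expanded):
        # Store the triple next to the sentence so a hit can be mapped
        # back to the knowledge-base fact that produced it.
        docs.append({"sentence": tokens, "triple": triple})
        for term in tokens:
            postings[term].add(doc_id)
    return postings, docs


triple = ("bronchial asthma", "common symptom", "dry cough")
expanded = [
    (["bronchial asthma", "have", "what", "symptom"], triple),
    (["dry cough", "may", "be", "what", "disease"], triple),
]
postings, docs = build_index(expanded)
```

The `docs` list plays the part that stored fields play in a real index: without the triple kept there, a matching sentence could not be traced back to an answer.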
S2.2 When a user submits a query, the query text is first segmented into terms with the same segmentation tool. The segmented query is then converted into Lucene's internal Query object using the QueryParser class. Finally, a set of sentences related to the user's query is retrieved as the candidate set through the IndexSearcher interface provided by Lucene.
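Candidate retrieval over an inverted index can be sketched as follows. This is a simplified stand-in, not Lucene: the `search` function and its term-overlap score are assumptions chosen for brevity, and Lucene's own relevance scoring is more elaborate than a plain overlap count.

```python
from collections import Counter


def search(postings, docs, query_tokens, k=10):
    """Return up to k stored documents that share terms with the query,
    ordered by the number of distinct query terms they contain."""
    hits = Counter()
    for term in set(query_tokens):
        for doc_id in postings.get(term, ()):
            hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [docs[doc_id] for doc_id, _ in ranked[:k]]


# Tiny hand-built index; sentences are already segmented (hypothetical data).
triple = ("bronchial asthma", "common symptom", "dry cough")
docs = [
    {"sentence": ["bronchial asthma", "have", "what", "symptom"], "triple": triple},
    {"sentence": ["dry cough", "may", "be", "what", "disease"], "triple": triple},
]
postings = {}
for doc_id, doc in enumerate(docs):
    for term in doc["sentence"]:
        postings.setdefault(term, set()).add(doc_id)

query = ["dry cough", "be", "what", "disease"]
candidates = search(postings, docs, query)
```

The query must be segmented with the same tool as the indexed sentences; otherwise the query terms would not line up with the terms in the postings.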
S3. Semantic matching module: beyond the retrieval method based on the full-text index, and in order to further improve question-answering quality and give the method deep semantic matching capability, the invention ranks the candidate set with a semantic matching network based on the pre-trained model BERT. BERT is a deep language model proposed by Google which, when first released, achieved the best performance on 11 natural language processing tasks. BERT uses a multi-layer stacked Transformer network; its input is a sentence or a pair of sentences, and its output is a vector representation for each token of the input text, with vector dimension 768. The inputs to the semantic matching model used by the invention are the user's query text and the candidate set retrieved in S2.2. For example, if the user asks "what disease might a dry cough be" and one candidate is "what symptoms does bronchial asthma have", the model input is "[CLS] what disease might a dry cough be [SEP] what symptoms does bronchial asthma have [SEP]", where the leading [CLS] token is the classification token; the output vector corresponding to this token is passed through a SoftMax layer to obtain the semantic matching score. After scores have been computed for all candidate texts, the candidates are ranked from highest to lowest score, and the triple corresponding to the highest score is returned to the user as the answer.
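The scoring and ranking logic can be illustrated without loading a real model. The sketch below uses the standard BERT sentence-pair layout "[CLS] A [SEP] B [SEP]", a plain softmax, and a sort by match probability. The two-class logit layout and the `toy_classifier` (a word-overlap stub standing in for BERT plus a classification layer) are assumptions for illustration only; a real tokenizer would insert the special tokens itself.

```python
import math


def build_pair_input(query, candidate):
    """Illustrative text form of a sentence-pair input."""
    return f"[CLS] {query} [SEP] {candidate} [SEP]"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank(query, candidates, classify):
    """Score each (sentence, triple) candidate and sort best first.

    classify(text) must return [logit_no_match, logit_match]; it is a
    hypothetical stand-in for the network's classification output.
    """
    scored = []
    for sentence, triple in candidates:
        match_prob = softmax(classify(build_pair_input(query, sentence)))[1]
        scored.append((match_prob, sentence, triple))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def toy_classifier(text):
    """Stub scorer: the match logit is the number of shared words."""
    body = text.replace("[CLS] ", "").rsplit(" [SEP]", 1)[0]
    query, candidate = body.split(" [SEP] ")
    shared = set(query.split()) & set(candidate.split())
    return [0.0, float(len(shared))]


triple = ("bronchial asthma", "common symptom", "dry cough")
candidates = [
    ("what symptoms does bronchial asthma have", triple),
    ("what disease might dry cough be", triple),
]
ranking = rank("what disease might a dry cough be", candidates, toy_classifier)
answer_triple = ranking[0][2]
```

Swapping `toy_classifier` for a fine-tuned sentence-pair classifier would leave `rank` unchanged, which is the point of keeping scoring and ranking separate.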
The candidate set returned by the full-text retrieval module contains the generated sentences together with the triples used to generate them. The semantic matching module finds the semantically most relevant sentence in the candidate set (the one with the highest matching score) and returns the triple corresponding to that sentence as the answer. Both modules perform a "search"; they differ only in the "features" they use. The design is motivated by the fact that semantic matching is effective but computationally expensive: the invention first screens the sentences once by full-text retrieval to reduce the amount of computation and then applies semantic matching to the remaining candidates to obtain very good results.
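The cost argument behind this two-stage design can be shown with a small sketch in which a cheap score shortlists k sentences and an expensive score is evaluated only on that shortlist. Both scoring functions, the value of k, and the sample corpus are hypothetical; here the cheap stage scans the whole corpus for simplicity, whereas an inverted index avoids that scan.

```python
calls = {"expensive": 0}


def overlap(query_tokens, doc):
    """Cheap stage: number of shared terms."""
    return len(set(query_tokens) & set(doc["sentence"]))


def jaccard(query_tokens, doc):
    """Stand-in for the expensive semantic matcher; counts its own calls."""
    calls["expensive"] += 1
    a, b = set(query_tokens), set(doc["sentence"])
    return len(a & b) / len(a | b)


def answer(query_tokens, corpus, cheap_score, expensive_score, k=2):
    """Shortlist k documents cheaply, then rerank only the shortlist."""
    shortlist = sorted(
        corpus, key=lambda doc: cheap_score(query_tokens, doc), reverse=True
    )[:k]
    best = max(shortlist, key=lambda doc: expensive_score(query_tokens, doc))
    return best["triple"]


asthma = ("bronchial asthma", "common symptom", "dry cough")
corpus = [
    {"sentence": ["bronchial asthma", "have", "what", "symptom"], "triple": asthma},
    {"sentence": ["dry cough", "may", "be", "what", "disease"], "triple": asthma},
    {"sentence": ["Yao Ming", "nationality", "be", "what"],
     "triple": ("Yao Ming", "nationality", "China")},
]
query = ["dry cough", "may", "be", "what", "disease"]
result = answer(query, corpus, overlap, jaccard, k=2)
```

With three sentences and k=2 the expensive scorer runs twice instead of three times; on a corpus of millions of generated sentences the same shortlist size keeps the number of model evaluations fixed.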
Embodiment 2:
The question-generation-based knowledge base question-answering device of this embodiment stores and/or runs the question-generation-based knowledge base question-answering system.
The present invention may have other embodiments, and its details may be modified in various obvious respects, all without departing from the spirit and scope of the invention.
Claims (7)
1. A knowledge base question-answering system based on question generation, comprising:
a template database, which stores the templates; a template is a JSON-format text file, written for the usage scenario of the knowledge base, that expands the semantic information of triples;
a triple expansion module, which reads in the triples and parses each one into the form entity 1, relation, entity 2; then selects all templates for that relation from the template library; and replaces the placeholder symbols in each template with entity 1 and entity 2 of the triple to generate sentences;
a full-text retrieval module, which first segments the user's query text with a word segmentation tool, then converts the segmented query into Lucene's internal Query object using the QueryParser class, and finally retrieves a set of sentences related to the user's query as a candidate set through the IndexSearcher interface provided by Lucene;
a semantic matching module, which ranks the candidate set with a semantic matching network based on the pre-trained model BERT; the inputs to the semantic matching model are the user's query text and the retrieved candidate set; the output vector corresponding to the classification token is passed through a SoftMax layer to obtain the semantic matching score; after scores have been computed for all candidate texts, the candidates are ranked from highest to lowest score, and the triple corresponding to the highest score is returned to the user as the answer.
2. The question-generation-based knowledge base question-answering system according to claim 1, wherein the full-text retrieval module performs the text query against a full-text index; when the full-text index is built, the text is first segmented with a word segmentation tool, and the segmented words are sent to the Lucene indexer to create an inverted index; to create the inverted index, a Lucene Document object is created first, corresponding to one sentence generated by template expansion; a Field object is then added to the Lucene Document object, the parameters for creating the Field instance being the segmented sentence and the triple used to generate that sentence; and after all sentences have been processed, the created index is stored.
3. The question-generation-based knowledge base question-answering system according to claim 2, wherein the created index is stored using an IndexWriter.
4. The question-generation-based knowledge base question-answering system according to claim 1, 2 or 3, wherein the word segmentation tool used to segment the user's query text is the same tool used when the full-text index is built.
5. The question-generation-based knowledge base question-answering system according to claim 4, wherein the word segmentation tool is LTP, the Chinese natural language processing tool developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.
6. The question-generation-based knowledge base question-answering system according to claim 4, wherein, in the JSON-format text file, a JSON key corresponds to a relation in a triple; in the JSON value field, "alias" gives the names of the triple relations that share the template, and "templates" holds the specific templates.
7. A question-generation-based knowledge base question-answering device, characterized in that the device stores and/or runs the question-generation-based knowledge base question-answering system according to one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010902568.8A CN112015915A (en) | 2020-09-01 | 2020-09-01 | Question-answering system and device based on knowledge base generated by questions |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112015915A (en) | 2020-12-01 |
Family
ID=73516499
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010902568.8A Pending CN112015915A (en) | 2020-09-01 | 2020-09-01 | Question-answering system and device based on knowledge base generated by questions |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112015915A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868313A (en) * | 2016-03-25 | 2016-08-17 | 浙江大学 | Mapping knowledge domain questioning and answering system and method based on template matching technique |
WO2017076263A1 (en) * | 2015-11-03 | 2017-05-11 | 中兴通讯股份有限公司 | Method and device for integrating knowledge bases, knowledge base management system and storage medium |
US20190065576A1 (en) * | 2017-08-23 | 2019-02-28 | Rsvp Technologies Inc. | Single-entity-single-relation question answering systems, and methods |
CN110309267A (en) * | 2019-07-08 | 2019-10-08 | 哈尔滨工业大学 | Semantic retrieving method and system based on pre-training model |
Non-Patent Citations (2)
Title |
---|
HAI JIN et al.: "ComQA: Question Answering Over Knowledge Base via Semantic Matching", IEEE ACCESS, vol. 7, 23 May 2019 (2019-05-23), pages 75235 - 75246, XP011731050, DOI: 10.1109/ACCESS.2019.2918675 * |
CHE Wanxiang et al.: "Knowledge graph question answering method based on question generation", Intelligent Computer and Applications, vol. 10, no. 5, 1 May 2020 (2020-05-01), pages 1 - 5 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722452A (en) * | 2021-07-16 | 2021-11-30 | 上海通办信息服务有限公司 | Semantic-based quick knowledge hit method and device in question-answering system |
CN113722452B (en) * | 2021-07-16 | 2024-01-19 | 上海通办信息服务有限公司 | Semantic-based rapid knowledge hit method and device in question-answering system |
CN114003708A (en) * | 2021-11-05 | 2022-02-01 | 中国平安人寿保险股份有限公司 | Automatic question answering method and device based on artificial intelligence, storage medium and server |
CN117556054A (en) * | 2023-11-14 | 2024-02-13 | 哈尔滨工业大学 | Knowledge graph construction method and management system based on large language model |
CN117271611A (en) * | 2023-11-21 | 2023-12-22 | 中国电子科技集团公司第十五研究所 | Information retrieval method, device and equipment based on large model |
CN117271611B (en) * | 2023-11-21 | 2024-02-13 | 中国电子科技集团公司第十五研究所 | Information retrieval method, device and equipment based on large model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110442718B (en) | Statement processing method and device, server and storage medium | |
WO2018000272A1 (en) | Corpus generation device and method | |
US11210468B2 (en) | System and method for comparing plurality of documents | |
Zubrinic et al. | The automatic creation of concept maps from documents written using morphologically rich languages | |
CN112015915A (en) | Question-answering system and device based on knowledge base generated by questions | |
CN111680173A (en) | CMR model for uniformly retrieving cross-media information | |
CN106776532B (en) | Knowledge question-answering method and device | |
CN110612522B (en) | Establishment of solid model | |
JP2004110161A (en) | Text sentence comparing device | |
CN104484666A (en) | Advanced image semantic parsing method based on human-computer interaction | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
US20230205996A1 (en) | Automatic Synonyms Using Word Embedding and Word Similarity Models | |
CN110276080B (en) | Semantic processing method and system | |
CN112115252B (en) | Intelligent auxiliary writing processing method and device, electronic equipment and storage medium | |
Gygli et al. | Efficient object annotation via speaking and pointing | |
Chai | Design and implementation of English intelligent communication platform based on similarity algorithm | |
CN113190692B (en) | Self-adaptive retrieval method, system and device for knowledge graph | |
Khadija et al. | Automating information retrieval from faculty guidelines: designing a PDF-driven chatbot powered by OpenAI ChatGPT | |
Gammack et al. | Semantic knowledge management system for design documentation with heterogeneous data using machine learning | |
CN117216617A (en) | Text classification model training method, device, computer equipment and storage medium | |
CN115905677A (en) | Intelligent search system for medical field | |
CN114840657A (en) | API knowledge graph self-adaptive construction and intelligent question-answering method based on mixed mode | |
CN114118082A (en) | Resume retrieval method and device | |
Zhou et al. | Challenges and Future Development of Question Answering Systems in the Construction Industry | |
Arbizu | Extracting knowledge from documents to construct concept maps |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||