CN112100351A - Method and equipment for constructing intelligent question-answering system through question generation data set - Google Patents
- Publication number: CN112100351A (application CN202010956043.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F18/24—Classification techniques
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/295—Named entity recognition
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
A method and apparatus for constructing an intelligent question-answering system through a question generation data set: a tourism-domain knowledge graph is constructed; a natural language question posed by a user is analysed, with word segmentation and word vector training performed on the question, the jieba tool being used together with a preset tourism-domain dictionary during segmentation; entity extraction is performed on the natural language question with a BERT-BiLSTM-CRF model; the extracted entity is matched against entities in the knowledge graph: if the knowledge graph contains a matching entity, that entity is selected; if it does not, semantic similarity is computed and the closest entity is selected; the selected entity and attribute are matched against triples in the knowledge graph; and the corresponding attribute value is returned to the user as the answer to the question. The invention also provides a device, a terminal and a readable storage medium implementing the method. The invention can conveniently and accurately return the information a user requires.
Description
Technical Field
The invention belongs to the field of information query and retrieval, and in particular relates to a method and equipment for constructing an intelligent question-answering system through a question generation data set, particularly suitable for information search over a knowledge graph in the tourism domain.
Background
Since the concept of the knowledge graph was proposed in 2012, major internet companies have built next-generation intelligent search engines on it, such as the Google Knowledge Graph and the Sogou Knowledge Cube, hoping to create a brand-new mode of information retrieval and provide a new approach to the information retrieval problem. Domain knowledge graphs have likewise developed rapidly in recent years; compared with general-purpose knowledge graphs, their data are more concentrated and easier to process, ambiguity between entities is greatly reduced, and search accuracy is improved. Moreover, as knowledge gradually shifts from static to dynamic, a domain knowledge graph's information is updated faster than that of a general knowledge graph.
Because existing data sets are difficult, time-consuming and labour-intensive to construct, most natural language processing tasks use general-purpose data sets such as the NLPCC QA, WikiQA and TREC QA data sets. However, with the rise of domain knowledge graphs, more and more domain-specific methods have been proposed. The approaches commonly used today improve on manually collected data, including heuristic labelling and machine-assisted labelling; some adopt heuristic labelling based on general disambiguation rules. For example, when labelling an entity, all adjectives are deleted and only the entity head is kept (e.g. "a girl in red" becomes "girl"). Clearly, the effort of labelling large-scale data sets that cover an entire knowledge base is enormous. In the KBQA task, because relation detection is harder than entity linking, the QA model often fails to answer a question because of unseen predicates or phrases: on the one hand, QA models generally tend to assign unseen predicates a lower score; on the other hand, even if the training set contains the true predicate, the QA model struggles to answer if the corresponding paraphrase was not seen. Structurally, current knowledge-graph question-answering systems are also classified in different ways. Early template-based knowledge-graph question answering requires intent templates prepared in advance; matching can be done by similarity, or solved as a classification problem in machine learning. Although this method is fast and can answer complex questions, it requires a huge template library and is therefore very labour-intensive. The main current research direction is to update templates automatically so as to reduce the human workload.
Knowledge-graph question answering based on semantic parsing converts the natural language question, through semantic analysis, into a semantic representation the knowledge base can understand, and then reasons or queries over the knowledge in the knowledge base to obtain the final answer. In short, semantic parsing must convert the natural language question into a semantic representation the knowledge base can understand, namely a logical form. In addition, intelligent question-answering systems based on deep learning have also become popular: some build a question analysis model with a neural network based on word vectors and long short-term memory, for example introducing an LSTM model and a semantic enhancement method to improve a domain-feature knowledge graph, with good results.
A large number of experiments show that, with large data volumes, deep learning algorithms obtain better results than traditional natural language algorithms, and since deep learning does not require manual feature extraction, implementation is relatively simple. The CNN was the earliest algorithm used, mainly because its convolutional model is simple to implement and can capture positional features of the data. For time-series data, however, the LSTM is more suitable than the CNN. The LSTM is a special type of RNN that takes the temporal characteristics of the problem into account and computes the state features of the data through three gate functions, giving better results than a CNN model. The LSTM is usually combined with other models to improve experimental results: a bidirectional LSTM can be combined with a CRF model to build a knowledge-base question-answering system, attempting to exploit the combined advantages of conditional random fields and long short-term memory networks, which can also improve question-answering performance. With the release of Google's BERT model, researchers began applying BERT to a variety of natural language processing (NLP) problems, including question-answering systems based on domain knowledge graphs. BERT can be understood simply as a two-stage NLP model, with a pre-training step and a fine-tuning step. BERT is a bidirectional encoder representation from Transformers; unlike other recent language representation models, it aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers.
Therefore, the pre-trained BERT representation can be fine-tuned with one additional output layer, making it suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference.
Disclosure of Invention
The invention aims to solve the problems that data set construction in the prior art is time-consuming and labour-intensive and that tourism question-answering systems are incomplete. It provides a method and equipment for constructing an intelligent question-answering system from a question generation data set, supplements the question-answering data set with a question generation model to make up for missing tourism data, and can conveniently and accurately return the information a user requires.
To achieve this purpose, the invention adopts the following technical scheme:
a method of constructing an intelligent question-answering system from question-generating datasets, comprising the steps of:
step one, establishing a tourism-domain knowledge graph;
storing entity-related information in the knowledge graph in the triple form "entity-attribute-attribute value"; when a question is raised, first extracting the entities in the question through named entity recognition, linking them with the knowledge graph to find the corresponding entities, then matching entity attributes using the question information together with a preset rule template, and returning the corresponding attribute values;
step two, performing question analysis on the natural language question posed by the user, performing word segmentation and word vector training on the question, using the jieba tool and adding a preset tourism-domain dictionary during segmentation;
step three, performing entity extraction on the natural language question with a BERT-BiLSTM-CRF model;
step four, matching the extracted entity against entities in the knowledge graph: if the knowledge graph contains a matching entity, selecting it; if not, performing semantic similarity calculation and selecting the closest entity;
step five, matching the selected entity and attribute against triples in the knowledge graph;
and step six, returning the corresponding attribute value to the user as the answer to the question.
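The six steps above can be sketched end to end over a toy triple store. The entities, attributes and the crude fallback similarity below are illustrative stand-ins for the knowledge graph, the BERT-BiLSTM-CRF extractor and the cosine matching described in the patent, not its actual implementation.

```python
# Minimal sketch of the six-step QA pipeline over "entity-attribute-attribute value" triples.
# Data and helper functions are invented for illustration.

TRIPLES = {
    ("West Lake", "scenic spot level"): "AAAAA",
    ("West Lake", "construction time"): "ancient times",
}

def extract_entity(question, known_entities):
    """Step-three stand-in: pick the longest known entity mentioned verbatim in the question."""
    hits = [e for e in known_entities if e in question]
    return max(hits, key=len) if hits else None

def similarity(a, b):
    """Step-four fallback stand-in: crude character-overlap similarity (Jaccard on characters)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def answer(question, attribute):
    entities = {e for e, _ in TRIPLES}
    entity = extract_entity(question, entities)          # steps two-three
    if entity is None:                                   # step four: similarity fallback
        entity = max(entities, key=lambda e: similarity(question, e))
    return TRIPLES.get((entity, attribute))              # steps five-six

print(answer("What is the scenic spot level of West Lake?", "scenic spot level"))  # prints AAAAA
```

When the exact entity string is absent from the question, the similarity fallback still links it to the nearest knowledge-graph entity, mirroring the branch in step four.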
Preferably, the tourism-domain knowledge graph constructed in step one uses question generation to build the question-and-answer data set; the task of question generation is to generate the corresponding natural language question given an input answer, the reverse of question answering, which finds a suitable answer for a given question.
Preferably, in step three, a BERT pre-trained language model first encodes each character to obtain its corresponding word vector; a BiLSTM layer then encodes the input text bidirectionally; finally, the semantic vectors containing context information are fed into a CRF layer for decoding, and the CRF layer outputs the label sequence of maximum probability, giving the category of each character.
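The label sequence output by the CRF layer can be turned into entity spans with a simple decode. The BIO tag scheme and the example labels below are common practice assumed here for illustration; the patent does not specify its tag set.

```python
def bio_decode(chars, tags):
    """Collect entity spans from per-character BIO tags:
    B-X begins an entity of type X, I-X continues it, O is outside any entity."""
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities
```

For a question such as "西湖在哪里" tagged ["B-LOC", "I-LOC", "O", "O", "O"], the decode yields the entity 西湖 of type LOC, which is then linked to the knowledge graph.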
Preferably, in step four, semantic similarity is calculated with the cosine similarity algorithm:

cos(X, Y) = (Σ_{i=1}^{n} x_i · y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))

where X represents the entity extracted from the question, Y represents an entity in the knowledge graph, and n is the dimension of the word vectors; the knowledge-graph entity with the highest computed similarity is selected as the match.
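The cosine similarity above is straightforward to compute over n-dimensional word vectors; a minimal sketch:

```python
import math

def cosine_similarity(x, y):
    """cos(X, Y) = sum(x_i * y_i) / (||X|| * ||Y||) over n-dimensional word vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

In step four the extracted entity's vector X is compared against every candidate Y in the knowledge graph and the argmax is taken as the matched entity.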
Preferably, the QG and QA models are combined: with the help of a knowledge base and a text corpus, the QA and QG models are first jointly trained on a gold data set, and the QA model is then fine-tuned on a supplementary data set constructed by the QG model from the text.
Preferably, the entity-related information in step one is crawled with Python tools; it includes tourist attractions' names, introductions, construction times and scenic spot levels, and the crawled data is cleaned and organised to form the tourism knowledge graph.
The invention also provides a device for constructing an intelligent question-answering system from a question generation data set, comprising:
a tourism-domain knowledge graph construction module, which stores entity-related information in the triple form "entity-attribute-attribute value"; when a question is raised, it extracts the entities in the question through named entity recognition, links them with the knowledge graph to find the corresponding entities, matches entity attributes using the question information combined with a preset rule template, and returns the corresponding attribute values;
a natural language question analysis module, which performs word segmentation and word vector training on the natural language question, using the jieba tool and adding a preset tourism-domain dictionary during segmentation;
an entity extraction module, which first uses a BERT pre-trained language model to encode each character and obtain its corresponding word vector, then uses a BiLSTM layer to encode the input text bidirectionally, and finally feeds the semantic vectors containing context information into a CRF layer for decoding, the CRF layer outputting the label sequence of maximum probability to obtain the category of each character;
an entity matching module, which matches the extracted entity against entities in the knowledge graph: if the knowledge graph contains a matching entity, it is selected; if not, the closest entity is selected by semantic similarity;
and an answer feedback module, which matches the selected entity and attribute against triples in the knowledge graph, returns the corresponding attribute value, and provides it to the user as the answer to the question.
The invention also provides a terminal comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the computer program, the processor implements the steps of the method for constructing an intelligent question-answering system from a question generation data set.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method for constructing an intelligent question-answering system from a question generation data set.
Compared with the prior art, the invention has the following beneficial effects:
the invention searches the information based on the knowledge graph in the tourism field, can greatly reduce ambiguity of some nouns, concentrates on the inquiry in the aspect of tourism and improves the accuracy and efficiency. According to the intelligent question-answering system, the question-answering data set is supplemented by the question generation model based on the intelligent question-answering strategy constructed by the question generation, the condition of travel data loss is made up, and the intelligent question-answering system constructed by the Bert model can return information required by the user more conveniently and accurately. The invention combines the advantages of the Bert model, and constructs the Bert-BilSTM-CRF model on the basis of the Bert model, aiming at better identifying the entity. Meanwhile, the Bert model is applied to the relation extraction, and the purpose of the relation extraction is to identify the relation in the problem and obtain the attribute in the triple. The accuracy rate of text similarity calculation by using the Bert model is superior to that of word2vec models and the like, and the effect is obvious.
Drawings
FIG. 1 is a flow chart of the operation of an embodiment of the present invention to construct a question-answering system.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
A method of constructing an intelligent question-answering system from question-generating datasets, comprising the steps of:
step one, establishing a tourism-domain knowledge graph;
storing entity-related information in the knowledge graph in the triple form "entity-attribute-attribute value"; when a question is raised, first extracting the entities in the question through named entity recognition, linking them with the knowledge graph to find the corresponding entities, then matching entity attributes using the question information together with a preset rule template, and returning the corresponding attribute values;
Python tools are used to crawl the required tourism-related information, including tourist attractions' names, introductions, construction times, scenic spot levels and so on; the crawled data is cleaned and organised to form the tourism knowledge graph.
The overall construction framework of the knowledge graph consists of a logical architecture and the technical architecture adopted for construction. The logical architecture can be divided into two levels, a data layer and a schema layer. The basic unit of knowledge in the data layer is the "fact" (namely a "head entity-relation-tail entity" or "entity-attribute-attribute value" triple), while the schema layer above the data layer manages the knowledge graph through an ontology library, so the knowledge stored by the schema layer is highly condensed. Data sources in the knowledge acquisition stage mainly comprise three types: structured data (such as linked data and relational databases), semi-structured data (such as encyclopaedia and vertical-domain data) and unstructured data (namely text data). The main work of the knowledge integration stage is as follows: because knowledge is acquired through multiple channels, the acquired knowledge usually contains a large amount of redundant and erroneous information, so the results need further processing and integration, including operations such as entity linking, knowledge merging and entity alignment.
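The data-layer/schema-layer split can be sketched as triples validated against a small ontology. The entity type and allowed attributes below are invented for illustration; a real schema layer would be an ontology library rather than a dictionary.

```python
# Schema (ontology) layer: which attributes an entity type may carry.
SCHEMA = {"ScenicSpot": {"introduction", "construction time", "scenic spot level"}}

class KnowledgeGraph:
    def __init__(self, schema):
        self.schema = schema
        self.facts = []  # data layer: (entity, attribute, value) triples

    def add_fact(self, entity, etype, attribute, value):
        """Knowledge integration: reject facts the schema layer does not allow."""
        if attribute not in self.schema.get(etype, set()):
            raise ValueError(f"attribute {attribute!r} not allowed for type {etype!r}")
        self.facts.append((entity, attribute, value))

    def query(self, entity, attribute):
        """Data-layer lookup used in steps five and six."""
        return [v for e, a, v in self.facts if e == entity and a == attribute]
```

The schema check stands in for the cleaning and alignment work of the integration stage: redundant or malformed facts never reach the data layer.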
step two, performing question analysis on the natural language question posed by the user, performing word segmentation and word vector training on the question, using the jieba tool and adding a preset tourism-domain dictionary during segmentation;
step three, performing entity extraction on the natural language question with a BERT-BiLSTM-CRF model;
step four, matching the extracted entity against entities in the knowledge graph: if the knowledge graph contains a matching entity, selecting it; if not, performing semantic similarity calculation and selecting the closest entity;
step five, matching the selected entity and attribute against triples in the knowledge graph;
and step six, returning the corresponding attribute value to the user as the answer to the question.
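In practice step two would call jieba with the tourism dictionary loaded (e.g. via `jieba.load_userdict`). As a self-contained stand-in, dictionary-aided segmentation can be approximated with forward maximum matching; the mini tourism dictionary below is invented for illustration.

```python
# Invented mini tourism dictionary standing in for the preset domain dictionary.
TOURISM_DICT = {"西湖", "雷峰塔", "门票"}

def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position greedily take the longest
    dictionary word; fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                tokens.append(text[i:i + l])
                i += l
                break
    return tokens
```

Adding domain words such as scenic spot names keeps them from being split into single characters, which is exactly why the patent loads a tourism dictionary before training word vectors.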
Specifically, in step one, the question-and-answer data set is built through question generation and entity-related information is stored in the knowledge graph; entities and relations are identified through natural language processing of the question, and the corresponding attribute value is found in the question-and-answer data set;
Question generation can be viewed as the reverse task of QA: generating the corresponding question given an answer. In different QA/QG tasks the answer a can differ, e.g. a sentence in a document or a fact in a knowledge base.
In this task, the KBQA input of the present invention is a triple, so different methods need to be designed to compute the corresponding terms in the probability formula. To address the challenge of unseen predicates and phrases, the current approach is to train a sequence-to-sequence model that can generate questions for unseen predicates based on triples extracted from the knowledge base. In addition, the QA model can be fine-tuned by feeding it the generated questions together with the triples extracted from the KB.
The answer in KBQA is a fact from the knowledge base, while the answer in text QA is a sentence in a given document. This work focuses on KBQA, treated as a scoring and ranking problem in which a scalar estimates the relevance between the question and a triple. The QA task is simplified here to a relation detection task that takes the question and candidate relations as input and outputs the relation with the maximum probability of being the correct one; the subject entity is assumed to have been detected, so once the correct relation is determined the answer fact can be looked up in the knowledge base. The task of QG is to take a sentence or a triple as input and output a question that can be answered by that triple.
Typically such a framework consists of two components. The first is a dual learning component: it uses the probabilistic correlation between QA and QG to steer the parameters of the QA/QG models in a more appropriate direction. The second is a fine-tuning component, which improves the QA model's ability to handle unseen predicates and phrases by combining the QG model with the text corpus and the knowledge base triples. The framework is very flexible and does not depend on a particular QA or QG model.
Training a KBQA system relies on a high-quality annotated data set that is not only large-scale but also unbiased; building such a data set, covering a large number of the triples in the knowledge base, is very difficult.
Therefore, a fine-tuning framework is proposed to supplement the QA data set and increase the capability of the QA model.
QA model
Generally, a question answering (QA) model can be simplified into a relation classification model; compared with other KBQA subtasks such as entity linking, relation extraction has a more significant influence on the final result. Entity-linking accuracy in existing KBQA methods is high, while relation extraction performs less well because of unseen predicates or paraphrases.
The invention provides a relation extraction model based on a recurrent neural network. To better support unseen relations, relation names are decomposed into word sequences and relation extraction is cast as a sequence matching and ranking task; for example, the relation location.country.language can be split into {location, country, language}. Each of these tokens is converted into a pre-trained word embedding, a bidirectional long short-term memory model (BiLSTM) then produces a hidden representation, and finally local features are extracted by max pooling. The invention uses the same neural network to obtain the question representation, and then computes similarity with a cosine distance function. To learn a more general representation at the syntactic level, the invention replaces the entity with a generic symbol <e>, as in "where is <e> from". However, this mechanism loses all entity information and can in some cases confuse the model; the invention therefore detects the entity type and concatenates the type representation with the question representation, which significantly improves performance.
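The relation-name decomposition and entity abstraction described above are plain string preprocessing and can be sketched directly; the dotted relation format and the `<e>` placeholder follow the text's own examples, while the underscore splitting is an assumption for multi-word relation parts.

```python
def decompose_relation(relation):
    """Split a KB relation name such as 'location.country.language'
    into its word sequence, also splitting underscore-joined parts."""
    words = []
    for part in relation.split("."):
        words.extend(part.split("_"))
    return words

def abstract_entity(question, entity, placeholder="<e>"):
    """Replace the entity mention with a generic symbol so the model
    learns a syntax-level question representation."""
    return question.replace(entity, placeholder)
```

The decomposed word sequence is what gets embedded and fed to the BiLSTM, and the abstracted question is what the matching network scores against each candidate relation.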
The model described above is trained with a score-ranking method, with the following loss function:

l_rel = max{0, λ - S(q, r+) + S(q, r-)}

where the score of the question with the correct predicate (q, r+) must exceed that with an incorrect predicate (q, r-) by at least the margin λ, and S(q, r) = Cosine(h_q, h_r). The candidate relation set R consists of all predicates in the knowledge base connected to the subject entity e of q.
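The hinge-style ranking loss above is a one-liner; the margin value below is illustrative, as the patent does not state the λ it uses.

```python
def hinge_rank_loss(score_pos, score_neg, margin=0.1):
    """l_rel = max{0, margin - S(q, r+) + S(q, r-)}: the loss is zero once the
    correct predicate outscores the incorrect one by at least the margin."""
    return max(0.0, margin - score_pos + score_neg)
```

During training, one incorrect candidate relation r- is sampled per correct relation r+, and the loss pushes their cosine scores apart until the margin is met.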
QG model
The QG model is tasked with generating natural language, taking triples in the knowledge base as input. The generated question involves the subject and predicate of a fact, while the object of the fact constitutes a valid answer to the generated question. Assuming the input fact is a = {s, p, o}, the QG model approximates the probability of generating a question q = {w_1, w_2, …, w_n} as a product of conditional probabilities:

P(q | a) = Π_{t=1}^{n} P(w_t | w_{<t}, a)

where w_{<t} denotes all previously generated words up to time step t. Following the success of sequence-to-sequence learning in recent machine translation, the QG problem is treated as a translation task and solved with an encoder-decoder structure. Specifically, the encoder encodes a given fact a = {s, p, o} into three fixed-size vectors h_s = E_f e_s, h_p = E_f e_p and h_o = E_f e_o, where E_f is a KB embedding matrix learned with TransE and e_s, e_p, e_o are one-hot vectors for s, p, o. Concatenating these three vectors gives the encoded fact h_f = [h_s; h_p; h_o], from which the decoder then generates the question in order.
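The encoder step, a one-hot lookup into the embedding matrix E_f followed by concatenation into h_f = [h_s; h_p; h_o], can be sketched with plain lists. The toy embedding table and entity names below are invented stand-ins for TransE-learned embeddings.

```python
# Toy KB embedding table standing in for the TransE matrix E_f:
# a one-hot lookup E_f * e_s is simply a row selection.
E_f = {
    "WestLake": [0.1, 0.2],
    "locatedIn": [0.3, 0.4],
    "Hangzhou": [0.5, 0.6],
}

def encode_fact(s, p, o, table):
    """h_f = [h_s; h_p; h_o]: concatenate subject, predicate and object embeddings."""
    return table[s] + table[p] + table[o]
```

The resulting fixed-size vector h_f is what initialises the decoder that emits the question tokens one by one.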
Complementary question-answer pairs are generated in a fine-tuning component using the QG model to fine-tune the trained QA model. The tag data that is desired to be supplemented contains predicates and phrases that the QA model does not encounter during the training process. Thereby improving the performance of the QA model.Therefore, the QG model should be able to generate problems given triples and invisible predicates. For each fact a, n pieces of textual evidence D ═ D are collected from the wiki document1,d2…,dn-encoding each text evidence using a set of n-gated recurrent neural networks (GRUs) with shared parameters. The hidden state of the ith word in the j text evidences is calculated as:
The hidden states of the textual evidence are concatenated to obtain the final encoded text. The decoder is a GRU with an attention mechanism over the encoded input textual evidence. Given a set of encoded input vectors I = {h_1, h_2, …, h_k} and the previous decoder hidden state s_{t−1}, the attention mechanism computes a_t = {a_{1,t}, …, a_{k,t}} as a weight vector, where each a_{i,t} determines the weight of its corresponding encoded input vector h_i.
e_{i,t} = v_a^T tanh(W_a s_{t−1} + U_a h_i)
where v_a, W_a and U_a are trainable weight matrices of the attention module.
Then, for all tokens in all pieces of textual evidence:
where each term is a scalar value that determines, at time t, the weight of the i-th word in the j-th piece of textual evidence d_j.
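The attention computation above — the additive score e_{i,t} = v_a^T tanh(W_a s_{t−1} + U_a h_i) followed by softmax normalization into weights — can be sketched as follows, with random stand-ins for the trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8   # hidden size
k = 5   # number of encoded input vectors

# Trainable parameters of the attention module (random stand-ins here)
v_a = rng.normal(size=d)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))

def attention_weights(s_prev, H):
    """Additive attention: score each encoded vector h_i against the
    previous decoder state, then softmax into the weight vector a_t."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    e -= e.max()                       # numerical stability
    a = np.exp(e) / np.exp(e).sum()
    return a

H = rng.normal(size=(k, d))            # encoded input vectors
s_prev = rng.normal(size=d)            # previous decoder hidden state
a_t = attention_weights(s_prev, H)
assert abs(a_t.sum() - 1.0) < 1e-9 and (a_t >= 0).all()
```

The resulting weights sum to one, so the decoder's context vector is a convex combination of the encoded evidence.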
Preferably, the process of identifying entities and relations in a question through natural language processing and finding the corresponding attribute values in the question-answer data set comprises: knowledge-graph-based question answering can be roughly divided into two steps, namely named entity recognition and attribute mapping. Owing to limited data, most knowledge graphs in the prior art adopt template-based methods; to improve the efficiency and accuracy of question answering, the present invention uses a Bert model. The Bert model is an NLP model proposed by Google in 2018, namely the Encoder of a bidirectional Transformer (the Decoder is not used, since it cannot see the tokens to be predicted). The main innovation of the model lies in its pre-training method: two tasks, Masked LM and Next Sentence Prediction, capture representations at the word level and the sentence level respectively. The invention divides the whole intelligent question-answering process into two parts. The method uses a Bert-BiLSTM-CRF model, whose accuracy in named entity recognition is greatly improved over previous models. The attributes of the question are then judged; attribute judgment and conversion is a text-similarity problem, and a Bert model is again used for binary classification. Using the Bert model yields a marked improvement.
Further, the main problems to be solved by the question-answering system are named entity recognition and relation extraction. The goal of the entity recognition step is to find the name of the entity queried in the question, and the goal of the relation extraction step is to find the relevant attribute queried in the question. Current methods for obtaining pre-trained representation models are mainly feature-based methods or fine-tuning methods. A typical feature-based approach is the ELMo model. Among fine-tuning methods, the OpenAI GPT model considers only the information of preceding words: the vectors of preceding words are fed into a Transformer model, and L layers are stacked to obtain pre-trained representation features. GPT has a strong feature extraction capability and can match the effect of ELMo without a specially designed structure for the downstream task. For sentence-level problems, however, the context must be combined to grasp the meaning accurately; that is, it is not enough to consider only preceding words, and the meaning of following words must also be taken into account. This motivates the Bert model.
The Bert model is a large-scale pre-trained language model based on a bidirectional Transformer released by Google. It showed striking performance on the top-level machine reading comprehension benchmark SQuAD 1.1, comprehensively surpassing humans on both measurement indexes, and set new state-of-the-art results on many NLP tests; applying the Bert model in various models has become a major trend.
The Bert model has the following two characteristics. First, the model is very deep, with 12 layers, but not wide, with an intermediate (feed-forward) dimension of only 1024, whereas the previous Transformer's intermediate layer had 2048. This corroborates the view from computer image processing that deep, narrow models are superior to shallow, wide models. Second, the Masked Language Model uses words on both the left and the right sides; earlier models that consider this point merely combine separately trained left-to-right and right-to-left passes. During training, 15% of tokens are randomly masked, rather than predicting every word as CBOW does. The final loss function is computed only over the tokens dropped by the mask. Structurally, compared with the GPT model, information from following words is added, so the model fully considers the context. By contrast, although the ELMo model trains two directional RNN models, each RNN can only see one direction, so information from the preceding and following directions cannot be used simultaneously. To overcome this restriction to one-directional information, Bert uses a masked language model instead of an ordinary language model. The masked language model simply takes a sentence, blocks out a certain word, and predicts the blocked word. Here, 15% of words are randomly masked, and Bert is then used to predict the masked words; the parameters of the model are adjusted to make the probability of a correct prediction as large as possible, which is equivalent to a cross-entropy loss function. Such a Transformer refers to contextual information when encoding a word.
If a token is among the selected 15% of tokens, one of the following is executed at random: with 80% probability it is replaced with [MASK]; with 10% probability it is replaced with a random word; with 10% probability it is left unchanged. The advantage is that Bert does not know which word [MASK] replaces, and any word may have been replaced, so the model is forced not to rely too heavily on the current token when encoding it and to take its context into account.
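The 80/10/10 corruption rule can be sketched as follows. This is a minimal illustration; the token list and substitute vocabulary are hypothetical, and a real implementation operates on subword ids rather than word strings:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=42):
    """Select each token with probability `mask_prob`; a selected token
    becomes [MASK] 80% of the time, a random vocabulary word 10% of the
    time, and stays unchanged 10% of the time.
    Returns (corrupted tokens, indices where the loss is computed)."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets.append(i)            # loss is computed only here
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token
    return out, targets
```

Because 10% of selected tokens keep their surface form, the model cannot assume an unmasked token is trustworthy, which is exactly the regularizing effect described above.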
BiLSTM-CRF
The present invention utilizes a bidirectional LSTM network, which can make efficient use of past and future features within a specified time range. The word embedding vector of each word in the sentence, obtained from the Bert model, is taken as the input of each time step of the bidirectional LSTM, which outputs a predicted label for each word. If the input sentence consists of 120 words and each word is represented by a 100-dimensional word vector, the input to the model has shape (120, 100); after the BiLSTM the hidden vectors become T1 of shape (120, 128), where 128 is the output dimension of the BiLSTM in the model. If no CRF layer is used, a fully connected layer can be added to the model for classification. With the target labels of the word segmentation task set as B (Begin), M (Middle), E (End) and S (Single), the model finally outputs a vector of dimension (120, 4). The 4 floating-point values for each word represent the probabilities of the corresponding B, M, E and S labels, and the label with the highest probability is taken as the prediction. With a large amount of labeled data and continuous iterative optimization, the method can learn a good word segmentation model. However, although a good model can in theory be learned by relying on the powerful nonlinear fitting capability of neural networks, the BiLSTM model described above only considers context information from the inputs and ignores dependencies between the output labels. For a sequence labeling task, the label L_t at the current position has potential relations with the label L_{t−1} at the previous position and L_{t+1} at the next position. Therefore, most current approaches follow the BiLSTM with a CRF layer, so that the optimal label sequence is learned over the entire sequence.
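The shape bookkeeping above can be checked with a minimal numpy sketch. The BiLSTM itself is replaced by an arbitrary dense stand-in, since only the tensor shapes (120, 100) → (120, 128) → (120, 4) are being illustrated, not trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim, hidden, n_labels = 120, 100, 128, 4   # labels: B/M/E/S

X = rng.normal(size=(seq_len, emb_dim))       # Bert word vectors: (120, 100)

# Stand-in for the BiLSTM: any map producing the (120, 128) hidden sequence.
W_rnn = rng.normal(size=(emb_dim, hidden)) * 0.1
T1 = np.tanh(X @ W_rnn)                       # (120, 128)

# Without a CRF layer, a fully connected layer scores each label.
W_fc = rng.normal(size=(hidden, n_labels)) * 0.1
logits = T1 @ W_fc                            # (120, 4)
pred = logits.argmax(axis=1)                  # one B/M/E/S index per word

assert T1.shape == (120, 128) and logits.shape == (120, 4)
```

The per-position argmax is exactly the "take the label with the highest probability" step; the CRF layer described next replaces this independent argmax with a whole-sequence decode.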
For an input sentence X = (x_1, x_2, …, x_n), let P be the score matrix output by the BiLSTM network. P has size n × k, where k is the number of distinct labels, and P_{i,j} is the score of the j-th label for the i-th word in the sentence. For a predicted sequence y = (y_1, y_2, …, y_n), the score is then defined as:

s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
where A is the matrix of transition scores, so that A_{i,j} represents the score of transitioning from label i to label j; y_0 and y_{n+1} are the start and end labels of the sentence, which the present invention adds to the set of possible labels. A is therefore a square matrix of size k + 2.
It can be seen that the score of the entire sequence equals the sum of the scores of the individual positions, and each position's score is obtained from two parts: one part is the output P_i of the BiLSTM, and the other is determined by the transition matrix A of the CRF model.
The normalized probability can then be obtained using Softmax:

p(y | X) = e^{s(X, y)} / Σ_{ỹ ∈ Y_X} e^{s(X, ỹ)}
During training, the log probability of the correct label sequence is maximized:

log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} e^{s(X, ỹ)}   (1)
where Y_X represents all possible label sequences for sentence X. From this formulation it is clear that the network is encouraged to produce valid output label sequences. In decoding, the output sequence with the highest score is predicted by (2):

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)   (2)
Since only pairwise interactions between adjacent outputs are modeled, dynamic programming can be used to compute both the summation in (1) and the maximum a posteriori sequence y* in (2).
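The two quantities above — the sequence score s(X, y) built from the emission matrix P and transition matrix A (with added start and end labels), and the dynamic-programming decode of y* — can be sketched as follows. The matrices here are toy stand-ins, not trained values:

```python
import numpy as np

def sequence_score(P, A, y, start, end):
    """s(X, y): emission scores P[i, y_i] plus transition scores
    A[y_i, y_{i+1}], including the transitions from `start` and to `end`."""
    path = [start] + list(y) + [end]
    trans = sum(A[a, b] for a, b in zip(path, path[1:]))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit

def viterbi(P, A, start, end):
    """Dynamic-programming (Viterbi) decode of the best label sequence.
    P is (n, k); A is (k+2, k+2) with start/end labels at rows k, k+1."""
    n, k = P.shape
    delta = A[start, :k] + P[0]      # best score ending in each label
    back = []
    for i in range(1, n):
        # scores[prev, cur] = delta[prev] + A[prev, cur] + P[i, cur]
        scores = delta[:, None] + A[:k, :k] + P[i]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    delta = delta + A[:k, end]
    best = int(delta.argmax())
    path = [best]
    for bp in reversed(back):        # follow backpointers
        best = int(bp[best])
        path.append(best)
    return path[::-1]
```

Because only adjacent labels interact, each step of the decode needs only the k best partial scores from the previous position, giving O(n·k²) time instead of enumerating all k^n sequences.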
Bert-BiLSTM-CRF
The model consists of three parts. First, a Bert pre-trained language model is used to encode each character and obtain its corresponding word vector; then a BiLSTM layer encodes the input text bidirectionally; finally, the semantic vectors containing context information are input into a CRF layer for decoding, and the CRF layer outputs the label sequence with the maximum probability to obtain the category of each character.
The specific steps of generating and constructing the question-answer data set from questions are as follows:
Step one, the QA and QG models are jointly trained by exploiting the probabilistic correlation between QA and QG, using different methods to calculate the corresponding terms in the probability formula.
Step two, to address the problem of unseen predicates, a fine-tuning component is used: a sequence-to-sequence model with a copy operation is trained on textual evidence from Wikipedia, so that the model can generate questions for unseen predicates from triples extracted from the knowledge base.
Step three, the QA model is optimized by fine-tuning it on the generated questions together with the triples extracted from the knowledge base.
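The copy operation mentioned in step two can be illustrated with a small sketch: the decoder's final word distribution mixes a generation distribution over the vocabulary with attention mass copied from the source, which is how rare or unseen predicate words can still surface in a generated question. All names and numbers below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def copy_mix(p_vocab, attn, src_ids, p_gen):
    """Final word distribution of a copy-equipped seq2seq decoder:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on the
    source positions that hold w)."""
    p = p_gen * np.asarray(p_vocab, dtype=float)
    for pos, tok in enumerate(src_ids):
        p[tok] += (1.0 - p_gen) * attn[pos]
    return p

# Toy decoding step: vocabulary of 5 ids, source sentence ids [3, 4]
p_vocab = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # generation distribution
attn = np.array([0.7, 0.3])                    # attention over 2 source tokens
p = copy_mix(p_vocab, attn, src_ids=[3, 4], p_gen=0.6)
assert abs(p.sum() - 1.0) < 1e-9
```

Even though token 3 has low generation probability, the copied attention mass makes it the most likely output here, mirroring how an unseen predicate token can be reproduced from the textual evidence.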
Referring to fig. 1, the specific steps of constructing the question-answering system of the present invention are:
Step one, the natural language question provided by the user is analyzed: word segmentation and word vector training are performed on the question. In the word segmentation process, the jieba tool is used and a preset tourism-domain special dictionary is added, which improves the word segmentation accuracy.
Step two, named entity recognition and relation recognition are carried out using the Bert+BiLSTM+CRF model.
Step three, the identified entities and relations are used to query the corresponding entities in the knowledge graph. If the knowledge graph contains a matching entity, that entity is selected; if it does not, the closest entity is selected using semantic similarity calculation. The invention uses the cosine similarity algorithm to calculate semantic similarity, expressed as follows:

cos(X, Y) = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i²) · sqrt(Σ_{i=1}^{n} y_i²) )
where X represents the entity extracted from the question, Y represents an entity in the knowledge graph, and n is the dimension of the word vectors. By calculating the similarity, the knowledge-graph entity with the highest similarity is selected as the match.
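The entity-matching step can be sketched as follows. The entity names and two-dimensional vectors are purely illustrative assumptions; real word vectors would come from the trained embedding model:

```python
import math

def cosine_similarity(x, y):
    """cos(X, Y) = sum(x_i * y_i) / (|X| * |Y|) over n-dimensional vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def closest_entity(question_vec, kg_entities):
    """Return the knowledge-graph entity whose vector is most similar
    to the entity vector extracted from the question."""
    return max(kg_entities,
               key=lambda name: cosine_similarity(question_vec, kg_entities[name]))
```

A usage sketch: with a hypothetical graph {"West Lake": [1, 0], "Great Wall": [0, 1]}, a question vector near [1, 0] selects "West Lake" as the match.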
Step four, the selected entities and attributes are matched with triples in the knowledge graph.
Step five, the corresponding attribute value is returned to the user as the answer to the question.
The invention also provides a device for constructing the intelligent question-answering system by the question generation data set, which comprises the following components:
the tourism domain knowledge map building module stores entity related information in a triple form of entity-attribute value, extracts entities in question sentences through named entity identification when a problem is raised, links the entities with a knowledge map to find corresponding entities, matches entity attributes by using question sentence information and combining a rule template set in advance, and returns corresponding attribute values;
the natural language question analyzing module is used for performing word segmentation and word vector training on the natural language question, and a jieba tool is used and a preset tourism field dictionary is added in the word segmentation process;
the entity extraction module is used for firstly utilizing a Bert pre-training language model to encode a single character to obtain a word vector corresponding to the single character, then utilizing a BilSTM layer to carry out bidirectional encoding on an input text, and finally inputting a semantic vector containing context information into a CRF layer to be decoded, wherein the CRF layer outputs a label sequence with the maximum probability to obtain the category of each character;
the entity matching module is used for matching the extracted entity with the entity in the knowledge graph; if the knowledge graph has a matched entity, selecting the entity; if the knowledge graph does not have the matched entity, selecting the closest entity through semantic similarity;
and the answer feedback module matches the selected entities and attributes with triples in the knowledge graph, returns the corresponding attribute values, and provides the attribute values to the user as the answer to the question.
The invention also provides a terminal, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method for constructing the intelligent question-answering system through the question generation data set when executing the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method for constructing an intelligent question-and-answer system from question generated data sets.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to perform the method of the invention. The terminal can be a desktop computer, a notebook, a palm computer, a cloud server and other computing equipment, and can also be a processor and a memory.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The memory can be used to store the computer program and/or modules, and the processor realizes the various functions of the device for constructing the intelligent question-answering system by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the technical solution of the present invention, and it should be understood by those skilled in the art that the technical solution can be modified and replaced by a plurality of simple modifications and replacements without departing from the spirit and principle of the present invention, and the modifications and replacements also fall within the protection scope covered by the claims.
Claims (9)
1. A method for constructing an intelligent question-answering system through question generation data sets is characterized by comprising the following steps:
step one, establishing a tourist field knowledge map;
storing entity-related information in the knowledge graph in the triple form "entity-attribute-value"; when a question is raised, first extracting the entities in the question through named entity recognition and linking them with the knowledge graph to find the corresponding entities; then performing entity attribute matching using the question information and a preset rule template, and returning the corresponding attribute value;
step two, performing question analysis on a natural language question provided by a user, performing word segmentation and word vector training on the question, and adding a preset tourism field dictionary by using a jieba tool in the word segmentation process;
step three, performing entity extraction on the natural language question using a Bert-BiLSTM-CRF model;
step four, matching the extracted entity with an entity in the knowledge graph; if the knowledge graph has a matched entity, selecting the entity; if the knowledge graph has no matched entities, performing semantic similarity calculation, and selecting the closest entities;
matching the selected entities and attributes with triples in the knowledge graph;
and step six, returning the corresponding attribute value as an answer of the question to the user.
2. The method for constructing an intelligent question-answering system through question generation data sets according to claim 1, wherein the constructed tourism-domain knowledge graph uses question generation to construct question-answer data sets, wherein the task of question generation is to generate a corresponding natural language question given an input answer, and question answering is the reverse task of finding a suitable answer to a given question.
3. The method for constructing an intelligent question-answering system through question generation data sets according to claim 1, wherein in the third step, a Bert pre-trained language model is used to encode each character to obtain its corresponding word vector, a BiLSTM layer is used to encode the input text bidirectionally, and finally the semantic vectors containing context information are input into a CRF layer for decoding, wherein the CRF layer outputs the label sequence with the maximum probability to obtain the category of each character.
4. The method for constructing an intelligent question-answering system through question generation data sets according to claim 1, wherein the fourth step utilizes the cosine similarity algorithm to perform the semantic similarity calculation, the cosine similarity being expressed as follows:

cos(X, Y) = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i²) · sqrt(Σ_{i=1}^{n} y_i²) )
wherein X represents an entity extracted from the question, Y represents an entity in the knowledge graph, and n represents the dimension of a word vector; and selecting the entity in the knowledge graph with the highest similarity as matching by calculating the similarity.
5. The method for constructing an intelligent question-answering system through question generation data sets according to claim 1, characterized in that the QG and QA models are combined: with the help of a knowledge base and a text corpus, the QA and QG models are first jointly trained on gold data sets, and the QA model is then fine-tuned using a supplementary data set constructed by the QG model from text.
6. The method according to claim 1, wherein the entity-related information in step one is crawled using a Python tool, the entity-related information comprises tourist attraction names, profiles, construction times and scenic-spot levels, and the crawled data is cleaned and collated to form the tourism knowledge graph.
7. An apparatus for constructing an intelligent question-answering system from question-generating data sets, comprising:
the tourism-domain knowledge graph construction module, which stores entity-related information in the triple form "entity-attribute-value"; when a question is raised, it extracts the entities in the question through named entity recognition, links them with the knowledge graph to find the corresponding entities, matches entity attributes using the question information combined with a preset rule template, and returns the corresponding attribute values;
the natural language question analyzing module is used for performing word segmentation and word vector training on the natural language question, and a jieba tool is used and a preset tourism field dictionary is added in the word segmentation process;
the entity extraction module, which first uses a Bert pre-trained language model to encode each character and obtain its corresponding word vector, then uses a BiLSTM layer to encode the input text bidirectionally, and finally inputs the semantic vectors containing context information into a CRF layer for decoding, wherein the CRF layer outputs the label sequence with the maximum probability to obtain the category of each character;
the entity matching module is used for matching the extracted entity with the entity in the knowledge graph; if the knowledge graph has a matched entity, selecting the entity; if the knowledge graph does not have the matched entity, selecting the closest entity through semantic similarity;
and the answer feedback module, which matches the selected entities and attributes with triples in the knowledge graph, returns the corresponding attribute values, and provides the attribute values to the user as the answer to the question.
8. A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, carries out the steps of the method of building an intelligent question-answering system from question-generating data sets according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, realizes the steps of the method for constructing an intelligent question-answering system through question generation data sets according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010956043.2A CN112100351A (en) | 2020-09-11 | 2020-09-11 | Method and equipment for constructing intelligent question-answering system through question generation data set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112100351A true CN112100351A (en) | 2020-12-18 |
Family
ID=73750928
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111353030A (en) * | 2020-02-26 | 2020-06-30 | 陕西师范大学 | Knowledge question and answer retrieval method and device based on travel field knowledge graph |
CN111639163A (en) * | 2020-04-29 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Problem generation model training method, problem generation method and related equipment |
- 2020-09-11: Application CN202010956043.2A filed (CN); published as CN112100351A; status: Pending
Non-Patent Citations (3)
Title |
---|
Sen Hu et al.: "How Question Generation Can Help Question Answering over Knowledge Base", Natural Language Processing and Chinese Computing * |
Wang Wenhua: "Design and Implementation of a Multi-Model Fusion Question Generation Algorithm", China Master's Theses Full-text Database (Information Science and Technology) * |
Zhao Ping et al.: "Chinese Scenic Spot Named Entity Recognition Based on BERT+BiLSTM+CRF", Computer Systems & Applications * |
Cited By (85)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112527999A (en) * | 2020-12-22 | 2021-03-19 | 江苏省农业科学院 | Extraction type intelligent question and answer method and system introducing agricultural field knowledge |
CN112527999B (en) * | 2020-12-22 | 2024-05-14 | 江苏省农业科学院 | Extraction type intelligent question-answering method and system for introducing knowledge in agricultural field |
CN112463944B (en) * | 2020-12-22 | 2023-10-24 | 安徽商信政通信息技术股份有限公司 | Search type intelligent question-answering method and device based on multi-model fusion |
CN112463944A (en) * | 2020-12-22 | 2021-03-09 | 安徽商信政通信息技术股份有限公司 | Retrieval type intelligent question-answering method and device based on multi-model fusion |
CN112528003B (en) * | 2020-12-24 | 2022-10-04 | 北京理工大学 | Multi-item selection question-answering method based on semantic sorting and knowledge correction |
CN112528003A (en) * | 2020-12-24 | 2021-03-19 | 北京理工大学 | Multi-item selection question-answering method based on semantic sorting and knowledge correction |
CN112507139A (en) * | 2020-12-28 | 2021-03-16 | 深圳力维智联技术有限公司 | Knowledge graph-based question-answering method, system, equipment and storage medium |
CN112507139B (en) * | 2020-12-28 | 2024-03-12 | 深圳力维智联技术有限公司 | Knowledge graph-based question and answer method, system, equipment and storage medium |
CN112650845B (en) * | 2020-12-30 | 2023-01-03 | 西安交通大学 | Question-answering system and method based on BERT and knowledge representation learning |
CN112650845A (en) * | 2020-12-30 | 2021-04-13 | 西安交通大学 | Question-answering system and method based on BERT and knowledge representation learning |
CN112667794A (en) * | 2020-12-31 | 2021-04-16 | 民生科技有限责任公司 | Intelligent question-answer matching method and system based on twin network BERT model |
CN112685550A (en) * | 2021-01-12 | 2021-04-20 | 腾讯科技(深圳)有限公司 | Intelligent question answering method, device, server and computer readable storage medium |
CN112685550B (en) * | 2021-01-12 | 2023-08-04 | 腾讯科技(深圳)有限公司 | Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium |
CN112733526B (en) * | 2021-01-28 | 2023-11-17 | 成都不问科技有限公司 | Extraction method for automatically identifying tax collection object in financial file |
CN112733526A (en) * | 2021-01-28 | 2021-04-30 | 成都不问科技有限公司 | Extraction method for automatically identifying taxation objects in finance and tax file |
CN112883175B (en) * | 2021-02-10 | 2022-06-14 | 武汉大学 | Meteorological service interaction method and system combining pre-training model and template generation |
CN112883175A (en) * | 2021-02-10 | 2021-06-01 | 武汉大学 | Meteorological service interaction method and system combining pre-training model and template generation |
CN112749567A (en) * | 2021-03-01 | 2021-05-04 | 哈尔滨理工大学 | Question-answering system based on reality information environment knowledge graph |
CN115129828A (en) * | 2021-03-25 | 2022-09-30 | 科沃斯商用机器人有限公司 | Human-computer interaction method and device, intelligent robot and storage medium |
CN113076718A (en) * | 2021-04-09 | 2021-07-06 | 苏州爱语认知智能科技有限公司 | Commodity attribute extraction method and system |
WO2022222716A1 (en) * | 2021-04-21 | 2022-10-27 | 华东理工大学 | Construction method and apparatus for chemical industry knowledge graph, and intelligent question and answer method and apparatus |
CN113127626B (en) * | 2021-04-22 | 2024-04-30 | 广联达科技股份有限公司 | Recommendation method, device, equipment and readable storage medium based on knowledge graph |
CN113127626A (en) * | 2021-04-22 | 2021-07-16 | 广联达科技股份有限公司 | Knowledge graph-based recommendation method, device and equipment and readable storage medium |
CN113312490A (en) * | 2021-04-28 | 2021-08-27 | 乐山师范学院 | Event knowledge graph construction method for emergency |
CN113011196A (en) * | 2021-04-28 | 2021-06-22 | 广西师范大学 | Concept-enhanced representation and one-way attention-containing subjective question automatic scoring neural network model |
CN113011196B (en) * | 2021-04-28 | 2023-01-10 | 陕西文都教育科技有限公司 | Concept-enhanced representation and one-way attention-containing subjective question automatic scoring neural network model |
CN113505586A (en) * | 2021-06-07 | 2021-10-15 | 中电鸿信信息科技有限公司 | Seat-assisted question-answering method and system integrating semantic classification and knowledge graph |
CN113297368B (en) * | 2021-06-29 | 2023-09-19 | 中国人民解放军国防科技大学 | Intelligent auxiliary method for data link network guarantee demand analysis and application thereof |
CN113297368A (en) * | 2021-06-29 | 2021-08-24 | 中国人民解放军国防科技大学 | Intelligent auxiliary method for data link network guarantee demand analysis and application thereof |
CN113535917A (en) * | 2021-06-30 | 2021-10-22 | 山东师范大学 | Intelligent question-answering method and system based on travel knowledge map |
CN113657125B (en) * | 2021-07-14 | 2023-05-26 | 内蒙古工业大学 | Mongolian non-autoregressive machine translation method based on knowledge graph |
CN113657125A (en) * | 2021-07-14 | 2021-11-16 | 内蒙古工业大学 | Knowledge graph-based Mongolian non-autoregressive machine translation method |
CN113626553B (en) * | 2021-07-15 | 2024-02-20 | 人民网股份有限公司 | Cascade binary Chinese entity relation extraction method based on pre-training model |
CN113626553A (en) * | 2021-07-15 | 2021-11-09 | 人民网股份有限公司 | Cascade binary Chinese entity relation extraction method based on pre-training model |
CN113297369B (en) * | 2021-07-26 | 2022-04-01 | 中国科学院自动化研究所 | Intelligent question-answering system based on knowledge graph subgraph retrieval |
CN113297369A (en) * | 2021-07-26 | 2021-08-24 | 中国科学院自动化研究所 | Intelligent question-answering system based on knowledge graph subgraph retrieval |
CN113590844A (en) * | 2021-08-09 | 2021-11-02 | 北京智源人工智能研究院 | Knowledge graph-based question-answer library generation method and device, electronic equipment and storage medium |
CN114048321A (en) * | 2021-08-12 | 2022-02-15 | 湖南达德曼宁信息技术有限公司 | Multi-granularity text error correction data set generation method, device and equipment |
CN114048321B (en) * | 2021-08-12 | 2024-08-13 | 湖南达德曼宁信息技术有限公司 | Multi-granularity text error correction data set generation method, device and equipment |
CN113449528B (en) * | 2021-08-30 | 2021-11-30 | 企查查科技有限公司 | Address element extraction method and device, computer equipment and storage medium |
CN113449528A (en) * | 2021-08-30 | 2021-09-28 | 企查查科技有限公司 | Address element extraction method and device, computer equipment and storage medium |
CN113724819B (en) * | 2021-08-31 | 2024-04-26 | 平安国际智慧城市科技股份有限公司 | Training method, device, equipment and medium for medical named entity recognition model |
CN113724819A (en) * | 2021-08-31 | 2021-11-30 | 平安国际智慧城市科技股份有限公司 | Training method, device, equipment and medium for medical named entity recognition model |
CN113672720A (en) * | 2021-09-14 | 2021-11-19 | 国网天津市电力公司 | Power audit question and answer method based on knowledge graph and semantic similarity |
CN113763135A (en) * | 2021-09-18 | 2021-12-07 | 京东科技信息技术有限公司 | Commodity data comparison method and device, electronic equipment and storage medium |
CN113961667A (en) * | 2021-09-23 | 2022-01-21 | 哈尔滨工业大学(深圳) | BERT-based intelligent question-answering system with dynamic threshold adjustment |
CN113569554A (en) * | 2021-09-24 | 2021-10-29 | 北京明略软件系统有限公司 | Entity pair matching method and device in database, electronic equipment and storage medium |
CN113987145B (en) * | 2021-10-22 | 2024-02-02 | 智联网聘信息技术有限公司 | Method, system, equipment and storage medium for accurately reasoning user attribute entity |
CN113987145A (en) * | 2021-10-22 | 2022-01-28 | 智联(无锡)信息技术有限公司 | Method, system, equipment and storage medium for accurately reasoning user attribute entity |
CN114065741A (en) * | 2021-11-16 | 2022-02-18 | 北京有竹居网络技术有限公司 | Method, device, apparatus and medium for verifying the authenticity of a representation |
CN114036956A (en) * | 2021-11-18 | 2022-02-11 | 清华大学 | Tourism knowledge semantic analysis method and device |
CN114048298B (en) * | 2021-11-23 | 2024-05-31 | 中国民用航空总局第二研究所 | Intent understanding method, device, apparatus and medium |
CN114048298A (en) * | 2021-11-23 | 2022-02-15 | 中国民用航空总局第二研究所 | Intention understanding method, device, equipment and medium |
CN114238653A (en) * | 2021-12-08 | 2022-03-25 | 华东师范大学 | Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education |
CN114238653B (en) * | 2021-12-08 | 2024-05-24 | 华东师范大学 | Method for constructing programming education knowledge graph, completing and intelligently asking and answering |
CN114218372A (en) * | 2021-12-17 | 2022-03-22 | 北京北大软件工程股份有限公司 | Knowledge graph retrieval method and system based on knowledge base representation |
CN114003709A (en) * | 2021-12-30 | 2022-02-01 | 中国电子科技集团公司第二十八研究所 | Intelligent question-answering system and method based on question matching |
CN114416947A (en) * | 2022-01-17 | 2022-04-29 | 中国科学技术大学 | Relation perception similar problem identification and evaluation method, system, equipment and storage medium |
CN114153993A (en) * | 2022-02-07 | 2022-03-08 | 杭州远传新业科技有限公司 | Automatic knowledge graph construction method and system for intelligent question answering |
CN114153993B (en) * | 2022-02-07 | 2022-05-06 | 杭州远传新业科技有限公司 | Automatic knowledge graph construction method and system for intelligent question answering |
CN114691845A (en) * | 2022-02-22 | 2022-07-01 | 北京市农林科学院 | Semantic search method and device, electronic equipment, storage medium and product |
CN114691845B (en) * | 2022-02-22 | 2024-10-01 | 北京市农林科学院 | Semantic search method, semantic search device, electronic equipment, storage medium and product |
WO2023246093A1 (en) * | 2022-06-24 | 2023-12-28 | 重庆长安汽车股份有限公司 | Common question answering method and system, device and medium |
CN115168565A (en) * | 2022-07-07 | 2022-10-11 | 北京数美时代科技有限公司 | Cold start method, device, equipment and storage medium for vertical domain language model |
CN115248854B (en) * | 2022-09-22 | 2023-03-03 | 中邮消费金融有限公司 | Automatic question-answering method, system and storage medium based on knowledge graph |
CN115248854A (en) * | 2022-09-22 | 2022-10-28 | 中邮消费金融有限公司 | Knowledge graph-based automatic question answering method, system and storage medium |
CN115827844A (en) * | 2022-12-12 | 2023-03-21 | 之江实验室 | Knowledge graph question-answering method and system based on SPARQL statement generation |
CN115827844B (en) * | 2022-12-12 | 2023-08-08 | 之江实验室 | Knowledge graph question-answering method and system based on SPARQL statement generation |
CN115599902A (en) * | 2022-12-15 | 2023-01-13 | 西南石油大学 | Oil-gas encyclopedia question-answering method and system based on knowledge graph |
CN115859989A (en) * | 2023-02-13 | 2023-03-28 | 神州医疗科技股份有限公司 | Entity identification method and system based on remote supervision |
CN116010583B (en) * | 2023-03-17 | 2023-07-18 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Cascade coupling knowledge enhancement dialogue generation method |
CN116010583A (en) * | 2023-03-17 | 2023-04-25 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Cascade coupling knowledge enhancement dialogue generation method |
CN116662523B (en) * | 2023-08-01 | 2023-10-20 | 宁波甬恒瑶瑶智能科技有限公司 | Biochemical knowledge question-answering method, system and storage medium based on GPT model |
CN116662523A (en) * | 2023-08-01 | 2023-08-29 | 宁波甬恒瑶瑶智能科技有限公司 | Biochemical knowledge question-answering method, system and storage medium based on GPT model |
CN116737967B (en) * | 2023-08-15 | 2023-11-21 | 中国标准化研究院 | Knowledge graph construction and perfecting system and method based on natural language |
CN116737967A (en) * | 2023-08-15 | 2023-09-12 | 中国标准化研究院 | Knowledge graph construction and perfecting system and method based on natural language |
CN117539996A (en) * | 2023-11-21 | 2024-02-09 | 北京拓医医疗科技服务有限公司 | Consultation question-answering method and system based on user portrait |
CN117669718B (en) * | 2023-12-05 | 2024-07-12 | 广州鸿蒙信息科技有限公司 | Fire control knowledge training model and training method based on artificial intelligence |
CN117669718A (en) * | 2023-12-05 | 2024-03-08 | 广州鸿蒙信息科技有限公司 | Fire control knowledge training model and training method based on artificial intelligence |
CN117688189A (en) * | 2023-12-27 | 2024-03-12 | 珠江水利委员会珠江水利科学研究院 | Knowledge graph, knowledge base and large language model fused question-answering system construction method |
CN117520522B (en) * | 2023-12-29 | 2024-03-22 | 华云天下(南京)科技有限公司 | Intelligent dialogue method and device based on combination of RPA and AI and electronic equipment |
CN117520522A (en) * | 2023-12-29 | 2024-02-06 | 华云天下(南京)科技有限公司 | Intelligent dialogue method and device based on combination of RPA and AI and electronic equipment |
CN118132735A (en) * | 2024-05-07 | 2024-06-04 | 支付宝(杭州)信息技术有限公司 | Medical rule base generation method and device |
CN118210907A (en) * | 2024-05-20 | 2024-06-18 | 成都信息工程大学 | Knowledge graph vectorization representation-based intelligent question-answer model construction method |
CN118210907B (en) * | 2024-05-20 | 2024-07-09 | 成都信息工程大学 | Knowledge graph vectorization representation-based intelligent question-answer model construction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112100351A (en) | Method and equipment for constructing intelligent question-answering system through question generation data set | |
Li et al. | Oscar: Object-semantics aligned pre-training for vision-language tasks | |
Zhu et al. | Simple is not easy: A simple strong baseline for textvqa and textcaps | |
CN108829719B (en) | Non-fact question-answer selection method and system | |
CN110059217B (en) | Image text cross-media retrieval method for two-stage network | |
Zhu et al. | Knowledge-based question answering by tree-to-sequence learning | |
CN108628935B (en) | Question-answering method based on end-to-end memory network | |
CN111611361A (en) | Intelligent reading, understanding, question answering system of extraction type machine | |
CN112100346B (en) | Visual question-answering method based on fusion of fine-grained image features and external knowledge | |
CN112232053B (en) | Text similarity computing system, method and storage medium based on multi-keyword pair matching | |
Wu et al. | Switchable novel object captioner | |
CN111598041A (en) | Image generation text method for article searching | |
CN111291188A (en) | Intelligent information extraction method and system | |
CN113626589A (en) | Multi-label text classification method based on mixed attention mechanism | |
CN114169312A (en) | Two-stage hybrid automatic summarization method for judicial official documents | |
CN110516145B (en) | Information searching method based on sentence vector coding | |
CN114417851B (en) | Emotion analysis method based on keyword weighted information | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
Cheng et al. | A semi-supervised deep learning image caption model based on Pseudo Label and N-gram | |
CN112183083A (en) | Abstract automatic generation method and device, electronic equipment and storage medium | |
CN114429132A (en) | Named entity identification method and device based on mixed lattice self-attention network | |
Gomez-Perez et al. | ISAAQ--Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | |
CN112115253A (en) | Depth text ordering method based on multi-view attention mechanism | |
CN116662502A (en) | Method, equipment and storage medium for generating financial question-answer text based on retrieval enhancement | |
Song et al. | LSTM-in-LSTM for generating long descriptions of images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2020-12-18 |