CN112256847B - Knowledge base question-answering method integrating fact texts

Info

Publication number: CN112256847B
Authority: CN (China)
Prior art keywords: fact, question, knowledge base, entity, answer
Legal status: Active (granted)
Application number: CN202011056492.8A
Other languages: Chinese (zh)
Other versions: CN112256847A
Inventors: 余正涛, 王广祥, 郭军军, 相艳, 黄于欣, 线岩团
Current and original assignee: Kunming University of Science and Technology
Application filed by Kunming University of Science and Technology

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/355 Class or cluster creation or modification
    • G06F40/205 Parsing
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a knowledge base question-answering method that fuses fact texts, and belongs to the field of natural language processing. The method parses the triples of the candidate answers to a natural language question, converts the entities, entity types and relations in each triple into a fact text, maps the natural language question and the fact texts into numerical vectors in a low-dimensional semantic space through the pre-trained language model BERT, and scores and ranks the candidates with cosine similarity. The resulting knowledge base question-answering model fusing fact texts learns the relevance between the natural language question and the candidate answer triples, so as to find the answer in the knowledge base whose semantics are most similar to those of the question.

Description

Knowledge base question-answering method integrating fact texts
Technical Field
The invention relates to a knowledge base question-answering method that fuses fact texts, and belongs to the field of natural language processing.
Background
In recent years, with the rapid development of large-scale knowledge bases such as Freebase, DBpedia and YAGO, knowledge base question answering has received increasing attention from researchers. A typical knowledge base adopts "head entity-relation-tail entity" triples as its basic units, forming a graph structure; each triple is called a fact, for example (Zhang San, place_of_birth, Shang Hai). Because a knowledge base contains a vast number of triples, it is difficult for ordinary users to obtain valuable information from it directly. Knowledge base question answering can answer a natural language question directly from the facts in the knowledge base, providing a way to access the knowledge base directly. For example, given the question "Where is Zhang San's birthplace?", the goal is to return the answer "Shang Hai" directly by querying the triple (Zhang San, place_of_birth, Shang Hai).
The main task of knowledge base question answering is to query and match structured data according to the semantics of a natural language question. One effective way to accomplish this is to convert the question into a query statement that a machine can understand and execute, such as SQL or SPARQL. With the rapid development of deep learning, knowledge base question-answering methods based on representation learning have received more and more attention; their core idea is to compute the correlation between the question and candidate answer entities through representation learning. These methods have achieved good results, but knowledge base question answering is far from solved. The most important challenge is the semantic gap between natural language questions and the triples in the knowledge base, which manifests in two ways: the relation of a triple has various expressions in natural language, e.g. the relation "place_of_birth" can be expressed as "birthplace", "born in", etc.; and the entity names in triples are ambiguous, e.g. several different surface forms (such as "Zhang three") may all refer to the entity "Zhang San". Therefore, how to bridge this semantic gap has become one of the difficulties and key technologies of the knowledge base question-answering task.
Disclosure of Invention
The invention provides a knowledge base question-answering method that fuses fact texts. It encodes most of the information of the candidate answer triples while retaining all the information of the natural language question, so that the two resulting numerical vectors contain the complete question semantics and most of the candidate triple information, which makes matching their relevance to find the final answer more effective.
The technical scheme of the invention is as follows: a knowledge base question-answering method fusing fact texts comprises the following steps:
step1, identifying a subject entity;
step2, candidate answer retrieval: searching candidate answers in a knowledge base Freebase according to the subject entities;
step3, fact text generation: converting each candidate answer into a fact text that expresses it in textual form;
step4, constructing the knowledge base question-answering model: representing the natural language question and the candidate answers as numerical vectors with BERT, calculating the similarity scores between the vectors, and selecting the candidate with the highest score as the final answer.
As a preferred embodiment of the present invention, in Step1, a model based on the bidirectional long short-term memory network (BiLSTM) is used to execute the topic entity recognition task: each word is mapped to a word vector using a pre-trained GloVe model, and the probability of whether the word is a topic entity word is obtained through the BiLSTM, a fully connected layer and Softmax.
As a preferred embodiment of the present invention, in Step2, the facts satisfying the constraint condition are retrieved from the knowledge base, according to the identified subject entity, as candidate answers.
As a preferred embodiment of the present invention, the constraint condition is: if the head entity of a triple in the knowledge base belongs to the subject entities of the question, the triple is listed as a candidate answer.
As a preferred embodiment of the present invention, in Step3, the entities, relation and type in the triple corresponding to each candidate answer obtained in Step2 are assembled into a fact text through the pattern "question word + tail entity type + relation + head entity".
As a preferred embodiment of the present invention, in Step4, a multi-layer bidirectional Transformer encoder (BERT) is used to learn representations of the natural language question and of the fact texts generated in Step3; the cosine similarity between the vector representation of the question and that of each fact text is then computed as the score of the corresponding candidate answer, the scores are ranked numerically, and the candidate answer with the highest score is the final answer.
The invention has the following beneficial effects:
Through the designed pattern, the invention converts representation learning of knowledge base triples into text representation learning of the same form as the natural language question, and performs this text representation learning with BERT. This not only encodes most of the information in the triples, but also exploits BERT's rich semantic knowledge to narrow the semantic gap between natural language questions and knowledge base triples, effectively improving the effectiveness and robustness of knowledge base question answering. The proposed model targets simple questions and achieves good results on a simple-question dataset.
Drawings
FIG. 1 is a knowledge base question-answering system framework incorporating factual text as set forth in the present invention;
FIG. 2 is a subject entity recognition model proposed by the present invention;
FIG. 3 is the BERT structure based on a twin network architecture proposed by the present invention.
Detailed Description
Example 1: as shown in FIG. 1 to FIG. 3, a knowledge base question-answering method fusing fact texts comprises the following specific steps:
step1, topic entity identification: identifying a subject entity in a natural language question input into the system through a subject entity identification model;
the subject entity refers to a knowledge base entity mentioned in the natural language question Q, such as in the question "Where is Zhang San's birthdance? "the corresponding entity of" Zhang San "in the knowledge base is the subject entity of the problem. In our method, a model based on a bi-directional recurrent neural network (e.g., bilSTM) is employed to perform topic entity recognitionAnd (5) transaction. Model as shown in fig. 2, a natural language question Q = w is given that contains n words 1 ,w 2 ,…,w n Firstly, mapping n words into a word vector { x ] through a pre-trained Glove model j }, where j =1, \8230;, n; then learning forward hidden states using BilSTM
Figure GDA0003993019020000031
And a backward hidden state>
Figure GDA0003993019020000032
The calculation formulas of the backward hidden state are shown in (1) - (6).
Figure GDA0003993019020000033
Figure GDA0003993019020000034
/>
Figure GDA0003993019020000035
Figure GDA0003993019020000036
Figure GDA0003993019020000037
Figure GDA0003993019020000038
Wherein i j The input gate determines which information the LSTM stores; f. of j If the LSTM is a forgetting gate, determining which information is lost by the LSTM; o j Is an output gate, which determines the information to be output by the LSTM unit; sigma represents a sigmoid function; . Representing the Hadamard product.
The forward hidden state and the backward hidden state are connected in series to obtain
Figure GDA0003993019020000039
h j Entering a full connection layer and obtaining a two-dimensional vector(s) after passing through Softmax>
Figure GDA00039930190200000310
The first dimension represents the probability that the word is a subject entity word, and the second dimension represents the probability that the word is not a subject entity word; and associating the recognized words or phrases of the subject entity in the knowledge base, and returning the entities which are the same as or contain the words (phrases) of the subject entity in the knowledge base as the subject entity.
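As a minimal sketch of this topic entity recognition model (the layer sizes, vocabulary size and GloVe loading below are illustrative assumptions, not the authors' exact configuration), the GloVe-BiLSTM-Softmax tagger can be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

class TopicEntityTagger(nn.Module):
    """BiLSTM tagger: per-word probability of being a topic entity word."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if glove_weights is not None:  # pre-trained GloVe matrix, shape (vocab_size, emb_dim)
            self.embed.weight.data.copy_(glove_weights)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 2)    # two classes: entity word / not

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                     # concatenated forward/backward states
        return torch.softmax(self.fc(h), dim=-1)  # (batch, seq_len, 2)

# Usage: probs[..., 0] is the probability that each word is a topic entity word.
tagger = TopicEntityTagger(vocab_size=10000)
probs = tagger(torch.randint(0, 10000, (1, 8)))
```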
Step2, candidate answer retrieval: searching candidate answers in a knowledge base Freebase according to the subject entities;
intuitively, given a natural language question, all entities in the knowledge base are candidate answers. However, the number of entities in the knowledge base is usually very large, such as 6800 million entities in Freebase, and using all entities as candidate answers is both computationally expensive and inefficient. In practice, entities identified according to the topic entity identification model in Step1 are usually used for searching for the fact meeting the constraint conditions from the knowledge base as candidate answers, and the problem mainly solved by the method is a simple problem, so the constraint conditions are as follows: if the head entity of a triple in the knowledge base belongs to the subject entity of the question, the triple is listed as a candidate answer.
Step3, fact text generation: converting each candidate answer into a fact text that expresses it in textual form;
each candidate triple comprises a head entity, a relation and a tail entity, the tail entity can find a corresponding type in the triple, and by analyzing the information, the information contained in the head entity, the relation and the tail entity type and the information contained in the natural language question corresponding to the head entity, the relation and the tail entity type have high overlap ratio, such as: "Where is Zhang San's birthplace? "middle" Zhang San "corresponds to the head entity" Zhang San "in the triplet; "birthdaplace" corresponds to the relationship "place _ of _ birth" in the triple; "Where is" is a type "city town" that asks questions about a place, corresponding to the tail entity "Shang Hai" in the triplet. Therefore, each candidate triple is constructed into a fact text through a mode of ' question words + tail entity types + relations + head entities ', and the fact text corresponding to the example is ' Where city town place of birth Zhang San? "such factual text may be syntactically inaccurate, but retains much of the information contained in the candidate triples.
According to the method, a fact text is first generated from the answer fact corresponding to each natural language question through the pattern "question word + tail entity type + relation + head entity"; each "natural language question-fact text" pair is then labeled with a score from 0 to 5. The labeling requires no manual work: parts of the fact text (the tail entity type, relation and head entity) are randomly replaced to generate "natural language question-fact text" pairs with different scores. Score 5: no replacement, meaning this fact is the answer to the question. Score 4: the tail entity type is randomly replaced, meaning the head entity and relation are correct but the tail entity is wrong. Score 3: the tail entity type and head entity are randomly replaced, meaning the relation is correct but the tail entity and head entity are wrong. Score 2: the relation is randomly replaced, meaning the tail entity and head entity are correct but the relation is wrong. Score 1: the tail entity type and relation are randomly replaced, meaning the head entity is correct but the tail entity type and relation are wrong. Score 0: the tail entity type, relation and head entity are all randomly replaced, meaning all three are wrong. Specific examples are shown in Table 1, and a code sketch of the substitution scheme follows the table. In total, the random replacement operation generates 455,460 training pairs and 65,070 validation pairs across scores 0 to 5, with the same amount of data for each score.
TABLE 1 "Natural language question-fact text" pair labeling method
[Table 1 appears as an image in the original publication: example question-fact text pairs for each score from 0 to 5.]
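A sketch of the automatic labeling scheme just described (the replacement pools are illustrative; in the patent they would be drawn from the knowledge base):

```python
import random

TYPES = ["citytown", "person", "profession"]
RELATIONS = ["place of birth", "profession", "nationality"]
HEADS = ["Zhang San", "Li Si", "Wang Wu"]
POOLS = {"type": TYPES, "relation": RELATIONS, "head": HEADS}

# Which slots are randomly replaced for each score, per the description above.
REPLACED = {5: (), 4: ("type",), 3: ("type", "head"), 2: ("relation",),
            1: ("type", "relation"), 0: ("type", "relation", "head")}

def make_pair(fact, score):
    """fact: dict with 'qword', 'type', 'relation', 'head'. Returns (fact_text, score)."""
    f = dict(fact)
    for slot in REPLACED[score]:
        f[slot] = random.choice([v for v in POOLS[slot] if v != fact[slot]])
    return f"{f['qword']} {f['type']} {f['relation']} {f['head']}?", score

gold = {"qword": "Where", "type": "citytown",
        "relation": "place of birth", "head": "Zhang San"}
for s in range(6):
    print(make_pair(gold, s))
```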
Step4, constructing the knowledge base question-answering model: by pairing the natural language question with the fact texts generated from the candidate answers in Step3, the knowledge base question-answering task is converted into a semantic matching task between sentences. Each fact text and the corresponding question are encoded into vector representations, the similarities between the vectors are computed and ranked, and the triple corresponding to the fact text with the highest similarity is taken as the final answer to the question.
Step4.1, BERT-based text representation: the invention uses BERT, a multi-layer bidirectional Transformer encoder that captures the semantic information of text well, to learn sentence representations. The process is shown in formulas (7)-(9):

$X_i W_{em} = E_i$ (7)

$Q, K, V = f_{qkv}(E_i)$ (8)

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (9)

where $W_{em}$ is the embedding matrix, $d_k$ is the dimension of the K vectors, and the attention output is a feature vector containing context information. The output layer of BERT is pooled with an averaging strategy to obtain a fixed-size question vector $V_q$ and fact text vector $V_a$. The model adopts a twin (Siamese) network architecture so that it can perform semantic similarity search efficiently, as shown in FIG. 3.
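A sketch of this sentence encoder with mean pooling, using the Hugging Face transformers library (the checkpoint name "bert-base-uncased" is an assumption; the patent does not name one). In the twin architecture the same encoder embeds both the question and the fact text:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """Mean-pool BERT's last hidden layer over real (non-padding) tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state     # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (out * mask).sum(1) / mask.sum(1)    # (1, 768)

v_q = embed("Where is Zhang San's birthplace?")
v_a = embed("Where citytown place of birth Zhang San?")
```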
Step4.2, candidate answer ranking: the similarity between the vector representations of the question and the fact text is computed with cosine similarity, as shown in formula (10):

$S(V_q, V_a) = \dfrac{\sum_i V_{qi} V_{ai}}{\sqrt{\sum_i V_{qi}^2}\,\sqrt{\sum_i V_{ai}^2}}$ (10)

where $V_{qi}$ denotes the i-th feature value of the question vector $V_q$, and $V_{ai}$ denotes the i-th feature value of the fact text vector $V_a$. The value of $S(V_q, V_a)$ ranges over $[0, 1]$: 0 indicates that the question is completely unrelated to the semantics of the fact text, and 1 indicates that the question and the fact text have the same semantics. The cosine similarity $S(V_q, V_a)$ between each fact text and the corresponding question is computed as the score of the candidate answer, the scores are ranked numerically, and the candidate answer with the highest score is the final answer.
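Putting the pieces together, candidate ranking might look like the following sketch (reusing the illustrative embed() and fact_text() helpers defined above):

```python
import torch.nn.functional as F

def rank_candidates(question, fact_texts):
    """Score each fact text against the question by cosine similarity, best first."""
    v_q = embed(question)
    scored = [(F.cosine_similarity(v_q, embed(t)).item(), t) for t in fact_texts]
    return sorted(scored, reverse=True)

candidates = [fact_text("Zhang San", "place_of_birth", "Shang Hai"),
              fact_text("Zhang San", "profession", "writer", question_word="What")]
print(rank_candidates("Where is Zhang San's birthplace?", candidates)[0])
```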
To illustrate the performance of the invention, recently proposed state-of-the-art methods, all evaluated on the SimpleQuestions dataset, were chosen as baselines, including: the memory network based method proposed by Bordes et al.; the convolutional neural network (CNN) method proposed by Yin et al.; the attention-based encoder-decoder method proposed by Golub and He; the gated recurrent unit (GRU) neural network method proposed by Lukovnikov et al.; the method of Mohammed et al., which treats relation prediction as a classification problem and solves it with different neural networks; and the method of Huang et al., which maps the text-space vector of the question into the knowledge base space and uses a pre-trained knowledge representation learning model (e.g. TransE).
TABLE 2 Results on the SimpleQuestions dataset
[Table 2 appears as an image in the original publication: accuracy of each baseline and of the proposed method.]
The experimental results are shown in Table 2. The final accuracy of the invention is 76.3%, an improvement over all baseline models: a relative improvement of 21.69% over Bordes et al. and of 1.19% over Huang et al.
The text representation learning of the invention is implemented with BERT. To verify BERT's contribution to model performance, BERT was replaced by a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM), respectively, in comparison experiments.
TABLE 3 BERT effectiveness test results
[Table 3 appears as an image in the original publication: accuracy with BERT, CNN and BiLSTM encoders.]
The results are shown in Table 3: the model using BERT outperforms the CNN and BiLSTM variants by 9.78% and 12.21% respectively, indicating that BERT accounts for much of the performance improvement of the invention.
The robustness of the invention was then verified further. To ensure that the relation types appearing in the test phase never appear in the training and validation phases, no SimpleQuestions-related data was used in the model training phase; instead, the common semantic similarity datasets SNLI and STS benchmark (STSb) were used. SNLI is a dataset containing 570,000 sentence pairs with contradiction, entailment and neutral labels; STSb is a dataset comprising 8,628 sentence pairs from the three categories captions, news and forums, each pair labeled with a similarity score between 0 and 5. BERT is first fine-tuned on SNLI with the cross-entropy loss as the objective function; the SNLI fine-tuning result is then used as the initialization of BERT, and fine-tuning continues on STSb with the mean-squared-error loss as the objective function (a sketch of this two-stage fine-tuning is given below). Because the setting of Huang et al. is closest to ours, Huang et al. is selected as the baseline, and SimpleQuestions is re-split by relation type so that the relation types of the test phase never appear in the training and validation phases.
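A compact sketch of the two-stage fine-tuning under the twin-encoder setup (the classification head over [u, v, |u-v|] and the score scaling are assumptions borrowed from common Siamese-BERT practice, not details stated in the patent; encode() stands for a trainable version of the embed() helper above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

snli_head = nn.Linear(3 * 768, 3)  # entailment / contradiction / neutral

def snli_step(optimizer, encode, sent_a, sent_b, label):
    """Stage 1: cross-entropy objective on SNLI sentence pairs."""
    u, v = encode(sent_a), encode(sent_b)
    logits = snli_head(torch.cat([u, v, (u - v).abs()], dim=-1))
    loss = F.cross_entropy(logits, label)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

def stsb_step(optimizer, encode, sent_a, sent_b, gold_score):
    """Stage 2: MSE between cosine similarity and the gold score scaled to [0, 1]."""
    u, v = encode(sent_a), encode(sent_b)
    loss = F.mse_loss(F.cosine_similarity(u, v), gold_score / 5.0)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()
```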
TABLE 4 Robustness test results
[Table 4 appears as an image in the original publication: accuracy on the relation-type re-split of SimpleQuestions.]
The experimental results are shown in Table 4. When handling relation types that never appeared in the training phase, the accuracy of the invention reaches 51.3%, a relative improvement of 22.72% over the baseline model of Huang et al.
All of the above experiments show that: compared with the other baseline models, the method achieves better results, with accuracy reaching 76.3%; the method is strongly robust and greatly narrows the semantic gap between natural language questions and the triples in the knowledge base; and BERT is key to the improvements in effectiveness and robustness.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A knowledge base question-answering method fusing fact texts is characterized by comprising the following steps:
step1, identifying a subject entity;
step2, candidate answer retrieval: searching candidate answers in a knowledge base Freebase according to the subject entities;
step3, fact text generation: converting each candidate answer into a fact text that expresses it in textual form;
step4, constructing the knowledge base question-answering model: representing the natural language question and the candidate answers as numerical vectors by adopting BERT, calculating the similarity scores among the vectors and selecting the candidate with the highest score as the final answer;
in Step2, the facts meeting the constraint condition are retrieved from the knowledge base, according to the identified subject entity, as candidate answers;
the constraint condition is: if the head entity of a triple in the knowledge base belongs to the subject entities of the question, the triple is listed as a candidate answer;
in Step3, the entities, relation and type in the triple corresponding to each obtained candidate answer are constructed into a fact text through the pattern "question word + tail entity type + relation + head entity".
2. The knowledge base question-answering method fusing fact texts according to claim 1, wherein: in the Step1, a model based on a bidirectional long short-term memory network (BiLSTM) is adopted to execute the topic entity recognition task; a word is mapped into a word vector using a pre-trained GloVe model, and the probability of whether the word is a topic entity word is obtained through the BiLSTM, a fully connected layer and Softmax.
3. The knowledge base question-answering method fusing fact texts according to claim 1, wherein: in the Step4, a multi-layer bidirectional Transformer encoder (BERT) is selected to learn representations of the natural language question and of the fact texts generated in the Step3 to obtain corresponding vector representations; the cosine similarity of the vector representations of the question and each fact text is then computed as the score of the candidate answer, the scores are numerically ranked, and the candidate answer with the highest score is the final answer.
CN202011056492.8A 2020-09-30 2020-09-30 Knowledge base question-answering method integrating fact texts Active CN112256847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011056492.8A CN112256847B (en) 2020-09-30 2020-09-30 Knowledge base question-answering method integrating fact texts

Publications (2)

Publication Number Publication Date
CN112256847A CN112256847A (en) 2021-01-22
CN112256847B (en) 2023-04-07

Family

ID=74233430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011056492.8A Active CN112256847B (en) 2020-09-30 2020-09-30 Knowledge base question-answering method integrating fact texts

Country Status (1)

Country Link
CN (1) CN112256847B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486161A (en) * 2021-05-27 2021-10-08 中国电子科技集团公司电子科学研究院 Intelligent semantic retrieval system based on knowledge graph in special field
CN113946666A (en) * 2021-09-13 2022-01-18 东北大学 Simple question knowledge base question-answering method based on domain perception
CN113934836B (en) * 2021-12-21 2022-03-01 北京云迹科技有限公司 Question reply method and device and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573028A (en) * 2015-01-14 2015-04-29 百度在线网络技术(北京)有限公司 Intelligent question-answer implementing method and system
CN104809176A (en) * 2015-04-13 2015-07-29 中央民族大学 Entity relationship extracting method of Zang language
CN104820694A (en) * 2015-04-28 2015-08-05 中国科学院自动化研究所 Automatic Q&A method and system based on multi-knowledge base and integral linear programming ILP
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Mapping knowledge domain questioning and answering system and method based on template matching technique
CN107967285A (en) * 2016-10-20 2018-04-27 富士通株式会社 Data processing method and data processing equipment
CN108509519A (en) * 2018-03-09 2018-09-07 北京邮电大学 World knowledge collection of illustrative plates enhancing question and answer interactive system based on deep learning and method
CN108694469A (en) * 2018-06-08 2018-10-23 哈尔滨工程大学 A kind of Relationship Prediction method of knowledge based collection of illustrative plates
CN110263141A (en) * 2019-06-25 2019-09-20 杭州微洱网络科技有限公司 A kind of customer service question answering system based on BERT
CN110569366A (en) * 2019-09-09 2019-12-13 腾讯科技(深圳)有限公司 text entity relation extraction method and device and storage medium
CN110569368A (en) * 2019-09-12 2019-12-13 南京大学 Query relaxation method for questions and answers of RDF knowledge base

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130164727A1 (en) * 2011-11-30 2013-06-27 Zeljko Dzakula Device and method for reinforced programmed learning
CN105701253B (en) * 2016-03-04 2019-03-26 南京大学 The knowledge base automatic question-answering method of Chinese natural language question semanteme
US20180107916A1 (en) * 2016-10-14 2018-04-19 Louis P. Archambault System and method of transforming dimensional-triad-fractals from input data into physical conscious solutions
CN107748757B (en) * 2017-09-21 2021-05-07 北京航空航天大学 Question-answering method based on knowledge graph
CN110516055A (en) * 2019-08-16 2019-11-29 西北工业大学 A kind of cross-platform intelligent answer implementation method for teaching task of combination BERT
CN111125316B (en) * 2019-12-26 2022-04-22 北京工业大学 Knowledge base question-answering method integrating multiple loss functions and attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
On the Application of Entity-Relation Information in Answer Selection Networks; Mao Peng et al.; Electronic Technology & Software Engineering; Vol. 2019, No. 24; 23-25 *
Knowledge Base Question Answering Method Fusing Fact Texts; Wang Guangxiang et al.; Pattern Recognition and Artificial Intelligence; Vol. 34, No. 03; 267-274 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant