CN112115238B - Question-answering method and system based on BERT and knowledge base

Question-answering method and system based on BERT and knowledge base

Info

Publication number
CN112115238B
CN112115238B
Authority
CN
China
Prior art keywords
bert
question
answer
text
knowledge base
Prior art date
Legal status
Active
Application number
CN202011177960.7A
Other languages
Chinese (zh)
Other versions
CN112115238A (en)
Inventor
廖伟智
黄明彤
阴艳超
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202011177960.7A
Publication of CN112115238A
Application granted
Publication of CN112115238B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation

Abstract

The invention discloses a question-answering method and system based on BERT and a knowledge base, applied to the field of information retrieval and aimed at the defects of existing knowledge-base question-answering systems. A BERT-CRF named entity recognition model and a BERT text similarity binary classification model are constructed and trained; the two trained models then process the question corpus to be answered, obtain the correct answer to the question, and automatically rewrite the answer.

Description

Question-answering method and system based on BERT and knowledge base
Technical Field
The invention belongs to the field of information retrieval, and particularly relates to question-and-answer retrieval technology.
Background
Traditional question-and-answer search is based on keyword retrieval and does not consider the semantic information of the question text. In a knowledge-base question-answering system, a questioner inputs a specific question text; the system analyzes and processes the question text online, then retrieves and outputs the best-matching answer text, so that the questioner obtains a quick and accurate answer.
The knowledge base question-answering system and method are mainly divided into three categories:
1) Information retrieval-based method
The question entity and the attribute relation are extracted from the question text, which are then used to search the knowledge base.
2) Method based on semantic analysis
The question text is parsed into a logical expression used to search the knowledge base; the search result is then converted into an answer.
3) Deep learning-based method
The question text is preprocessed into vectorized input, the triple texts in the knowledge base are mapped into the same vector space, and similarity is analyzed and computed to obtain the triple with the highest similarity.
The prior art has the defects that:
1. methods based on semantic analysis face obstacles between logical expressions and natural-language semantics;
2. methods based on information retrieval cannot analyze the semantic information in the question text, and in particular cannot fully exploit context information to disambiguate entities;
3. existing models such as CNN, RNN and Bi-LSTM do not reach the training effect, accuracy, F1 value and the like of leading-edge models such as BERT and Transformer, and lack correlation analysis between the words or characters in the question text.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a question-answering method and system based on BERT (Bidirectional Encoder Representations from Transformers) and a knowledge base.
One of the technical schemes adopted by the invention is as follows: a question-answering method based on BERT and a knowledge base comprises the following steps:
A. acquiring question and answer corpora used for constructing a knowledge base and used for BERT downstream task training, and preprocessing the question and answer corpora;
B. constructing a question-answer knowledge base from the question-answer corpus preprocessed in step A;
C. constructing a BERT-based language model from the question-answer corpus preprocessed in step A;
D. acquiring the training question-answer corpus data of the BERT language model according to step C, and labeling it to form a labeled corpus;
E. constructing a named entity recognition model based on BERT-CRF and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
F. constructing a text similarity binary classification model based on BERT and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
G. training the BERT-CRF (Conditional Random Field) model obtained in step E and the BERT text similarity binary classification model obtained in step F with the labeled corpus, to obtain a parameter-weighted BERT-CRF language model and a parameter-weighted BERT text similarity binary classification model;
H. using the parameter-weighted models obtained in steps E, F and G, together with the question-answer knowledge base obtained in step B, processing the question corpus to be answered to obtain the correct answer to the question, and automatically rewriting the answer.
The question-answer corpus preprocessed in step A comprises: an entity labeling data set, a sample set for sentence similarity matching derived from the entity labeling data set, and a triple set comprising question entities, attribute entities and answer texts.
Step B constructs the question-answer knowledge base from the triple set.
The second technical scheme adopted by the invention is as follows: a question-answering system based on BERT and a knowledge base comprises a question text input module, a BERT-CRF named entity recognition module, a knowledge base retrieval module, a BERT attribute recognition module and an answer generation module. The question text input module is used for inputting a question text and vectorizing the text; the BERT-CRF named entity recognition module is used for performing named entity recognition on the question text and recognizing the question entity; the knowledge base retrieval module is used for retrieving the question entity to obtain candidate triples, feeding the candidate attributes back to the BERT attribute recognition module, and combining the best attribute fed back by the BERT attribute recognition module with the question entity to obtain the final best triple; the BERT attribute recognition module is used for performing correlation analysis on the candidate attributes and the question text to obtain the best attribute, and feeding the best attribute back to the knowledge base retrieval module; and the answer generation module is used for rewriting the best triple obtained by the knowledge base retrieval module into an answer text and outputting it to the questioner.
The invention has the following beneficial effects: the question-answering method and system based on BERT and a knowledge base combine a BERT-CRF named entity model with a BERT text similarity binary classification model and use the multi-head attention mechanism to better exploit the relations between words or characters, obtaining deeper semantic representations through BERT word embeddings. The BERT-CRF named entity recognition model reaches an average F1 value of 99.4% on the NLPCC-ICCPOL 2016 KBQA public data set, improving recognition accuracy in the question-answering process, and more accurate answers are obtained in combination with knowledge base retrieval.
Drawings
FIG. 1 is a flow chart of a protocol of the present invention;
FIG. 2 is a diagram illustrating an overall architecture of a BERT pre-training language model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a bidirectional Transformer layer according to an embodiment of the present invention;
FIG. 4 is a named entity recognition model based on BERT-CRF and language model provided by the embodiment of the invention;
FIG. 5 is a two-classification model of text similarity based on BERT and language models provided by an embodiment of the present invention;
FIG. 6 is a block diagram of the question-answering system based on BERT and a knowledge base according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
As shown in fig. 1, the question-answering method based on BERT and knowledge base of the present invention includes the following steps:
A. acquiring question and answer corpora used for constructing a knowledge base and used for BERT downstream task training, and preprocessing the question and answer corpora;
B. constructing a question-answer knowledge base from the question-answer corpus preprocessed in step A, wherein each triple is composed of a question entity, an attribute entity and an answer text, and the triples are stored as the question-answer knowledge base;
C. constructing a BERT-based language model from the question-answer corpus preprocessed in step A;
D. acquiring the training question-answer corpus data of the BERT language model according to step C, and labeling it to form a labeled corpus;
E. constructing a named entity recognition model based on BERT-CRF and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
F. constructing a text similarity binary classification model based on BERT and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
G. training the BERT-CRF model obtained in step E and the BERT text similarity binary classification model obtained in step F with the labeled corpus, to obtain a parameter-weighted BERT-CRF language model and a parameter-weighted BERT text similarity binary classification model;
H. using the parameter-weighted models obtained in steps E, F and G, together with the question-answer knowledge base obtained in step B, processing the question corpus to be answered to obtain the correct answer to the question, and automatically rewriting the answer.
In step A, question-answer corpora used for constructing the knowledge base and for BERT downstream task training are acquired and preprocessed. This specifically comprises the following steps:
A1. dividing the original question-answer pair data into a training set, a validation set and a test set, wherein each pair of data comprises four components: question text, question entity, attribute entity and answer text;
Raw data example:
What types of patents are there? (question text); patent (question entity); type (attribute entity); invention, utility model and design (answer text);
A2. automatically generating entity labeling data from the original question-answer pair training, validation and test sets, i.e., constructing sample sets for entity recognition training: an entity recognition training set, validation set and test set in which the entity sequences are labeled, used for training the BERT-CRF model;
A3. from the data in the entity recognition training, validation and test sets of step A2, constructing attribute-association training, validation and test sets of sample pairs for sentence similarity matching, used for the binary classification task, i.e., for training the BERT binary classification model;
A4. processing the original data used for constructing the question-answer knowledge base: the original data containing question texts, question entities, attribute entities and answer texts are processed into a clean triple data set comprising question entities, attribute entities and answer texts.
Triple data example:
{ (question entity), (attribute entity), (answer text) }
{ patent, type, invention, utility model and design }
The triple data set processed in step A4 is then loaded into and stored in a database.
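As an illustration of step A4, the following minimal Python sketch builds the clean triple set and loads it into a database; the SQLite backend, table name and field names are assumptions, since the patent does not name a specific database.

```python
import sqlite3

def build_triples(raw_rows):
    """Strip the question text from raw rows, keeping (question entity, attribute entity, answer text)."""
    return [(r["question_entity"], r["attribute_entity"], r["answer_text"]) for r in raw_rows]

raw = [{"question_text": "What types of patents are there?",
        "question_entity": "patent", "attribute_entity": "type",
        "answer_text": "invention, utility model and design"}]

conn = sqlite3.connect("kbqa.db")
conn.execute("CREATE TABLE IF NOT EXISTS triples (entity TEXT, attribute TEXT, answer TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", build_triples(raw))
conn.commit()
```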
Step C constructs the BERT-based language model from the question-answer corpus preprocessed in step A. It comprises the following steps:
A BERT pre-training language model is constructed; this model has strong language feature extraction capability and allows downstream tasks to extract features online. The overall architecture of the BERT pre-training language model is shown in FIG. 2, and the construction process comprises the following sub-steps:
C1. constructing the Embedding layer, which is formed by summing three types of embeddings (Token Embeddings, Segment Embeddings and Position Embeddings), as sketched in code after the three items below:
token entries are word vectors, the first word is the CLS Token, and can be used for subsequent classification tasks
Segment Embeddings are used to distinguish two sentences because pre-training does not just do LM but also do classification tasks with two sentences as input
Position Embeddings are learned from trigonometric functions
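The following is a minimal PyTorch sketch of such an Embedding layer summing the three embeddings; the vocabulary size, hidden dimension and maximum length are illustrative defaults, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)      # word vectors; [CLS] is the first token
        self.segment = nn.Embedding(2, hidden)             # distinguishes sentence A from sentence B
        self.position = nn.Embedding(max_len, hidden)      # learned position embeddings
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        # positions 0..L-1 broadcast over the batch; the three embeddings are summed
        pos_ids = torch.arange(token_ids.size(1)).unsqueeze(0).expand_as(token_ids)
        return self.norm(self.token(token_ids) + self.segment(segment_ids) + self.position(pos_ids))
```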
C2. Masked LM, used for training a deep bidirectional language representation: a portion of the original corpus is masked, and the masked words or characters are then predicted from their context. 15% of the characters in each sentence are randomly selected for prediction. Of these, 80% are replaced by the [MASK] token, e.g. "What types of patents are there?" → "What types of [MASK][MASK] are there?"; 10% are replaced by a randomly chosen word, e.g. "What types of patents are there?" → "What types of apples are there?"; and the remaining 10% are kept unchanged, i.e. "What types of patents are there?" → "What types of patents are there?".
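A minimal sketch of this 15% / 80-10-10 masking procedure, assuming a simple token-list representation of a sentence (the function name and vocabulary here are illustrative, not part of the patent):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Randomly choose ~15% of positions; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    masked, labels = list(tokens), {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            labels[i] = tokens[i]                 # target to predict from context
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with the mask token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random token
            # else: 10% keep the original token
    return masked, labels
```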
C3. constructing the bidirectional Transformer layer structure, a deep network based on the self-attention mechanism; the structure is shown in FIG. 3.
The key part of this layer structure is the self-attention mechanism, which obtains word representations mainly by adjusting a weight coefficient matrix according to the degree of association between words in the same sentence:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein Q denotes the Query matrix, K the Key matrix and V the Value matrix, with $Q,K,V\in R^{n\times d_k}$, where R denotes the set of real numbers and $d_k$ is the input vector dimension of Q and K; $\sqrt{d_k}$ is a scaling (penalty) factor. Through the self-attention mechanism, the characters or words in a sentence are related to one another, expressing to a certain extent the relevance of different words or characters within the sentence.
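A minimal PyTorch sketch of this scaled dot-product self-attention formula (single-head, with batching and masking omitted):

```python
import math
import torch

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; weights reflect word-word association."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # sqrt(d_k) is the scaling (penalty) factor
    return torch.softmax(scores, dim=-1) @ V
```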
Each sub-layer (the self-attention layer and the feed-forward network layer) is followed by a residual Add module and a Normalize layer-normalization module, i.e., the Add & Normalize layers in FIG. 3. The residual connection addresses the difficulty of training deep networks; layer normalization over the last dimension prevents the values in the layers from changing too much, accelerating the training process and allowing the model to converge faster.
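A minimal sketch of one such Add & Normalize step in PyTorch, assuming the hidden size is the last dimension:

```python
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization over the last dimension."""
    def __init__(self, hidden=768):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)  # Add (residual) then Normalize
```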
Step D acquires the training question-answer corpus data of the BERT language model and labels it to form the labeled corpus.
D1. The corpus portion for BERT-CRF entity recognition is labeled with the BIO scheme; only the question entity needs to be labeled, multiple entity types are unnecessary, and character-based BIO labeling is used uniformly. Example (character-by-character labeling of "What types of patents are there?", where the two characters of the entity "patent" are tagged):
专/利/有/哪/些/类/型? → B-NER/I-NER/O/O/O/O/O
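A minimal sketch of this character-based BIO labeling; the helper name is hypothetical, and the entity is assumed to occur contiguously in the question:

```python
def bio_tags(question, entity):
    """Character-based BIO labels: B-NER/I-NER over the question entity, O elsewhere."""
    tags = ["O"] * len(question)
    start = question.find(entity)
    if start >= 0:
        tags[start] = "B-NER"
        for i in range(start + 1, start + len(entity)):
            tags[i] = "I-NER"
    return list(zip(question, tags))

print(bio_tags("专利有哪些类型?", "专利"))
# [('专', 'B-NER'), ('利', 'I-NER'), ('有', 'O'), ...]
```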
D2. The training corpus of the BERT attribute similarity model is labeled with 0 and 1, and 5 negative samples are automatically drawn at random for each positive sample. Each example takes the form "question + attribute + 0/1" (the original figure shows a table of such labeled examples).
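A minimal sketch of this 0/1 labeling with 5 random negative samples per question; the attribute inventory passed in is an assumption, since the patent does not specify where the negatives are drawn from:

```python
import random

def make_pairs(question, gold_attribute, all_attributes, n_neg=5):
    """One positive pair (label 1) plus up to 5 randomly sampled negative attributes (label 0)."""
    pairs = [(question, gold_attribute, 1)]
    candidates = [a for a in all_attributes if a != gold_attribute]
    for attr in random.sample(candidates, min(n_neg, len(candidates))):
        pairs.append((question, attr, 0))
    return pairs
```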
Step E constructs the named entity recognition model based on BERT-CRF and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D; as shown in FIG. 4, it comprises the following steps:
E1. An entity recognition model is constructed for the downstream entity recognition task; the BERT principle is the same as in step C.
E2. The CRF layer obtains the globally optimal label sequence by considering the adjacency relations between labels; it is used to segment and label sequence data, and is a discriminative method that predicts an output sequence from an input sequence. Applied to named entity recognition, given the text sequence to be predicted $X=\{x_1,x_2,\dots,x_n\}$ and the output prediction sequence of the BERT model $Y=\{y_1,y_2,\dots,y_n\}$, an evaluation score is defined as follows:

$$\mathrm{score}(X,Y)=\sum_{i=0}^{n} W_{y_i,\,y_{i+1}}+\sum_{i=1}^{n} P_{i,\,y_i}$$

wherein W denotes the label transition matrix, $W_{i,j}$ the score of transitioning from label i to label j, n the sequence length, and $P_{i,y_i}$ the score of label $y_i$ at position i.

The probability of a predicted sequence given the original sequence is then computed as:

$$P(Y\mid X)=\frac{\exp(\mathrm{score}(X,Y))}{\sum_{Y'}\exp(\mathrm{score}(X,Y'))}$$

where the sum runs over all possible label sequences $Y'$.
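A minimal PyTorch sketch of the score computation for one tagged sequence, omitting the start/stop transitions at i = 0 and i = n for brevity:

```python
import torch

def crf_score(emissions, tags, transitions):
    """score(X, Y) = sum_i W[y_i, y_{i+1}] + sum_i P[i, y_i] for one tagged sequence.

    emissions:   (n, num_tags) per-position label scores P from the BERT encoder
    tags:        (n,) label ids Y
    transitions: (num_tags, num_tags) label transition matrix W
    """
    emit = emissions[torch.arange(len(tags)), tags].sum()   # sum of P[i, y_i]
    trans = transitions[tags[:-1], tags[1:]].sum()          # sum of W[y_i, y_{i+1}]
    return emit + trans
```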
Step F constructs the text similarity binary classification model based on BERT and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D.
F1. A BERT downstream task is constructed for attribute similarity training and for testing question attributes; the structure is shown in FIG. 5.
Step G trains the BERT-CRF model obtained in step E and the BERT text similarity binary classification model obtained in step F with the labeled corpus, yielding a parameter-weighted BERT-CRF language model and a parameter-weighted BERT text similarity binary classification model.
Step H uses the parameter-weighted BERT-CRF language model and BERT text similarity binary classification model obtained in steps E, F and G, combined with the question-answer knowledge base obtained in step B, to process the question corpus to be answered, obtain the correct answer to the question, and automatically rewrite the answer.
H1. For the question text, the parameter-weighted BERT-CRF model extracts the entity online, and the knowledge base is queried to obtain candidate triples {(question entity), (attribute entity), (answer text)}.
H2. The BERT text attribute similarity binary classification model predicts the relevance between the question text and each candidate attribute entity, keeping the match labeled 1.
H3. The exact triple text is obtained, the correct answer is rewritten, and the answer to the question is output.
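A minimal end-to-end sketch of steps H1-H3; the interfaces of the trained models and the knowledge base (ner_model, sim_model, kb.lookup) are stand-ins, and the answer template is an assumption, since the patent does not specify the rewriting rule:

```python
def answer(question, ner_model, sim_model, kb):
    """Online question answering: NER -> knowledge-base lookup -> attribute matching -> answer rewriting."""
    entity = ner_model(question)                 # H1: extract the question entity with BERT-CRF
    candidates = kb.lookup(entity)               # H1: candidate (entity, attribute, answer) triples
    best = max(candidates,                       # H2: attribute most relevant to the question text
               key=lambda t: sim_model(question, t[1]))
    return f"The {best[1]} of {best[0]} is: {best[2]}."  # H3: rewrite the best triple as an answer
```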
FIG. 6 shows the system part of the invention, comprising: the question text input module, used for inputting a question text and vectorizing the text; the BERT-CRF named entity recognition module, used for performing named entity recognition on the question text and recognizing the question entity; the knowledge base retrieval module, used for retrieving the question entity to obtain candidate triples, feeding the candidate attributes back to the BERT attribute recognition module, and combining the best attribute fed back by the BERT attribute recognition module with the question entity to obtain the final best triple; the BERT attribute recognition module, used for performing correlation analysis on the candidate attributes and the question text to obtain the best attribute and feeding it back to the knowledge base retrieval module; and the answer generation module, used for rewriting the best triple obtained by the knowledge base retrieval module into an answer text and outputting it to the questioner.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of protection is not limited to the specific statements and embodiments recited herein. Various modifications and alterations of this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the scope of the claims of the present invention.

Claims (6)

1. A question-answering method based on BERT and a knowledge base is characterized by comprising the following steps:
A. acquiring question and answer corpora used for constructing a knowledge base and used for BERT downstream task training, and preprocessing the question and answer corpora;
B. constructing a question-answer knowledge base from the question-answer corpus preprocessed in step A;
C. constructing a BERT-based language model from the question-answer corpus preprocessed in step A; step C comprises the following sub-steps:
C1. constructing an Embedding layer formed by summing three types of embeddings: Token Embeddings, Segment Embeddings and Position Embeddings;
C2. Masked LM, used for training a deep bidirectional language representation, specifically: a portion of the original corpus is masked and the masked words or characters are predicted; 15% of the characters in each sentence are randomly selected and predicted from their context, of which 80% are replaced by the [MASK] token, 10% are replaced by a randomly chosen word, and the remaining 10% are kept unchanged;
C3. constructing a bidirectional Transformer layer structure based on the self-attention mechanism;
D. acquiring the training question-answer corpus data of the BERT language model according to step C, and labeling it to form a labeled corpus;
E. constructing a named entity recognition model based on BERT-CRF and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
F. constructing a text similarity binary classification model based on BERT and the language model, from the BERT language model obtained in step C and the labeled corpus preprocessed in step D;
G. training the BERT-CRF model obtained in step E and the BERT text similarity binary classification model obtained in step F with the labeled corpus, to obtain a parameter-weighted BERT-CRF language model and a parameter-weighted BERT text similarity binary classification model;
H. using the parameter-weighted models obtained in steps E, F and G, together with the question-answer knowledge base obtained in step B, processing the question corpus to be answered to obtain the correct answer to the question, and automatically rewriting the answer.
2. The question-answering method based on BERT and a knowledge base according to claim 1, wherein the question-answer corpus preprocessed in step A comprises: an entity labeling data set, a sample set for sentence similarity matching derived from the entity labeling data set, and a triple set comprising question entities, attribute entities and answer texts.
3. The question-answering method based on BERT and a knowledge base according to claim 1, wherein the question-answer knowledge base in step B is constructed from the triple set.
4. The question-answering method based on BERT and a knowledge base according to claim 1, wherein the self-attention mechanism adjusts the weight coefficient matrix according to the degree of association between words in the same sentence to obtain the word representation:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein Q denotes the Query vector, K the Key vector, V the Value vector, $d_k$ the input vector dimension of Q and K, and $\sqrt{d_k}$ is a scaling (penalty) factor.
5. The question-answering method based on BERT and a knowledge base according to claim 1, wherein step D comprises:
D1. labeling the corpus portion for BERT-CRF entity recognition with the BIO scheme;
D2. labeling the training corpus of the BERT attribute similarity model with 0 and 1.
6. A BERT and knowledge base based question-answering system comprising: the system comprises a question text input module, a BERT-CRF named entity identification module, a knowledge base retrieval module, a BERT attribute identification module and an answer generation module;
the construction process of the BERT-CRF named entity recognition module comprises the following steps:
C1. constructing an Embedding layer formed by summing three types of embeddings: Token Embeddings, Segment Embeddings and Position Embeddings;
C2. Masked LM, used for training a deep bidirectional language representation, specifically: a portion of the original corpus is masked and the masked words or characters are predicted; 15% of the characters in each sentence are randomly selected and predicted from their context, of which 80% are replaced by the [MASK] token, 10% are replaced by a randomly chosen word, and the remaining 10% are kept unchanged;
c3, constructing a bidirectional Transformer layer structure based on a self-attention mechanism;
the question text input module is used for inputting a question text and vectorizing the text; the BERT-CRF named entity recognition module is used for carrying out named entity recognition on the question text and recognizing the question entity; the knowledge base retrieval module is used for retrieving the problem entities to obtain candidate triple entities, feeding back the candidate attributes to the BERT attribute identification module, and combining the best attributes fed back by the BERT attribute identification module with the problem entities to obtain the final best triple; the BERT attribute identification module is used for carrying out correlation analysis on the candidate attributes and the problem text to obtain the optimal attributes, and feeding the optimal attributes back to the knowledge base; and the answer generating module is used for rewriting the optimal triple obtained by the knowledge base searching module into an answer text and outputting the answer text to the questioner.
CN202011177960.7A 2020-10-29 2020-10-29 Question-answering method and system based on BERT and knowledge base Active CN112115238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011177960.7A CN112115238B (en) 2020-10-29 2020-10-29 Question-answering method and system based on BERT and knowledge base

Publications (2)

Publication Number Publication Date
CN112115238A CN112115238A (en) 2020-12-22
CN112115238B true CN112115238B (en) 2022-11-15

Family

ID=73794987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011177960.7A Active CN112115238B (en) 2020-10-29 2020-10-29 Question-answering method and system based on BERT and knowledge base

Country Status (1)

Country Link
CN (1) CN112115238B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667808A (en) * 2020-12-23 2021-04-16 沈阳新松机器人自动化股份有限公司 BERT model-based relationship extraction method and system
CN112765314B (en) * 2020-12-31 2023-08-18 广东电网有限责任公司 Power information retrieval method based on power ontology knowledge base
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN113360606A (en) * 2021-06-24 2021-09-07 哈尔滨工业大学 Knowledge graph question-answer joint training method based on Filter
CN113553410B (en) * 2021-06-30 2023-09-22 北京百度网讯科技有限公司 Long document processing method, processing device, electronic equipment and storage medium
CN113435213B (en) * 2021-07-09 2024-04-30 支付宝(杭州)信息技术有限公司 Method and device for returning answers to user questions and knowledge base
CN113689851B (en) * 2021-07-27 2024-02-02 国家电网有限公司 Scheduling professional language understanding system and method
CN113642862A (en) * 2021-07-29 2021-11-12 国网江苏省电力有限公司 Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN113808709B (en) * 2021-08-31 2024-03-22 天津师范大学 Psychological elasticity prediction method and system based on text analysis
CN114398256A (en) * 2021-12-06 2022-04-26 南京行者易智能交通科技有限公司 Big data automatic testing method based on Bert model
CN115422934B (en) * 2022-07-08 2023-06-16 中国科学院空间应用工程与技术中心 Entity identification and linking method and system for space text data
CN116089594B (en) * 2023-04-07 2023-07-25 之江实验室 Method and device for extracting structured data from text based on BERT question-answering model
CN116595148B (en) * 2023-05-25 2023-12-29 北京快牛智营科技有限公司 Method and system for realizing dialogue flow by using large language model
CN116756295B (en) * 2023-08-16 2023-11-03 北京盛通知行教育科技集团有限公司 Knowledge base retrieval method, device and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390023A (en) * 2019-07-02 2019-10-29 安徽继远软件有限公司 A kind of knowledge mapping construction method based on improvement BERT model
CN110516055A (en) * 2019-08-16 2019-11-29 西北工业大学 A kind of cross-platform intelligent answer implementation method for teaching task of combination BERT
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method
CN111090990A (en) * 2019-12-10 2020-05-01 中电健康云科技有限公司 Medical examination report single character recognition and correction method
CN111414465A (en) * 2020-03-16 2020-07-14 北京明略软件系统有限公司 Processing method and device in question-answering system based on knowledge graph
CN111563383A (en) * 2020-04-09 2020-08-21 华南理工大学 Chinese named entity identification method based on BERT and semi CRF
CN111680511A (en) * 2020-04-21 2020-09-18 华东师范大学 Military field named entity identification method with cooperation of multiple neural networks
CN111767368A (en) * 2020-05-27 2020-10-13 重庆邮电大学 Question-answer knowledge graph construction method based on entity link and storage medium
CN111831792A (en) * 2020-07-03 2020-10-27 国网江苏省电力有限公司信息通信分公司 Electric power knowledge base construction method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442676A (en) * 2019-07-02 2019-11-12 北京邮电大学 Patent retrieval method and device based on more wheel dialogues
CN110765257B (en) * 2019-12-30 2020-03-31 杭州识度科技有限公司 Intelligent consulting system of law of knowledge map driving type
CN111159385B (en) * 2019-12-31 2023-07-04 南京烽火星空通信发展有限公司 Template-free general intelligent question-answering method based on dynamic knowledge graph


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of Chinese Intelligent Question Answering Technology Based on Machine Reading Comprehension; Jia Xin; China Master's Theses Full-text Database, Information Science and Technology; 2020-07-15; I138-1594 *

Also Published As

Publication number Publication date
CN112115238A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN111444721B (en) Chinese text key information extraction method based on pre-training language model
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN112989834B (en) Named entity identification method and system based on flat grid enhanced linear converter
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN109271537B (en) Text-to-image generation method and system based on distillation learning
CN111046179B (en) Text classification method for open network question in specific field
CN111858896B (en) Knowledge base question-answering method based on deep learning
CN112270188B (en) Questioning type analysis path recommendation method, system and storage medium
CN113962219A (en) Semantic matching method and system for knowledge retrieval and question answering of power transformer
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN112328800A (en) System and method for automatically generating programming specification question answers
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN114491024A (en) Small sample-based specific field multi-label text classification method
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN112800184A (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text
CN113641809A (en) XLNET-BiGRU-CRF-based intelligent question answering method
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant