CN113641809B - Intelligent question-answering method based on XLnet model and knowledge graph - Google Patents

Intelligent question-answering method based on XLnet model and knowledge graph Download PDF

Info

Publication number
CN113641809B
CN113641809B (application CN202110913182.1A)
Authority
CN
China
Prior art keywords
model
xlnet
text
vector
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110913182.1A
Other languages
Chinese (zh)
Other versions
CN113641809A (en)
Inventor
刘大伟
胡笳
车少帅
张邱鸣
张玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CLP Hongxin Information Technology Co., Ltd.
Original Assignee
CLP Hongxin Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CLP Hongxin Information Technology Co., Ltd.
Priority to CN202110913182.1A
Publication of CN113641809A
Application granted
Publication of CN113641809B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F16/33 Querying
                            • G06F16/332 Query formulation
                                • G06F16/3329 Natural language query formulation or dialogue systems
                • G06F40/00 Handling natural language data
                    • G06F40/10 Text processing
                        • G06F40/166 Editing, e.g. inserting or deleting
                            • G06F40/169 Annotation, e.g. comment data or footnotes
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
                        • G06F40/279 Recognition of textual entities
                            • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
                                • G06F40/295 Named entity recognition
                    • G06F40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent question-answering method based on the XLNet model and a knowledge graph, comprising the following steps: training an XLNet Chinese model; acquiring corpus data; constructing and training an XLNet-BiGRU-CRF neural network model; performing entity recognition on the text of the user question to be processed; and extracting, according to the entity recognition result, several related questions containing the corresponding entity from the database, comparing the user question against each related question by cosine similarity of their Embedding sentence vectors, taking the answer of the related question with the highest similarity score as the target result, and at the same time offering the related questions ranked second and third by similarity score to the user as similar questions for reference. The method processes the text of the user question with the trained models and, combined with knowledge-graph retrieval, obtains answers to questions more quickly and accurately.

Description

Intelligent question-answering method based on XLnet model and knowledge graph
Technical Field
The invention belongs to the technical field of intelligent question answering, and in particular relates to an intelligent question-answering method based on the XLNet model and a knowledge graph.
Background
In recent years, with the development of big data and artificial intelligence technology, question-answering systems have been applied across many industries and have become a key component of intelligent robots, forming an important link in communication between robots and humans.
Conventional question-answering systems are generally based on keyword retrieval and do not consider the semantic information of the question. A question-answering system based on a knowledge graph can analyze the text of a specific question posed by a questioner online, then retrieve and output the best-matching answer, so an accurate answer to the question can be obtained quickly. A knowledge graph typically stores data in triple format, for example "<Higher Mathematics> <publisher> <Wuhan University Press>", where "Higher Mathematics" and "Wuhan University Press" are two entities and "publisher" is the relation between them. The input to such a question-answering system is a sentence-level text query; the system then finds the triple or set of triples most relevant to the query in the knowledge base and returns the corresponding entities in those triples.
The current mainstream methods are relation-classification-based methods, retrieval-based methods and semantic-parsing-based methods. Taking relation classification as an example, such a method predicts the entity and the relation from the question, then finds the answer entity according to that entity and relation. These methods share the drawback that the prediction model must be trained on questions paired with corresponding logical-expression data; compared with constructing a knowledge graph, labeling such a specialized data set costs more and requires annotators to master certain expertise, including domain knowledge and query-language knowledge. Semantic-parsing-based methods, meanwhile, face a gap between logical expressions and natural-language semantics. At the same time, compared with cutting-edge models such as BERT and XLNet (Generalized Autoregressive Pretraining for Language Understanding), common models such as CNN and LSTM train less effectively, are less accurate, and lack correlation analysis between the characters or words in the question text.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides an intelligent question-answering method based on the XLNet model and a knowledge graph. To achieve this purpose, the invention adopts the following technical scheme:
an intelligent question-answering method based on an XLnet model and a knowledge graph comprises the following steps:
step 1: training an XLNet Chinese model on a large-scale unlabeled corpus, wherein the XLNet model comprises a permutation language model, a dual-stream attention mechanism and the Transformer-XL core component;
step 2: acquiring training corpus data for constructing the knowledge graph and the named entity recognition model, preprocessing and labeling the training corpus data, storing the triple data obtained by preprocessing in a Neo4j database, and extracting, with the XLNet Chinese model trained in step 1, the Embedding sentence vectors of the questions corresponding to the triple data and storing them in the Neo4j database; each triple consists of a question entity, a question attribute and an answer;
step 3: constructing an XLNet-BiGRU-CRF neural network model on the basis of the XLNet Chinese model trained in step 1, and training the XLNet-BiGRU-CRF model with the training corpus data labeled in step 2;
step 4: performing entity recognition on the text of the user question to be recognized with the trained XLNet-BiGRU-CRF model to obtain the entity recognition result;
step 5: extracting from the Neo4j database, according to the entity recognition result of step 4, the related triple data containing the corresponding entity; extracting the Embedding sentence vector of the user question to be recognized with the XLNet Chinese model and comparing it by cosine similarity against the Embedding sentence vectors of the questions corresponding to the extracted related triple data; taking the answer corresponding to the question with the highest similarity score as the target result, while providing the questions and answers of the related triples ranked second and third by similarity score to the user as similar questions for reference.
Further, the permutation language model described in step 1 is used to randomly shuffle the order of the Chinese characters in text sentences. For a given text sequence of length T, the permutations of the Chinese characters in all different orders form the set $A_T$, and $a \in A_T$ denotes one such permutation. The modeling process of the permutation language model is expressed as

$$\max_{\theta}\; \mathbb{E}_{a \sim A_T}\left[ \sum_{t=1}^{T} \log p_{\theta}\left( x_{a_t} \mid x_{a_{<t}} \right) \right]$$

where $\mathbb{E}_{a \sim A_T}$ denotes the expectation over all permutations, $x_{a_t}$ is the t-th element of the text sequence under permutation a, $x_{a_{<t}}$ denotes the 1st to (t-1)-th elements of the text sequence under permutation a, $\theta$ is the model parameter to be trained, and $p_{\theta}$ denotes the conditional probability.
Further, the dual-stream attention mechanism in step 1 includes a text content attention stream and a query attention stream. The text content attention stream is a self-attention mechanism containing both position information and content information; the query attention stream is an input stream containing only position information, so that the content information of the current position is not revealed when the required position is predicted. The two streams are combined to extract the features of the relevant context. The dual-stream attention mechanism is expressed as follows:

query attention stream: $g_{a_t}^{(m)} = \mathrm{Attention}\left( Q = g_{a_t}^{(m-1)},\; KV = h_{a_{<t}}^{(m-1)};\; \theta \right)$

content attention stream: $h_{a_t}^{(m)} = \mathrm{Attention}\left( Q = h_{a_t}^{(m-1)},\; KV = h_{a_{\le t}}^{(m-1)};\; \theta \right)$

where $g_{a_t}^{(m)}$ and $g_{a_t}^{(m-1)}$ are the query attention stream matrix vectors of the m-th and (m-1)-th layers, containing only the position information of the input text; $h_{a_t}^{(m)}$ and $h_{a_t}^{(m-1)}$ are the content attention stream matrix vectors of the m-th and (m-1)-th layers; and $h_{a_{<t}}^{(m-1)}$ is the (m-1)-th-layer content attention stream matrix vector over the 1st to (t-1)-th elements of the text sequence under permutation a. Attention denotes the classical self-attention mechanism, calculated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{dim}} \right) V$$

where Q, K, V are the input word vector matrices and dim is the input vector dimension.
Further, in step 1, the XLNet Chinese model takes the Transformer-XL framework as its core and introduces a recurrence mechanism and a relative position encoding mechanism to exploit the semantic information of the context and mine the latent relations in the text vectors.
In step 3, the feature vectors output by the XLNet Chinese model are input to the BiGRU network, which transmits and truncates information through gates. The state calculation formulas are

$$z_t = \sigma\left( w_z \cdot [h_{t-1}, x_t] \right)$$
$$r_t = \sigma\left( w_r \cdot [h_{t-1}, x_t] \right)$$
$$\tilde{h}_t = \tanh\left( w_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t] \right)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$

where $x_t$ is the input vector at the current time t, i.e. the feature vector of the t-th word in the text; $h_t$ and $h_{t-1}$ are the hidden-layer state matrix vectors at the current time t and the previous time; $\tilde{h}_t$ is the candidate hidden-layer state at time t, which is also the new memory at the current time; $z_t$ is the update gate, controlling how much of the previous state is carried over versus replaced by the candidate state; $r_t$ is the reset gate, controlling the degree to which the state information of the previous time is ignored when forming the candidate state, a smaller $r_t$ indicating that more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ are the weight matrices of the update gate, the reset gate and the candidate hidden state respectively; $\sigma$ denotes the sigmoid nonlinear activation function, tanh the tanh activation function, and $*$ the element-wise (point) multiplication of vectors.
The output vector of the BiGRU network encoding unit is Z; after softmax probability normalization, the output vector Z is input to the CRF layer. For a given input sequence X, the score of a predicted output tag sequence y is defined as S(X, y), where y = (y_1, y_2, …, y_n) denotes the tag sequence of a sentence containing n words. S(X, y) is calculated as

$$S(X, y) = \sum_{t=1}^{n} P_{t, y_t} + \sum_{t=0}^{n} A_{y_t, y_{t+1}}$$

where $P_{t, y_t}$ is an element of the output vector Z of the BiGRU network encoding unit, and $A_{y_{t-1}, y_t}$ is an element of the probability transition matrix output by the CRF layer, representing the transition probability from tag $y_{t-1}$ to tag $y_t$; exploiting the dependencies between tags yields more reasonable tag sequences. The score of the whole tag sequence y is thus a sum of per-position scores, each formed from two parts: the output probability matrix of the BiGRU network encoding unit and the transition probability matrix of the CRF layer. The final predicted probability of the tag sequence y is obtained by normalizing this score:

$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})}}$$

where Y represents all possible tag sequences and $\tilde{y}$ is one of them.

The loss function L of the CRF layer adopts the negative log-likelihood:

$$L = -\log p(y \mid X) = -S(X, y) + \log \sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})}$$

The Adam algorithm is adopted with the loss function of the CRF layer to train and update the parameters of the whole named entity recognition model, including the model parameters of the BiGRU network and the CRF layer, while the parameters of the XLNet Chinese model are kept unchanged.
Further, in step 4, the text of the user question to be recognized is input into the trained XLNet-BiGRU-CRF model; the text is converted into feature vectors by the XLNet Chinese model, the feature vectors undergo feature extraction through the BiGRU network, and finally the most probable labeling sequence of the text is obtained at the CRF layer with the Viterbi algorithm as the result of named entity recognition.
Further, the cosine similarity in step 5 is calculated as:

$$score = \frac{V_{query} \cdot V_{corpus}}{\lVert V_{query} \rVert \, \lVert V_{corpus} \rVert}$$

where score is the similarity value, $V_{query}$ is the Embedding sentence vector of the user question, and $V_{corpus}$ is the Embedding sentence vector of a related question.
The invention has the following advantages and beneficial effects: (1) the XLNet model used in the invention is obtained by unsupervised training on large-scale unlabeled data, and training based on the permutation language model better integrates contextual semantic information, giving strong text-feature expression capability; (2) based on the knowledge graph and the Neo4j database, the stored data set can be visualized more conveniently and retrieval speed is improved; (3) the XLNet model has strong text-feature expression capability, the introduction of the bidirectional GRU recurrent structure better achieves joint encoding of contextual information, and attaching the CRF layer effectively remedies the failure of traditional entity recognition to consider the dependencies between tags; combining the three further improves the accuracy of the recognition result.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a sample of a preprocessed corpus used to construct a named entity recognition model.
Detailed Description
The present invention is further described below with reference to the accompanying drawings, to facilitate understanding of its structure, features and technical content.
As shown in fig. 1, the method of the present invention comprises the steps of:
s1, training an XLNet Chinese model based on a large-scale non-labeling corpus.
The XLNet Chinese model mainly comprises a permutation language model, a dual-stream attention mechanism and the Transformer-XL core component. The purpose of the permutation language model is to randomly shuffle the Chinese characters in a text sentence: for a Chinese character $x_i$, the sequence of characters $\{x_{i+1}, \dots, x_n\}$ originally appearing after it may also appear before it. Assume that all permutations of a text sequence of length T form the set $A_T$, and let $a \in A_T$ be one such permutation. The modeling process of the permutation language model is expressed as

$$\max_{\theta}\; \mathbb{E}_{a \sim A_T}\left[ \sum_{t=1}^{T} \log p_{\theta}\left( x_{a_t} \mid x_{a_{<t}} \right) \right]$$

where $x_{a_t}$ is the t-th element of the text sequence under permutation a, $x_{a_{<t}}$ denotes the 1st to (t-1)-th elements of the text sequence under permutation a, $\theta$ is the model parameter to be trained, and $p_{\theta}$ denotes the conditional probability.
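To make the objective above concrete, here is a minimal Python sketch of how one sampled permutation $a \in A_T$ determines the context $x_{a_{<t}}$ available for each prediction; this is a toy illustration, not the patent's implementation:

```python
import random

def permutation_steps(tokens):
    """For one sampled factorization order a, pair each prediction target
    x_{a_t} with the context x_{a<t} it is allowed to condition on."""
    order = list(range(len(tokens)))
    random.shuffle(order)                          # a ~ A_T: one random permutation
    steps = []
    for t, pos in enumerate(order):
        context = [tokens[p] for p in order[:t]]   # x_{a<t}
        steps.append((tokens[pos], context))
    return steps

# Toy run: every character is predicted exactly once, but the context it sees
# mixes characters that originally appeared before and after it.
for target, context in permutation_steps(["高", "等", "数", "学"]):
    print(f"predict {target!r} given {context}")
```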
XLNet adopts a dual-stream attention mechanism, in which the text content attention stream is a self-attention mechanism containing both position information and content information, while the query attention stream is an input stream containing only position information, so that no content information of the current position is revealed when the query attention stream is used to predict the required position. The two streams complement each other and better extract the features of the relevant context. The dual-stream attention mechanism is expressed as follows:

query attention stream: $g_{a_t}^{(m)} = \mathrm{Attention}\left( Q = g_{a_t}^{(m-1)},\; KV = h_{a_{<t}}^{(m-1)};\; \theta \right)$

content attention stream: $h_{a_t}^{(m)} = \mathrm{Attention}\left( Q = h_{a_t}^{(m-1)},\; KV = h_{a_{\le t}}^{(m-1)};\; \theta \right)$

where $g_{a_t}^{(m)}$ and $g_{a_t}^{(m-1)}$ are the query attention stream matrix vectors of the m-th and (m-1)-th layers, containing only the position information of the input text, and $h_{a_t}^{(m)}$ and $h_{a_t}^{(m-1)}$ are the content attention stream matrix vectors of the m-th and (m-1)-th layers, containing both the content information and the position information of the input text. Attention denotes the classical self-attention mechanism, calculated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{dim}} \right) V$$

where Q, K, V are the input word vector matrices and dim is the input vector dimension.
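The classical self-attention formula above can be reproduced in a few lines of NumPy; the following is an illustrative sketch of the computation, not code from the patent:

```python
import numpy as np

def attention(Q, K, V):
    """Classical scaled dot-product attention: softmax(Q K^T / sqrt(dim)) V."""
    dim = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional word vectors
print(attention(Q, K, V).shape)       # (4, 8): one context vector per token
```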
The XLNet Chinese model takes the Transformer-XL framework as its core, introducing a recurrence mechanism and relative position encoding, and can thus better exploit contextual semantic information to mine latent relations in text vectors.
The XLNet Chinese model is trained on large-scale unlabeled data to obtain the corresponding model parameters; the feature vector representation of an input sequence can then be obtained by inference.
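As a sketch of this inference step, the snippet below obtains per-token feature vectors and a mean-pooled Embedding sentence vector. It assumes the Hugging Face transformers library and the public hfl/chinese-xlnet-base checkpoint; the patent names neither, so treat both as stand-ins for the trained XLNet Chinese model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one Embedding sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

vec = sentence_embedding("高等数学的出版社是哪个?")
print(vec.shape)   # torch.Size([768])
```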
S2, training corpus data for constructing a knowledge graph and a named entity recognition model are obtained, and the data are preprocessed and labeled.
The triple data obtained by preprocessing the corpus data is stored in a Neo4j database to construct the knowledge graph; the data set generally consists of triples composed of a question entity, a question attribute and an answer (for example <Higher Mathematics> <publisher> <Wuhan University Press>). Meanwhile, the Embedding sentence vector of each triple's question is extracted with the XLNet Chinese model trained in S1 and stored in the Neo4j database.
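A minimal sketch of this storage step with the official neo4j Python driver follows; the node labels, relationship type, property names and credentials are illustrative choices, as the patent does not prescribe a schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_triple(tx, entity, attribute, answer, question, embedding):
    """MERGE one <question entity, attribute, answer> triple and attach the
    question text and its Embedding sentence vector to the relationship."""
    tx.run(
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (ans:Answer {text: $answer}) "
        "MERGE (e)-[r:ATTRIBUTE {name: $attribute}]->(ans) "
        "SET r.question = $question, r.embedding = $embedding",
        entity=entity, attribute=attribute, answer=answer,
        question=question, embedding=embedding,
    )

with driver.session() as session:
    session.execute_write(
        store_triple, "高等数学", "出版社", "武汉大学出版社",
        "高等数学的出版社是哪个?",
        [0.12, -0.33, 0.05],   # truncated stand-in for the real sentence vector
    )
driver.close()
```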
The question texts are labeled according to the triple entities to construct the labeled corpus for the named entity recognition model. Since only entities need to be recognized, the labels ["O", "B-LOC", "I-LOC"] are used, where O marks other, non-entity characters, B-LOC marks the start of an entity, and I-LOC marks a non-initial character of an entity, as shown in fig. 2.
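A small sketch of the resulting character-level labeling, with an invented sentence in the spirit of fig. 2:

```python
# 高等数学 is the entity: B-LOC on its first character, I-LOC on the rest;
# every other character is labeled O.
chars = list("高等数学的出版社是哪个")
labels = ["B-LOC", "I-LOC", "I-LOC", "I-LOC"] + ["O"] * 7
for c, l in zip(chars, labels):
    print(c, l)
```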
S3, constructing an XLNet-BiGRU-CRF neural network model on the basis of the XLNet Chinese model trained in S1, and training the model with the data labeled in S2.
Firstly, the labeled corpus is input into the trained XLNet Chinese model, which outputs feature vectors; the feature vectors are then input into the BiGRU neural network model. The BiGRU network is in fact a simplification of the LSTM network; it transmits and truncates information through gates, and its state calculation formulas are as follows:
wherein x is t An input vector representing the current t moment, and a feature vector representing the t word in the text; h is a t 、h t-1 The hidden layer state matrix vectors at the current time t and the previous time t are respectively represented;the candidate hidden layer state at the current time t is also the new memory at the current time. z t Indicating the update gate for controlling the extent to which the state information of the previous moment is brought into the current state, z t The larger the value of (c) indicates the more state information remains at the previous time; r is (r) t Represents a reset gate for controlling the degree of ignoring state information of the previous moment, r t Smaller values of (c) indicate more rejection. w (w) z 、w r 、/>Respectively representing the weight matrix of the update gate, the reset gate and the candidate hidden states. Sigma represents a sigmoid nonlinear activation function, tanh represents a tanh activation function, and x represents a point multiplication of a vector.
The output vector of the BiGRU network encoding unit is Z; after softmax probability normalization, the output vector Z is input to the CRF layer. For a given input sequence X, the score of a predicted output tag sequence y is defined as S(X, y), where y = (y_1, y_2, …, y_n) denotes the tag sequence of a sentence containing n words. S(X, y) is calculated as

$$S(X, y) = \sum_{t=1}^{n} P_{t, y_t} + \sum_{t=0}^{n} A_{y_t, y_{t+1}}$$

where $P_{t, y_t}$ is an element of the output vector Z of the BiGRU network encoding unit, and $A_{y_{t-1}, y_t}$ is an element of the probability transition matrix output by the CRF layer, representing the transition probability from tag $y_{t-1}$ to tag $y_t$; exploiting the dependencies between tags yields more reasonable tag sequences. The score of the whole tag sequence y is thus a sum of per-position scores, each formed from two parts: the output probability matrix of the BiGRU network encoding unit and the transition probability matrix of the CRF layer. Normalizing this score gives the final predicted probability of the tag sequence y:

$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})}}$$

where Y represents all possible tag sequences and $\tilde{y}$ is one of them. The loss function of the CRF layer adopts the negative log-likelihood:

$$L = -\log p(y \mid X) = -S(X, y) + \log \sum_{\tilde{y} \in Y} e^{S(X, \tilde{y})}$$

The parameters of the whole named entity recognition model are trained and updated with the Adam algorithm using the loss function of the CRF layer; these include the model parameters of the BiGRU neural network and the CRF layer, while the parameters of the XLNet Chinese model are kept unchanged. Training stops when the loss value generated by the model meets the set requirement or the set maximum number of iterations is reached.
S4, performing entity recognition on the text of the user question with the XLNet-BiGRU-CRF model trained in S3 to obtain the recognition result, mainly comprising the following steps:
s4-1, inputting text data to be identified into a trained XLnet-BiGRU-CRF neural network model;
s4-2, converting the text data into feature vectors after passing through an XLNet Chinese model, carrying out feature extraction on the feature vectors through a BiGRU network, and finally solving the maximum possible labeling sequence in the text by adopting a Viterbi algorithm in a CRF layer, namely, obtaining a named entity recognition result.
S5, extracting the related questions containing the corresponding entity from the Neo4j database according to the named entity recognition result, extracting the Embedding sentence vector of the user question to be recognized with the XLNet Chinese model, and then comparing it by cosine similarity against the Embedding sentence vectors of those related questions already stored in the Neo4j database in S2; the answer of the question with the highest similarity is taken as the target result, while the questions ranked second and third are provided to the user as similar questions for reference. The corresponding cosine similarity is calculated as follows:
$$score = \frac{V_{query} \cdot V_{corpus}}{\lVert V_{query} \rVert \, \lVert V_{corpus} \rVert}$$

where score is the similarity value, $V_{query}$ is the Embedding sentence vector of the user question, and $V_{corpus}$ is the Embedding sentence vector of a related question.
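A sketch of this S5 ranking step with NumPy; the candidate questions and all vectors are toy stand-ins for the Embedding sentence vectors stored in Neo4j:

```python
import numpy as np

def cosine(v_query, v_corpus):
    """score = (V_query . V_corpus) / (|V_query| |V_corpus|)"""
    return float(np.dot(v_query, v_corpus) /
                 (np.linalg.norm(v_query) * np.linalg.norm(v_corpus)))

v_query = np.array([0.2, 0.7, 0.1])
candidates = {                              # related question -> stored vector
    "高等数学的出版社是哪个?": np.array([0.2, 0.6, 0.2]),
    "高等数学的作者是谁?": np.array([0.9, 0.1, 0.0]),
    "高等数学的出版年份是哪年?": np.array([0.3, 0.5, 0.2]),
}
ranked = sorted(candidates, key=lambda q: cosine(v_query, candidates[q]),
                reverse=True)
print("target result:", ranked[0])          # answer of the best match is returned
print("similar questions:", ranked[1:3])    # second and third offered for reference
```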
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples; all technical solutions falling under the concept of the present invention belong to its protection scope. It should be noted that modifications and adaptations that do not depart from the principles of the invention are also intended to fall within the protection scope of the claims.

Claims (2)

1. An intelligent question-answering method based on the XLNet model and a knowledge graph, characterized by comprising the following steps:
step 1: training an XLNet Chinese model on a large-scale unlabeled corpus, wherein the XLNet Chinese model comprises a permutation language model, a dual-stream attention mechanism and the Transformer-XL core component;
step 2: acquiring training corpus data for constructing the knowledge graph and the named entity recognition model, preprocessing and labeling the training corpus data, storing the triple data obtained by preprocessing in a Neo4j database, and extracting, with the XLNet Chinese model trained in step 1, the Embedding sentence vector of each question corresponding to the triple data and storing it in the Neo4j database; each triple consists of a question entity, a question attribute and an answer;
step 3: constructing an XLNet-BiGRU-CRF neural network model on the basis of the XLNet Chinese model trained in step 1, and training the XLNet-BiGRU-CRF model with the training corpus data labeled in step 2;
step 4: performing entity recognition on the text of the user question to be recognized with the trained XLNet-BiGRU-CRF model to obtain the entity recognition result;
step 5: extracting from the Neo4j database, according to the entity recognition result of step 4, the related triple data containing the corresponding entity; extracting the Embedding sentence vector of the user question to be recognized with the XLNet Chinese model and comparing it by cosine similarity against the Embedding sentence vector of each question corresponding to the extracted related triple data; taking the answer corresponding to the question with the highest similarity score as the target result, while providing the questions and answers of the related triples ranked second and third by similarity score to the user as similar questions for reference;
the permutation language model described in step 1 is used to randomly shuffle the order of the Chinese characters in text sentences; for a given text sequence of length T, the permutations of the Chinese characters in all different orders form the set $A_T$, and $a \in A_T$ denotes one such permutation; the modeling process of the permutation language model is expressed as

$$\max_{\theta}\; \mathbb{E}_{a \sim A_T}\left[ \sum_{t=1}^{T} \log p_{\theta}\left( x_{a_t} \mid x_{a_{<t}} \right) \right]$$

where $\mathbb{E}_{a \sim A_T}$ denotes the expectation over all permutations, $x_{a_t}$ is the t-th element of the text sequence under permutation a, $x_{a_{<t}}$ denotes the 1st to (t-1)-th elements of the text sequence under permutation a, $\theta$ is the model parameter to be trained, and $p_{\theta}$ denotes the conditional probability;
the dual-stream attention mechanism in step 1 includes a text content attention stream and a query attention stream; the text content attention stream is a self-attention mechanism containing both position information and content information, while the query attention stream is an input stream containing only position information, so that the content information of the current position is not revealed when the required position is predicted; the two streams are combined to extract the features of the relevant context; the dual-stream attention mechanism is expressed as follows:

the query attention stream is: $g_{a_t}^{(m)} = \mathrm{Attention}\left( Q = g_{a_t}^{(m-1)},\; KV = h_{a_{<t}}^{(m-1)};\; \theta \right)$

the text content attention stream is: $h_{a_t}^{(m)} = \mathrm{Attention}\left( Q = h_{a_t}^{(m-1)},\; KV = h_{a_{\le t}}^{(m-1)};\; \theta \right)$

where $g_{a_t}^{(m)}$ and $g_{a_t}^{(m-1)}$ are the query attention stream matrix vectors of the m-th and (m-1)-th layers, containing only the position information of the input text; $h_{a_t}^{(m)}$ and $h_{a_t}^{(m-1)}$ are the content attention stream matrix vectors of the m-th and (m-1)-th layers, containing both the content information and the position information of the input text; and $h_{a_{<t}}^{(m-1)}$ is the (m-1)-th-layer content attention stream matrix vector over the 1st to (t-1)-th elements of the text sequence under permutation a; Attention denotes the classical self-attention mechanism, calculated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{dim}} \right) V$$

where Q, K, V are the input word vector matrices and dim is the input vector dimension;
in step 1, the XLNet Chinese model takes the Transformer-XL framework as its core and introduces a recurrence mechanism and a relative position encoding mechanism to exploit the semantic information of the context and mine the latent relations in the text vectors;
in step 3, the feature vectors output by the XLNet Chinese model are input to the BiGRU network, which transmits and truncates information through gates; the state calculation formulas are

$$z_t = \sigma\left( w_z \cdot [h_{t-1}, x_t] \right)$$
$$r_t = \sigma\left( w_r \cdot [h_{t-1}, x_t] \right)$$
$$\tilde{h}_t = \tanh\left( w_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t] \right)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$

where $x_t$ is the input vector at the current time t, i.e. the feature vector of the t-th word in the text; $h_t$ and $h_{t-1}$ are the hidden-layer state matrix vectors at the current time t and the previous time; $\tilde{h}_t$ is the candidate hidden-layer state at time t, which is also the new memory at the current time; $z_t$ is the update gate, controlling how much of the previous state is carried over versus replaced by the candidate state; $r_t$ is the reset gate, controlling the degree to which the state information of the previous time is ignored when forming the candidate state, a smaller $r_t$ indicating that more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ are the weight matrices of the update gate, the reset gate and the candidate hidden state respectively; $\sigma$ denotes the sigmoid nonlinear activation function and tanh the tanh activation function;
the output vector passing through the BiGRU network coding unit is Z, the output vector Z is input to the CRF layer after being subjected to softmax probability normalization, and for a given input sequence X, the probability of a predicted output tag sequence y is defined as S (X, y), wherein y= (y) 1 ,y 2 ,……y n ) The calculation formula of S (X, y) representing the tag sequence with n words contained in the sentence is as follows:
wherein,element with output vector Z for BiGRU network coding unit, < >>Is an element of the probability transition matrix output by the CRF layer, representing the output from the tag y t-1 To y t The transition probability of the tag sequence y is obtained after normalization processing of the formula,
wherein Y represents all possible tag sequences,one of all possible tag sequences;
the loss function L of the CRF layer uses a negative log-likelihood function,
training and updating the parameters of the whole named entity recognition model with the Adam algorithm using the loss function of the CRF layer, the parameters comprising the model parameters of the BiGRU network and the CRF layer, while the parameters of the XLNet Chinese model are kept unchanged;
in step 4, the text of the user question to be recognized is input into the trained XLNet-BiGRU-CRF model and converted into feature vectors by the XLNet Chinese model; the feature vectors undergo feature extraction through the BiGRU network, and finally the most probable labeling sequence of the text is solved at the CRF layer with the Viterbi algorithm as the result of named entity recognition.
2. The intelligent question-answering method based on the XLNet model and knowledge graph according to claim 1, wherein the cosine similarity in step 5 is calculated as:
$$score = \frac{V_{query} \cdot V_{corpus}}{\lVert V_{query} \rVert \, \lVert V_{corpus} \rVert}$$

where score is the similarity value, $V_{query}$ is the Embedding sentence vector of the user question, and $V_{corpus}$ is the Embedding sentence vector of a related question.
CN202110913182.1A 2021-08-10 2021-08-10 Intelligent question-answering method based on XLnet model and knowledge graph Active CN113641809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110913182.1A CN113641809B (en) 2021-08-10 2021-08-10 Intelligent question-answering method based on XLnet model and knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110913182.1A CN113641809B (en) 2021-08-10 2021-08-10 Intelligent question-answering method based on XLnet model and knowledge graph

Publications (2)

Publication Number Publication Date
CN113641809A CN113641809A (en) 2021-11-12
CN113641809B (en) 2023-12-08

Family

ID=78420446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110913182.1A Active CN113641809B (en) 2021-08-10 2021-08-10 Intelligent question-answering method based on XLnet model and knowledge graph

Country Status (1)

Country Link
CN (1) CN113641809B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114582449A (en) * 2022-01-17 2022-06-03 内蒙古大学 Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model
CN114970563B (en) * 2022-07-28 2022-10-25 山东大学 Chinese question generation method and system fusing content and form diversity

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN112650845A (en) * 2020-12-30 2021-04-13 西安交通大学 Question-answering system and method based on BERT and knowledge representation learning
CN113010693A (en) * 2021-04-09 2021-06-22 大连民族大学 Intelligent knowledge graph question-answering method fusing pointer to generate network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10467268B2 (en) * 2015-06-02 2019-11-05 International Business Machines Corporation Utilizing word embeddings for term matching in question answering systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN112650845A (en) * 2020-12-30 2021-04-13 西安交通大学 Question-answering system and method based on BERT and knowledge representation learning
CN113010693A (en) * 2021-04-09 2021-06-22 大连民族大学 Intelligent knowledge graph question-answering method fusing pointer to generate network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Shurong et al. "Sentiment analysis model based on XLNet." Science Technology and Engineering, 2021, Vol. 21, No. 17, pp. 7200-7207. *

Also Published As

Publication number Publication date
CN113641809A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110413986A (en) A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN109508459B (en) Method for extracting theme and key information from news
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN110263325A (en) Chinese automatic word-cut
CN113641809B (en) Intelligent question-answering method based on XLnet model and knowledge graph
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
Zhang et al. n-BiLSTM: BiLSTM with n-gram Features for Text Classification
CN114756681B (en) Evaluation and education text fine granularity suggestion mining method based on multi-attention fusion
CN111476024A (en) Text word segmentation method and device and model training method
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN115223021A (en) Visual question-answering-based fruit tree full-growth period farm work decision-making method
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN114969269A (en) False news detection method and system based on entity identification and relation extraction
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN114372454A (en) Text information extraction method, model training method, device and storage medium
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
US20230376828A1 (en) Systems and methods for product retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant