CN107832326B - Natural language question-answering method based on deep convolutional neural network - Google Patents


Info

Publication number
CN107832326B
CN107832326B (application CN201710841026.2A)
Authority
CN
China
Prior art keywords
matrix
natural language
width
vector
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710841026.2A
Other languages
Chinese (zh)
Other versions
CN107832326A (en)
Inventor
来雨轩 (Yuxuan Lai)
冯岩松 (Yansong Feng)
贾爱霞 (Aixia Jia)
赵东岩 (Dongyan Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201710841026.2A priority Critical patent/CN107832326B/en
Publication of CN107832326A publication Critical patent/CN107832326A/en
Application granted granted Critical
Publication of CN107832326B publication Critical patent/CN107832326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a natural language question-answering method based on a deep convolutional neural network. The method comprises the following steps: 1) representing the natural language question and the pieces of information in a database information set as vectors with a sequence structure, forming vector matrices; 2) processing the vector matrices with a deep convolutional neural network to extract the corresponding deep semantic features; 3) calculating, from the deep semantic features, the semantic relevance between the natural language question and each piece of information in the database information set; 4) selecting information from the database information set according to the calculated semantic relevance and generating the answer to the natural language question. The invention better extracts deep, generalized semantic features and accurately locates the supporting information, thereby achieving better natural language question-answering results.

Description

Natural language question-answering method based on deep convolutional neural network
Technical Field
The invention relates to a method that uses a deep convolutional neural network to extract semantic features of natural language questions and candidate information, enhancing relevance calculation and thereby improving the accuracy of natural language question answering. It belongs to the field of natural language question answering.
Background
With the development of information technology and the Internet, information overload has become increasingly serious. Effectively understanding user needs and bridging the gap between the expression of a query and the expression of the existing information, so that the required information can be retrieved from a large volume of data, has become a very important problem.
A user's query typically takes the form of a question expressed in natural language. The resource database providing the answer information can take many forms. It may be a structured knowledge base composed of triples of the form (subject, predicate, object); for example, the triple (China, capital, Beijing) encodes the knowledge that the capital of China is Beijing. It may also be a text collection composed of ordinary natural-language sentences, drawn from platforms such as encyclopedias, news, and social media, or combinations thereof; for example, the sentence "I came to Beijing, the capital of China, to attend university." likewise contains the knowledge that the capital of China is Beijing. Similarly, the resource database may combine multiple forms of information. An important step in natural language question answering is evaluating the semantic relevance between the information in the resource database and the user's question, so as to select the most useful information for answering it.
Natural language questions are flexible and highly variable, and the information in a resource database is organized in complex ways, so effectively extracting features to compute the semantic relevance between candidate information and a natural language question is a challenging task. A convolutional neural network can automatically organize the structure between adjacent words, extract the overall semantic features of a text, and abstract and summarize its semantic information. A deep convolutional neural network, with more layers and a more complex structure, can process the semantics of a larger input window with fewer parameters and model them as deeper, more abstract feature representations. This helps handle the complex organization of natural language questions and candidate information and the inconsistency of expression between them, and represents the semantics of both in a unified feature space. As a result, the semantic relevance between questions and candidate information can be computed more accurately, improving the accuracy of natural language question answering.
Disclosure of Invention
The invention aims to provide a method that better extracts the semantic features of natural language questions and candidate information, to assist in calculating the semantic relevance between them and thereby improve the accuracy of natural language question answering. That is, for a natural language question q and a database information set D = {d_i}, a deep neural network extracts the corresponding feature vectors f_q and f_{d_i}, from which the similarity S = {s_qi} between the question q and each piece of database information d_i is calculated. Based on these scores, the pieces of information most relevant to the question are selected, and answers to the question are generated accordingly.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a natural language question-answering method based on a deep convolutional neural network comprises the following steps:
1) representing the natural language question and the pieces of information in the database information set as vectors with a sequence structure, forming vector matrices;
2) processing the vector matrices with a deep convolutional neural network to extract the corresponding deep semantic features;
3) calculating, from the deep semantic features, the semantic relevance between the natural language question and each piece of information in the database information set;
4) selecting information from the database information set according to the calculated semantic relevance and generating the answer to the natural language question.
The database information set in step 1) is either the original database information set or a reduced candidate information set obtained by information screening.
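The four steps above can be sketched end to end. The following is a minimal illustration, assuming toy hash-based token vectors and cosine relevance in place of the trained deep network; all function names are illustrative, not from the patent:

```python
import numpy as np

def embed(text, dim=8):
    """Step 1: represent a text as a matrix of token vectors (toy hashing)."""
    vecs = []
    for tok in text.split():
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vecs.append(rng.standard_normal(dim))
    return np.stack(vecs)

def deep_features(matrix):
    """Step 2 stand-in: global max pooling over the token axis."""
    return matrix.max(axis=0)

def relevance(fq, fd):
    """Step 3: cosine similarity between feature vectors."""
    return float(fq @ fd / (np.linalg.norm(fq) * np.linalg.norm(fd) + 1e-12))

def answer(question, database):
    """Step 4: return the database entry most relevant to the question."""
    fq = deep_features(embed(question))
    scores = [relevance(fq, deep_features(embed(d))) for d in database]
    return database[int(np.argmax(scores))]
```

In the real method, `deep_features` is the deep convolutional network of step 2) and `answer` would feed the selected entries into a natural-language generation step.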
The specific steps of the process of the invention are further illustrated below:
(1) For each question q and the whole database information set D, reduce the range of effective information by low-overhead means such as retrieval, and preliminarily sort the screened results, obtaining a reduced candidate information set D_q ⊆ D.
(2) Represent the natural language question and the candidate information as vector representations with a sequence structure that can serve as input to a deep convolutional neural network. That is, each input E ∈ {q} ∪ D_q can be expressed as a set of items T = {t_1, …, t_n}, where each item t_i has a vector representation and a total order can be defined on the item set T. E can then be written in matrix form according to this order: M_E = (v_1; v_2; …; v_n), where v_i is the vector representation of t_i and t_1 ≼ t_2 ≼ … ≼ t_n. Here ≼ denotes the order relation; it expresses an abstract, general notion of order and is not limited to naturally defined orders such as numerical or lexicographic order.
(3) Process the input matrix M_E with a multi-layer deep convolutional neural network with residual connections: M_0 = M_E, M_{i+1} = M_i + CNN(M_i), where CNN denotes a convolutional neural network. When M_i and M_{i+1} have different feature dimensions, the first term is processed by a width-1 convolution to adjust the feature dimension. Additional pooling layers may be inserted between layers to reduce the size of the feature matrix. The final feature vector of E is f_E = pooling(M_N), the global pooling of the last convolutional layer's output.
Residual connections add the output of a previous convolutional layer directly to the output of the current convolutional layer, and this sum replaces the current layer's output for subsequent processing; if the two tensors being added differ in feature dimension, a width-1 convolution adjusts the dimension. Such connections let the deep convolutional network adaptively adjust its effective depth during learning and reduce the adverse effect of network depth on gradient propagation.
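A minimal numeric sketch of such a residual step, assuming random stand-in weights and modeling the width-1 adjustment convolution as an independent linear map at each sequence position:

```python
import numpy as np

rng = np.random.default_rng(0)

def width1_conv(M, out_dim):
    """Width-1 convolution = independent linear map at each sequence position."""
    W = rng.standard_normal((M.shape[1], out_dim)) / np.sqrt(M.shape[1])
    return M @ W

def residual_step(M, conv_layer):
    """Add the previous output M to the conv output, projecting M if dims differ."""
    out = conv_layer(M)
    skip = M if M.shape[1] == out.shape[1] else width1_conv(M, out.shape[1])
    return skip + out

# Example: a conv layer (here itself sketched as width-1) that doubles features.
M0 = rng.standard_normal((5, 4))  # 5 tokens, 4 features
M1 = residual_step(M0, lambda M: np.tanh(width1_conv(M, 8)))
```

When the layer preserves the feature dimension, `residual_step` reduces to the plain addition M + CNN(M) described above.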
(4) From f_q and f_{d_i}, calculate a relevance score s_qi for each candidate information representation, giving S = {s_qi}, the relevance of each candidate to the question. Based on these scores, select several pieces of information from D_q as the main basis for generating the answer.
(5) Using a suitable natural language generation technique, generate the answer to the question from the selected information and its weights.
In step (1), it is also feasible not to screen the information if computational power is sufficient; however, since the scale of D tends to be large in question-answering tasks, it is usually necessary to narrow the candidate set by excluding irrelevant information with simple means. If the screening results D_q cannot be sorted, or no effective ranking is available, returning the candidates unordered is also feasible; this can be fully compensated by the subsequent training of the deep convolutional neural network.
In step (2), the vector representation may be a pre-trained low-dimensional dense semantic vector, such as a word vector trained with a neural language model or a word vector obtained by reducing a high-dimensional matrix via singular value decomposition (SVD) (e.g., the result of LSA, latent semantic analysis); it may also be an original high-dimensional sparse vector such as a one-hot vector. The total order may be a natural one, such as word order, or an artificially defined one; for example, for a full-segmentation result, items can be ordered first by the position of their initial character and then by the position of their final character. In fact, the deep convolutional network's requirement on the input is slightly weaker than a full total order: assuming the jth convolution kernel of the ith layer has width m_ij and the subsets of items involved in its computations are {T_ij}, it suffices that a total order can be defined within each T_ij, that an adjacency relation can be defined among the elements of {T_ij} themselves, and so on recursively through the layers. This requirement exists mainly so that the convolution operation can be carried out efficiently.
In step (3), the interlayer relationship of the deep convolutional network with residual connections can be written as:

M_{i+1} = CNN^1_{Σ_j k_ij}(M_i) + concat_j( CNN^{m_ij}_{k_ij}(M_i) )

where M_i is the feature matrix after the ith layer; CNN^1_{Σ_j k_ij} is a width-1 convolution used to adjust the possible difference in feature number between M_i and M_{i+1}, which reduces to the identity (and can be omitted) when Σ_j k_ij = Σ_j k_{i-1,j}; concat is a connection function joining a series of input tensors along their last component; m_ij is the width of the jth convolution kernel of the ith layer and k_ij the number of such kernels. CNN^m_k denotes a convolution operation with an activation function, width m and k filters, satisfying:

CNN^m_k(M_E) = CNN^m_k((v_1; v_2; …; v_n)) = (o_1; o_2; …; o_n)

where

o_i = g( W_m^T (v_{i-⌊m/2⌋}; …; v_{i+⌊m/2⌋}) + b_m )

and zero vectors are used as padding when a subscript of v is less than 1 or greater than n, i.e. v_j = 0 for j > n or j < 1. Here g denotes a nonlinear activation function such as the sigmoid or hyperbolic tangent function, W_m is the weight matrix used in the convolutional layer calculation, T denotes matrix transposition, and b_m is the bias vector of the convolutional layer, which may be omitted when appropriate.

Finally, after the N convolutional layers, the input can be characterized by:

f_E = pooling(M_N)

where pooling may be any global pooling operation, such as max pooling or average pooling.
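The width-m convolution with zero padding and the final global pooling can be sketched as follows, with tanh as the activation g and random placeholder weights instead of trained parameters:

```python
import numpy as np

def conv1d(M, W, b, g=np.tanh):
    """Apply a width-m convolution to M (n tokens x d features).

    W has shape (m*d, k): output position i is
    g(W^T [v_{i-m//2}; ...; v_{i+m//2}] + b), with zero vectors
    padding positions outside 1..n.
    """
    n, d = M.shape
    m = W.shape[0] // d
    padded = np.vstack([np.zeros((m // 2, d)), M,
                        np.zeros((m - 1 - m // 2, d))])
    windows = np.stack([padded[i:i + m].ravel() for i in range(n)])
    return g(windows @ W + b)

def global_max_pool(M):
    """f_E = pooling(M_N): one value per feature column."""
    return M.max(axis=0)

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 4))            # 6 tokens, 4 features
W = rng.standard_normal((3 * 4, 5)) * 0.1  # width 3, 5 filters
out = conv1d(M, W, np.zeros(5))
f = global_max_pool(out)
```

Because of the zero padding, the output has the same sequence length as the input, so layers of different kernel widths can be concatenated feature-wise as in the interlayer formula above.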
Besides the method above, step (3) can also be realized by any of the following three methods:
the method comprises the following steps:
Figure BDA0001410752010000043
wherein M isiThe characteristic matrix after the ith layer is taken as the characteristic matrix;
Figure BDA0001410752010000044
for adjusting MiAnd Mi+1The width of the difference of the characteristic numbers possibly existing between the two is 1, and the number of the filters is sigmajkijAt sigma, in the convolutional layer calculation ofjkij=∑jki-1jCan be ignored as constant conversion; concat is a connection function, and a series of tensors input are connected on the last component; flatten is a flattening function, and an input matrix is flattened into a vector; m isijIs the width, k, of the jth convolution kernel of the ith layerijFor the number of jth convolution kernels at the ith layer,
Figure BDA0001410752010000045
represents a width of m at one timeijThe number of filters is kijWith an activation function.
Method 2:

M_{i+1} = concat_j( CNN^{m_ij}_{k_ij}(M_i) ),  f_E = pooling(M_N)

where M_i is the feature matrix after the ith layer; concat is a connection function joining a series of input tensors along their last component; pooling is a pooling function that converts the input matrix into a vector by computing statistics such as the maximum, minimum, mean, or median of each row of the matrix; m_ij is the width of the jth convolution kernel of the ith layer, k_ij the number of such kernels, and CNN^{m_ij}_{k_ij} denotes a convolution operation with an activation function, width m_ij and k_ij filters.
Method 3:

M_{i+1} = concat_j( CNN^{m_ij}_{k_ij}(M_i) ),  f_E = flatten(M_N)

where M_i is the feature matrix after the ith layer; concat is a connection function joining a series of input tensors along their last component; flatten is a flattening function that flattens the input matrix into a vector; m_ij is the width of the jth convolution kernel of the ith layer, k_ij the number of such kernels, and CNN^{m_ij}_{k_ij} denotes a convolution operation with an activation function, width m_ij and k_ij filters.
If necessary, the feature vector f_E obtained above may be further transformed, e.g. processed by several fully connected neural network layers to obtain f'_E, which then replaces f_E in subsequent calculations and operations. One or more recurrent neural network layers (e.g., LSTM or GRU structures) may be added before or after any convolutional layer to further process the sequence of feature vectors. Pooling operations (e.g., any of the pooling operations above, or k-max pooling) may likewise be added before or after any convolutional layer. These optional operations can be chosen with reference to the characteristics of the data set used when implementing the method.
When training the multi-layer convolutional network, techniques such as dropout layers or L1/L2 regularization terms can be added to limit the model's expressive capacity, enhance generalization, and prevent overfitting. During training, to prevent deep convolution operations from drastically changing the distribution of the feature matrix as the number of layers grows, which would cause vanishing or exploding gradients, normalization can be applied between layers, e.g. normalizing part or all of the feature matrix by the mean and variance of its distribution over a training batch.
In step (4), there are many ways to calculate the score, for example:

Using cosine similarity directly:

s_qi = cos(f_q, f_{d_i}) = (f_q · f_{d_i}) / (‖f_q‖ ‖f_{d_i}‖)

Using a multi-layer perceptron on the concatenated features:

s_qi = (g_{N_f} ∘ … ∘ g_2 ∘ g_1)(concat(f_q, f_{d_i}))

where ∘ denotes function composition and each perceptron layer is g_i(x) = g(W_i x + b_i), i ∈ {1, 2, …, N_f}, with g a nonlinear activation function.

Or using the number of wins in pairwise comparisons, e.g.:

s_qi = num_{j≠i} P(d_i ≻ d_j | q)

where num indicates counting over j with j = i skipped, and P denotes a probability.

In fact, any function of the form s_qi = F(f_q, f_{d_i}) can be used to calculate the score.
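Two of the scoring options above can be sketched with random untrained weights; the single hidden layer of the MLP here is purely for illustration:

```python
import numpy as np

def cosine_score(fq, fd):
    """s_qi = cos(f_q, f_d): dot product over the norm product."""
    return float(fq @ fd / (np.linalg.norm(fq) * np.linalg.norm(fd)))

def mlp_score(fq, fd, W1, b1, w2, b2):
    """s_qi = w2 . g(W1 [f_q; f_d] + b1) + b2, with g = tanh."""
    h = np.tanh(W1 @ np.concatenate([fq, fd]) + b1)
    return float(w2 @ h + b2)

rng = np.random.default_rng(2)
fq, fd = rng.standard_normal(4), rng.standard_normal(4)
W1, b1 = rng.standard_normal((8, 8)), np.zeros(8)
w2, b2 = rng.standard_normal(8), 0.0
s_cos = cosine_score(fq, fq)   # identical vectors score 1
s_mlp = mlp_score(fq, fd, W1, b1, w2, b2)
```

In a trained system, W1, b1, w2, b2 would of course be learned jointly with the convolutional network rather than drawn at random.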
In step (5), the answer can be generated in many ways, usually depending closely on the specific question-answer data format and task setting. Simpler examples include taking the highest-scoring d_i ∈ D_q directly as the answer (when D is a text collection), or taking the object of the highest-scoring d_i as the answer (when D is a set of knowledge base triples). More complex methods, such as memory networks (MemNN), may also be used.
The invention also provides a server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method described above.
Compared with the prior art, the invention has the following positive effects:
the invention uses deep convolutional neural network to extract semantic features so as to better extract and summarize semantic representation of question and answer related information. In the natural language question-answering task, compared with the traditional method, the method can extract semantic information with higher level, so that the method can better adapt to the characteristics of expression flexibility, information organization inconsistency and the like in the natural language question-answering task, and can better evaluate the semantic correlation degree of the question and the candidate information so as to improve the effect of natural language question-answering.
Drawings
Fig. 1 is a framework diagram of a natural language question answering method in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention is based on the knowledge base data set and question set provided by the Chinese knowledge base question answering evaluation task of the Sixth Conference on Natural Language Processing and Chinese Computing (NLPCC 2017). Those skilled in the art will appreciate that other candidate information sets and question sets may be used in an implementation.
Specifically, this embodiment uses 32,110 question-answer pairs, of which 7,631 questions are used for testing and 24,479 for training, e.g.: "Do you know when Querdaloguburg was completed?". About 43 million pieces of information, organized as knowledge base triples, provide the candidate information, e.g.: (Querdaloguburg, completion time, 1890).
Fig. 1 is a schematic diagram of a natural language question-answering method based on a deep convolutional neural network according to an embodiment of the present invention, where the method includes the following steps:
step 1: and reducing the candidate information size according to the problem.
Given the characteristics of the data set, during information screening this embodiment requires that the subject of a candidate piece of information be a substring of the question. Rule-based features are then extracted and a gradient boosted decision tree (GBDT) model is trained on the training data to score all substrings, keeping only the top three substrings whose scores are no less than 1/100 of the top score as possible candidate entities. On average, only a few dozen database entries per question have a candidate entity as subject (e.g., the average number of candidate entries over the 7,631 test-set questions is 62.93).
Then, using a full-segmentation word attention model, the top 20 ranked pieces of information for each question are selected as its candidate information set D_q for subsequent processing. (See Yuxuan Lai, Yang Lin, Jianhao Chen, Yansong Feng, and Dongyan Zhao: Open Domain Question Answering System Based on Knowledge Base. In Proceedings of NLPCC 2016.)
Step 2: the question and the candidate information are represented as a representation of a sequence of vectors.
Both the question q and the triples in the candidate information set D_q are represented as word vector sequences. For a question, the words are the segmentation result of the remainder after the triple subject is removed; for a triple, the words are the segmentation result of its predicate. Each question and each candidate entry can thus be expressed as a word vector matrix. The word vectors were trained with the Google word2vec model on a Chinese encyclopedia corpus.
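A toy sketch of this lookup, turning a segmented text into the matrix M_E = (v_1; v_2; …; v_n) row by row; the tiny vocabulary and 4-dimensional vectors are illustrative stand-ins for the word2vec output:

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = {"首都": 0, "北京": 1, "中国": 2, "<unk>": 3}
E = rng.standard_normal((len(vocab), 4))  # embedding table, one row per word

def to_matrix(tokens):
    """Look up each token's vector; unknown tokens map to the <unk> row."""
    rows = [E[vocab.get(t, vocab["<unk>"])] for t in tokens]
    return np.stack(rows)

M = to_matrix(["中国", "首都", "北京"])  # 3 tokens -> 3 x 4 matrix
```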
Step 3: extract deep semantic features using a deep convolutional neural network.
After the word vector matrices are obtained, the deep convolutional neural network extracts deep semantic information. The network uses 2-3 convolutional layers, each with 256 convolution kernels of width 1, 512 of width 2, and 256 of width 3. Residual connections are added between the convolutional layers.
Step 4: calculate the semantic relevance from the semantic features.
After the semantic features are obtained, the features of the question and the candidate information are combined by elementwise multiplication, and the similarity score is then computed by a multi-layer perceptron with one hidden layer of 1024 nodes.
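A sketch of this fusion step, assuming random untrained weights and a hidden layer shrunk from 1024 to 16 nodes for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)

def fuse_and_score(fq, fd, hidden=16):
    """Elementwise product of the two feature vectors, then a 1-hidden-layer MLP.

    Weights are drawn fresh from rng here purely for illustration; a real
    system would learn them during training.
    """
    fused = fq * fd                                   # elementwise multiplication
    W1 = rng.standard_normal((hidden, fused.size)) * 0.1
    w2 = rng.standard_normal(hidden) * 0.1
    return float(w2 @ np.tanh(W1 @ fused))

fq, fd = rng.standard_normal(8), rng.standard_normal(8)
score = fuse_and_score(fq, fd)
```

The elementwise product keeps the combined feature the same size as each input vector, unlike concatenation, which doubles it.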
And 5: generating answers based on semantic relatedness selected information
Given the characteristics of the data used in this embodiment, the object string of the most relevant piece of information is selected directly as the answer to the question.
During model training, to deal with the uneven distribution of positive and negative samples in the question-answering task, each round samples negative examples using the model obtained in the previous round (the first round uses the result of step 1, or random sampling): negative samples that the previous model ranked closer to the positive samples are given a higher probability of being selected when generating the next round's training data. In essence, subsequent training specifically reinforces against the weaknesses of the previous round. The sampling probability p_{t_i} of a given negative sample at round t_i is a decreasing function of rank_{i-1}, the ranking of that sample among all candidates under the model of round t_{i-1}. In actual training, one time node is taken per iteration, for 7 rounds of training in total.
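A hedged sketch of this rank-driven negative sampling; the inverse-rank weighting below is an illustrative choice, not the patent's exact probability formula:

```python
import numpy as np

def sampling_probs(prev_ranks):
    """prev_ranks[i] = rank of negative i under the previous model (1 = best).

    Lower rank (closer to the positive sample) -> higher sampling weight.
    """
    weights = 1.0 / np.asarray(prev_ranks, dtype=float)
    return weights / weights.sum()

def sample_negatives(prev_ranks, n, rng):
    """Draw n negatives for the next training round, weighted by probability."""
    p = sampling_probs(prev_ranks)
    return rng.choice(len(prev_ranks), size=n, p=p)

rng = np.random.default_rng(4)
probs = sampling_probs([1, 2, 4, 8])
picks = sample_negatives([1, 2, 4, 8], n=5, rng=rng)
```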
Table 1 shows the natural language question-answering performance of the method; the evaluation metric is top-1 accuracy, i.e. the frequency with which the top-ranked answer to a question is the correct answer.

Number of layers:  1        2        3        5
Top-1 accuracy:    43.57%   43.82%   43.85%   42.13%

Table 1
As the table shows, within a certain range the deep convolutional neural network performs better as the number of model layers increases, indicating that the method extracts deep semantic information effectively, computes the semantic relevance between questions and candidate information more accurately, and thus achieves better question-answering results.
In summary, the embodiment of the invention builds a reliable natural language question-answering system on the knowledge base data set and question set provided by the Chinese knowledge base question answering evaluation task of NLPCC 2017. When selecting supporting information, the proposed method effectively extracts and summarizes the semantic features of questions and candidate information, computing their semantic relevance more accurately and achieving better question-answering results.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (6)

1. A natural language question-answering method based on a deep convolutional neural network is characterized by comprising the following steps:
1) representing the natural language question and the pieces of information in the database information set as vectors with a sequence structure, forming vector matrices;
2) processing the vector matrices with a deep convolutional neural network to extract the corresponding deep semantic features;
3) calculating, from the deep semantic features, the semantic relevance between the natural language question and each piece of information in the database information set;
4) selecting information from the database information set according to the calculated semantic relevance and generating the answer to the natural language question; wherein step 2) uses a multi-layer convolutional neural network to generate deep semantic features f_E from the vector matrix M_E of step 1), satisfying:
M_0 = M_E
M_{i+1} = CNN^1_{Σ_j k_ij}(M_i) + concat_j( CNN^{m_ij}_{k_ij}(M_i) )
f_E = pooling(M_N)
where M_i is the feature matrix after the ith layer; CNN^1_{Σ_j k_ij} denotes a convolution operation with an activation function, width 1 and Σ_j k_ij filters, used to adjust the possible difference in feature number between M_i and M_{i+1}, which reduces to the identity (and can be omitted) when Σ_j k_ij = Σ_j k_{i-1,j}; concat is a connection function joining a series of input tensors along their last component, the subscript j denoting connection over dimension j; pooling is a pooling function converting the input matrix into a vector by computing the maximum, minimum, mean, and median of each row of the matrix; m_ij is the width of the jth convolution kernel of the ith layer, k_ij the number of such kernels, and CNN^{m_ij}_{k_ij} denotes a convolution operation with an activation function, width m_ij and k_ij filters;
or, step 2) using a multilayer convolutional neural network to generate the vector matrix M according to step 1)EGenerating deep semantic features fENamely, the following conditions are satisfied:
M0=ME
Figure FDA0002944640850000014
fE=flatten(MN) Wherein M isiThe characteristic matrix after the ith layer is taken as the characteristic matrix;
Figure FDA0002944640850000015
for adjusting MiAnd Mi+1The width of the difference of the characteristic numbers possibly existing between the two is 1, and the number of the filters is sigmajkijAt sigma, in the convolutional layer calculation ofjkij=∑jki-1jCan be ignored as constant conversion; concat is a connection function, and a series of tensors input are connected on the last component; flatten is a flattening function, and an input matrix is flattened into a vector; m isijIs the width, k, of the jth convolution kernel of the ith layerijFor the number of jth convolution kernels at the ith layer,
Figure FDA0002944640850000016
represents a width of m at one timeijThe number of filters is kijConvolution operations with activation functions;
or, step 2) using a multilayer convolutional neural network to generate the vector matrix M according to step 1)EGenerating deep semantic features fENamely, the following conditions are satisfied:
M0=ME
Figure FDA0002944640850000021
fE=pooling(MN)
wherein M isiThe characteristic matrix after the ith layer is taken as the characteristic matrix; concat is a connection function, and a series of tensors input are connected on the last component; posing is a pooling function, and input matrixes are formed into a vector by calculating the maximum value, the minimum value, the average value and the median of each row of the matrix; m isijIs the width, k, of the jth convolution kernel of the ith layerijFor the number of jth convolution kernels at the ith layer,
Figure FDA0002944640850000022
represents a width of m at one timeijThe number of filters is kijConvolution operations with activation functions;
or, step 2) using a multilayer convolutional neural network to generate the vector matrix M according to step 1)EGenerating deep semantic features fENamely, the following conditions are satisfied:
M0=ME
Figure FDA0002944640850000023
fE=flatten(MN)
wherein M isiThe characteristic matrix after the ith layer is taken as the characteristic matrix; concat is a connection function, and a series of tensors input are connected on the last component; flatten is a flattening function, and an input matrix is flattened into a vector; m isijIs the width, k, of the jth convolution kernel of the ith layerijFor the number of jth convolution kernels at the ith layer,
Figure FDA0002944640850000024
represents a width of m at one timeijThe number of filters is kijWith an activation function.
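For illustration only (not part of the claims), the third alternative of claim 1 — several convolution widths m_ij per layer, concatenation of their outputs on the feature axis, and row-wise pooling of the final feature matrix M_N — can be sketched in NumPy. The ReLU activation, the random filter weights, and the truncation-based length alignment before concatenation are assumptions of this sketch, not details fixed by the claim:

```python
import numpy as np

def conv1d(M, W):
    """Valid convolution over a (length, channels) matrix M with a
    kernel tensor W of shape (width, channels, filters), ReLU applied."""
    m, _, k = W.shape
    L = M.shape[0] - m + 1
    out = np.empty((L, k))
    for t in range(L):
        # contract the (width, channels) window against W -> (filters,)
        out[t] = np.einsum('mc,mck->k', M[t:t + m], W)
    return np.maximum(out, 0.0)  # ReLU activation (an assumption)

def cnn_features(M_E, layer_specs, rng):
    """Each layer applies convolutions of several widths m_ij with k_ij
    filters each and concatenates the results on the feature axis; the
    final feature matrix is pooled per feature into the vector f_E."""
    M = M_E
    for widths in layer_specs:              # one (m_ij, k_ij) list per layer i
        parts = []
        for m_ij, k_ij in widths:
            W = rng.standard_normal((m_ij, M.shape[1], k_ij)) * 0.1
            parts.append(conv1d(M, W))
        L = min(p.shape[0] for p in parts)  # align lengths before concat
        M = np.concatenate([p[:L] for p in parts], axis=-1)
    # pooling: maximum, minimum, mean and median over the sequence axis
    return np.concatenate([M.max(0), M.min(0), M.mean(0), np.median(M, 0)])
```

With two layers of widths {2, 3} and {3}, a 20-token input with 8-dimensional embeddings yields a fixed-length f_E of 6 features × 4 pooling statistics, regardless of the input length.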
2. The method of claim 1, wherein the database information set of step 1) is an original database information set or a candidate information set with a reduced scope obtained by information screening.
3. The method of claim 1, wherein the deep semantic features f_E are processed by a plurality of fully connected layers to obtain a new vector representation f'_E of the sentence, and the semantic relevance between the natural language question and the information in the database information set is then calculated according to f'_E.
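As a minimal sketch of claim 3 (again illustrative only): the question features and the candidate features are each refined by fully connected layers into f'_E and then scored. Sharing one weight stack for both sides, the ReLU activation, and cosine similarity as the relevance function are assumptions here; the claim specifies none of them:

```python
import numpy as np

def fc_relevance(f_q, f_d, weights):
    """Refine the question features f_q and candidate features f_d with
    shared fully connected layers, then score relevance by cosine
    similarity (an illustrative choice of relevance function)."""
    def project(f):
        for W, b in weights:                # fully connected stack
            f = np.maximum(W @ f + b, 0.0)  # ReLU activation (assumed)
        return f
    q, d = project(f_q), project(f_d)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9))
```

Candidates in the database information set can then be ranked by this score, with the top-ranked entry used to generate the answer as in step 4) of claim 1.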
4. The method of claim 1, wherein additional pooling functions are added between convolutional layer operations.
5. The method of claim 1, wherein one or more layers of recurrent neural networks are added to process the feature matrix before or after any convolutional layer operation.
6. A server, characterized in that the server comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 5.
CN201710841026.2A 2017-09-18 2017-09-18 Natural language question-answering method based on deep convolutional neural network Active CN107832326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710841026.2A CN107832326B (en) 2017-09-18 2017-09-18 Natural language question-answering method based on deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN107832326A CN107832326A (en) 2018-03-23
CN107832326B true CN107832326B (en) 2021-06-08

Family

ID=61643392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710841026.2A Active CN107832326B (en) 2017-09-18 2017-09-18 Natural language question-answering method based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN107832326B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563782B * 2018-04-25 2023-04-18 Ping An Technology (Shenzhen) Co., Ltd. Commodity information format processing method and device, computer equipment and storage medium
CN109271926B * 2018-09-14 2021-09-10 Xidian University Intelligent radiation source identification method based on GRU deep convolutional network
CN109740126B * 2019-01-04 2023-11-21 Ping An Technology (Shenzhen) Co., Ltd. Text matching method and device, storage medium and computer equipment
CN111666482B * 2019-03-06 2022-08-02 Gree Electric Appliances Inc. of Zhuhai Query method and device, storage medium and processor
CN110348014B * 2019-07-10 2023-03-24 University of Electronic Science and Technology of China Semantic similarity calculation method based on deep learning
CN110516145B * 2019-07-10 2020-05-01 National University of Defense Technology Information searching method based on sentence vector coding
CN110990549B * 2019-12-02 2023-04-28 Tencent Technology (Shenzhen) Co., Ltd. Method, device, electronic equipment and storage medium for obtaining answer
CN112434152B * 2020-12-01 2022-10-14 Peking University Education choice question answering method and device based on multi-channel convolutional neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844368A * 2015-12-03 2017-06-13 Huawei Technologies Co., Ltd. Method for human-computer interaction, neural network system and user equipment
CN107066464A * 2016-01-13 2017-08-18 Adobe Inc. Semantic Natural Language Vector Space

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9904976B2 (en) * 2015-01-16 2018-02-27 Nec Corporation High performance portable convulational neural network library on GP-GPUs


Similar Documents

Publication Publication Date Title
CN107832326B (en) Natural language question-answering method based on deep convolutional neural network
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
Liu et al. Probabilistic reasoning via deep learning: Neural association models
Nie et al. Data-driven answer selection in community QA systems
Tandon et al. Webchild: Harvesting and organizing commonsense knowledge from the web
CN109783817A (en) A kind of text semantic similarity calculation model based on deeply study
CN105183833B (en) Microblog text recommendation method and device based on user model
Cohen et al. End to end long short term memory networks for non-factoid question answering
CN111737426B (en) Method for training question-answering model, computer equipment and readable storage medium
CN116134432A (en) System and method for providing answers to queries
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109344246B (en) Electronic questionnaire generating method, computer readable storage medium and terminal device
Panda Developing an efficient text pre-processing method with sparse generative Naive Bayes for text mining
CN116992007B (en) Limiting question-answering system based on question intention understanding
Wu et al. ECNU at SemEval-2017 task 3: Using traditional and deep learning methods to address community question answering task
CN110321421A (en) Expert recommendation method and computer storage medium for website Knowledge Community system
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
Wan Sentiment analysis of Weibo comments based on deep neural network
CN106681986A (en) Multi-dimensional sentiment analysis system
Polignano et al. Identification Of Bot Accounts In Twitter Using 2D CNNs On User-generated Contents.
Zhao et al. Interactive attention networks for semantic text matching
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN111581365B (en) Predicate extraction method
Liu et al. Attention based r&cnn medical question answering system in chinese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant