CN108052588B - Method for constructing automatic document question-answering system based on convolutional neural network - Google Patents
- Publication number
- CN108052588B (application CN201711309921.6A)
- Authority
- CN
- China
- Prior art keywords
- answer
- question
- vector
- neural network
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
A method for constructing a document automatic question-answering system based on a convolutional neural network comprises the following steps: step 1, constructing a topic document library; step 2, constructing a word vector model; step 3, topic matching; step 4, constructing a word vector matrix; step 5, semantic matching based on a semantic model of the convolutional neural network, which is divided into three layers: a convolutional neural network layer, an attention layer, and a fully connected layer; and step 6, an answer selection process in which a matching answer is selected. The method requires no manually constructed synonym dictionary, saving substantial labor and time cost; it can purposefully sample the semantics of word context during model training; and by adding an attention mechanism to the network, it increases the contribution of representative words to the whole-sentence semantics.
Description
Technical Field
The invention relates to the fields of natural language processing and artificial intelligence. Against the background of deep learning algorithms being applied to natural language processing at large scale, it provides a scheme that applies a convolutional neural network algorithm to semantic modeling and semantic matching of questions and answers.
Background
The most important technology in automatic question answering is sentence semantic matching. Most traditional methods construct scene-specific rule combinations on top of HowNet, large-scale dictionaries, and the HIT Tongyici Cilin (synonym forest) to perform the calculation. The advantages of such methods are that the calculation model is established quickly, the intrinsic semantics of words can be used effectively, and the model can be adjusted quickly; the drawback is that the semantics of word context, let alone of a whole sentence or paragraph, cannot be used effectively. Traditional methods therefore easily lose the contextual semantics of words, and the obtained results cannot accurately measure the matching degree between sentences.
Traditional methods for sentence semantic matching cannot effectively exploit the semantics of word context, demand heavy labor and time costs, fall short of current deep learning methods in matching quality, and struggle to meet enterprise demand for automatic question-answering technology in an internet economy with explosively growing data volumes.
Disclosure of Invention
The invention aims to provide a method for constructing a document automatic question-answering system based on a convolutional neural network. Therefore, the present invention adopts the following technical solutions.
A method for constructing a document automatic question-answering system based on a convolutional neural network comprises the following steps:
step 1, constructing a topic document library; establishing a topic document library according to different application scenarios, wherein the topic document library comprises k topic documents aimed at k types of questions; each topic document corresponds to one question type and is the candidate answer set of that question type;
step 2, constructing a word vector model; obtaining a corpus and training it with the word2vec tool to obtain a word vector model, wherein each word corresponds to one word vector of dimension L, the word vectors represent the distances between words in a multi-dimensional space, and the word vector model can accurately represent the semantic similarity between words;
step 3, topic matching; receiving a first question provided by a user, classifying the first question according to the topic document library constructed in step 1, and finding the first topic document corresponding to the first question, wherein the first topic document contains n1 alternative answers;
step 4, constructing a word vector matrix; dividing the first question into m words and constructing a first question matrix A according to the word vector model of step 2; for the first topic document obtained by topic matching in step 3, dividing each alternative answer in it into m words and constructing n1 first answer matrices Q according to the word vector model of step 2, each first answer matrix Q corresponding to one alternative answer; and constructing n1 word vector matrices M, where M = <A, Q>;
Step 5, semantic matching is carried out based on a semantic model of the convolutional neural network; the semantic model of the convolutional neural network is divided into three layers;
the first layer is a convolutional neural network layer; its input is a word vector matrix M, the width of its convolution kernels is the word vector dimension, and the number of convolution kernels is n2; inputting the word vector matrix M into the convolutional neural network layer yields an n2-dimensional question feature vector and an n2-dimensional answer feature vector; taking n3 convolution kernel heights yields n3 question feature vectors and n3 answer feature vectors; the n3 question feature vectors are combined into a question feature matrix, and the n3 answer feature vectors into an answer feature matrix;
the second layer is an attention layer, and the attention layer is used for weighting the question feature vector and the answer feature vector; the input of the attention layer is a question feature matrix and an answer feature matrix, and the output is a question sentence vector and an answer sentence vector;
the third layer is a full connection layer, the full connection layer is used for calculating the semantic matching degree between the question sentence vector and the answer sentence vector, and the semantic matching degree is expressed by a semantic matching degree value;
step 6, answer selection process; according to step 5, calculating n1 semantic matching degree values from the n1 word vector matrices M, and selecting a matching answer according to the semantic matching degree values.
Preferably, the weighting formula of the attention layer is: G = softmax(tanh(IW + b)e), l = GI; where I is the input feature matrix, W is a weighting matrix, b and e are weighting vectors, softmax is the normalization function, G is the attention weight vector, and l is the output sentence vector; substituting the question feature matrix and the answer feature matrix into this formula outputs the question sentence vector and the answer sentence vector.
Preferably, the word segmentation process of the first question and the first answer adopts a word segmentation method based on an N-gram model.
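The patent does not spell out the N-gram segmentation procedure; as a hedged sketch, overlapping character N-grams can be enumerated as follows (the function name and N = 2 are assumptions, not from the patent):

```python
def char_ngrams(text, n=2):
    """Return the list of overlapping character n-grams of `text`.

    A minimal sketch of N-gram-based segmentation: every window of n
    consecutive characters becomes one candidate token, so ambiguous
    word boundaries are all represented among the candidates.
    """
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("abcd"))  # → ['ab', 'bc', 'cd']
```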
Preferably, said step 6 comprises sorting the n1 semantic matching degree values and selecting the alternative answer corresponding to the maximum semantic matching degree value as the matching answer.
Preferably, said step 6 comprises sorting the n1 semantic matching degree values, setting a first threshold, and selecting as the matching answers the set of alternative answers whose semantic matching degree values exceed the first threshold.
The invention has the beneficial effects that: (1) the method of the present invention aims to learn "features" generated from a large set of input data, which play an important role in semantic modeling. (2) The semantics are represented by the vector model obtained by training, and a synonym dictionary does not need to be manually constructed, so that a large amount of labor and time cost is saved. (3) The semantics of the word context can be purposefully sampled during the training of the model. (4) And an attention mechanism is added into the network, so that the contribution degree of some representative words to the whole sentence semantics is improved.
Drawings
Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.
FIG. 2 is a schematic diagram of the structure of the document library of the present invention.
FIG. 3 is a schematic flow chart of the topic matching process of the present invention.
FIG. 4 is a flowchart illustrating an application environment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 to 4, a method for constructing a document automatic question-answering system based on a convolutional neural network includes the following steps:
Step 1, constructing a topic document library. A topic document library is established according to different application scenarios; it comprises k topic documents aimed at k types of questions, each topic document corresponding to one question type and serving as the candidate answer set of that question type. Each question type has its corresponding answer set, and a document can be regarded as the alternative answer set for a certain topic. A complete document library is built according to the question types the system serves. For example, "Do all goods in your store support nationwide warranty?" and "How do I return goods?" are questions about online shopping returns. The corresponding answer set contains sentences such as "With the warranty certificate and the local store invoice, you can enjoy nationwide warranty service." and "Goods can be returned without reason within 7 days of purchase.", each answer being one sentence. On this basis, the topic documents for a scenario are established per application scenario, each document corresponding to one question type.
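As an illustration, the topic document library described above can be held as a mapping from question type to candidate answer list (the topic names and answer strings below paraphrase the patent's examples and are otherwise assumptions):

```python
# Each topic document is the candidate answer set for one question type.
topic_library = {
    "returns": [
        "Goods can be returned without reason within 7 days of purchase.",
        "Return shipping is covered for quality problems.",
    ],
    "warranty": [
        "With the warranty certificate and the local store invoice, "
        "you can enjoy nationwide warranty service.",
    ],
}

k = len(topic_library)  # k topic documents for k question types
```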
Step 2, constructing a word vector model. A corpus is obtained and trained with the word2vec tool to obtain a word vector model in which each word corresponds to one word vector w of dimension L; the word vectors represent the distances between words in a multi-dimensional space, and the model can accurately represent semantic similarity between words; L may be 100. The Chinese corpora provided by Chinese Wikipedia and Sogou Labs, together with comment data crawled from Weibo, are collected, about 5 GB in total. These materials are trained with Google's open-source word2vec tool to obtain a word vector model in which each word corresponds to a vector used to represent distances between words in a multi-dimensional space; the model can accurately represent semantic similarity between Chinese words.
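Training itself is done with the word2vec tool; the sketch below only illustrates how a trained model is queried for word similarity, with a toy random dictionary standing in for the real model trained on the ~5 GB corpus (all words and vectors here are assumptions):

```python
import numpy as np

L = 100  # word vector dimension, as in the patent

rng = np.random.default_rng(0)
# Toy stand-in for a trained word2vec model: word -> L-dimensional vector.
model = {w: rng.standard_normal(L) for w in ["return", "refund", "invoice"]}

def cosine(u, v):
    """Cosine similarity, the usual word2vec measure of word closeness."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(model["return"], model["refund"])
assert -1.0 <= sim <= 1.0
```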
Step 3, topic matching. A first question provided by a user is received and classified according to the topic document library constructed in step 1, and the first topic document corresponding to the first question is found; the first topic document contains n1 alternative answers, where in general n1 should not exceed 50 per set, and here n1 is taken as 50. On the basis that step 1 has been completed, assume k documents have been created for the k types of questions. After receiving a question posed by a user, the question is classified to find the corresponding topic; based on this result, the answer best matching the question is found in that topic's document. To match the topic accurately, the invention provides two matching modes: (1) the user autonomously selects the topic, such as "returns" or "warranty"; (2) the system matches it automatically.
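The patent leaves the automatic matching mode unspecified; one plausible sketch classifies a question by cosine similarity to a per-topic centroid vector (the centroid approach, topic names, and all numeric values are assumptions):

```python
import numpy as np

def match_topic(question_vec, topic_centroids):
    """Pick the topic whose centroid vector is closest (by cosine
    similarity) to the averaged question vector."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(topic_centroids,
               key=lambda t: cos(question_vec, topic_centroids[t]))

# Illustrative 2-D centroids; a real system would average word vectors.
centroids = {"returns": np.array([1.0, 0.0]),
             "warranty": np.array([0.0, 1.0])}
print(match_topic(np.array([0.9, 0.1]), centroids))  # → returns
```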
Step 4, constructing a word vector matrix. The first question is divided into m words and a first question matrix A is constructed according to the word vector model of step 2. For the first topic document obtained by topic matching in step 3, each alternative answer in it is divided into m words and n1 first answer matrices Q are constructed according to the word vector model of step 2, each alternative answer corresponding to one first answer matrix; that is, an m × L first question matrix A and n1 m × L first answer matrices Q are formed, where n1 is taken as 50. Here w1, w2, …, wm are the word vectors corresponding to the m words after segmentation of the first question (for matrix A) or of an alternative answer (for matrix Q); a word vector represents the position of its word in the L-dimensional space. m may be 50; when a question or answer is too short to be divided into 50 words, the corresponding positions of the first question matrix A or a first answer matrix Q may be padded with 0 or with random numbers. Then n1 (i.e. 50) word vector matrices M are constructed, where M = <A, Q>. According to the result of topic matching in step 3, the first document D corresponding to the question is obtained, and D contains 50 alternative answers. This step segments the question and the answers into words, obtains the corresponding word vectors from the word vector model of step 2, and constructs 50 question-answer pairs, each represented by a word vector matrix M = <A, Q>, the word vector matrix corresponding to one question and one answer. Each <A, Q> is used as the input of the next step. To disambiguate the characters of the questions and answers as much as possible, a word segmentation method based on an N-gram model is adopted in the word segmentation process.
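A minimal sketch of building one m × L sentence matrix with the zero-padding option described above (truncation at m words and the zero vector for out-of-vocabulary words are assumptions):

```python
import numpy as np

def sentence_matrix(words, model, m=50, L=100):
    """Stack the word vectors of `words` into an m x L matrix,
    zero-padding short sentences (one of the two padding options the
    patent allows) and truncating long ones to exactly m rows."""
    M = np.zeros((m, L))
    for i, w in enumerate(words[:m]):
        # Unknown words map to the zero vector (an assumption).
        M[i] = model.get(w, np.zeros(L))
    return M

model = {"return": np.ones(100)}
A = sentence_matrix(["return", "goods"], model)
assert A.shape == (50, 100)
```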
Step 5, semantic matching is carried out based on a semantic model of the convolutional neural network; the semantic model of the convolutional neural network is divided into three layers;
The first layer is a convolutional neural network layer. Its input is a word vector matrix M; the width of each convolution kernel is the word vector dimension, and the number of convolution kernels is n2, where n2 is taken as 100. Inputting the word vector matrix M into the convolutional layer yields an n2-dimensional, i.e. 100-dimensional, question feature vector and a 100-dimensional answer feature vector. The height of a convolution kernel can be regarded as the height of a window that samples word context; because features sampled at different window heights differ somewhat, 4 kernel heights are used here, between 2 and 5, with 100 kernels of each height, so as to capture as many features as possible. The number of convolutional layers is generally 1. After sampling, max pooling and related operations are performed on each kernel's output to obtain a 100-dimensional feature vector; for one sentence, the 4 kernel heights yield 4 100-dimensional feature vectors, which are concatenated into a 4 × 100 matrix. Thus the 4 kernel heights produce 4 100-dimensional question feature vectors and 4 answer feature vectors; the 4 question feature vectors are combined into a 4 × 100 question feature matrix Ac, and the 4 answer feature vectors into a 4 × 100 answer feature matrix Qc.
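The convolution-and-pooling step above can be sketched in numpy; random weights stand in for trained kernels, and the tanh activation is an assumption, since the patent does not name the nonlinearity:

```python
import numpy as np

def conv_feature_matrix(M, kernels_by_height):
    """For each kernel height h, slide n2 kernels of shape (h, L) down
    the m x L matrix M, max-pool each kernel's response over all window
    positions, and stack the pooled vectors into an n3 x n2 matrix."""
    rows = []
    m, L = M.shape
    for h, kernels in kernels_by_height.items():  # kernels: (n2, h, L)
        # All vertical windows of height h: shape (m - h + 1, h, L).
        windows = np.stack([M[i:i + h] for i in range(m - h + 1)])
        # Response of every kernel at every position: (m - h + 1, n2).
        resp = np.tensordot(windows, kernels, axes=([1, 2], [1, 2]))
        rows.append(np.tanh(resp).max(axis=0))    # max pooling -> (n2,)
    return np.stack(rows)                          # (n3, n2)

rng = np.random.default_rng(0)
m, L, n2 = 50, 100, 100
kernels = {h: rng.standard_normal((n2, h, L)) * 0.1 for h in (2, 3, 4, 5)}
Ac = conv_feature_matrix(rng.standard_normal((m, L)), kernels)
assert Ac.shape == (4, 100)  # n3 = 4 heights, n2 = 100 kernels each
```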
The second layer is an attention layer, which weights the question feature vector and the answer feature vector. The attention mechanism assigns different weights to the word vectors corresponding to different words in the first question and the first answer, increasing the contribution of representative words to the whole-sentence semantics. The input of the attention layer is the question feature matrix Ac and the answer feature matrix Qc; the outputs are a question sentence vector a and an answer sentence vector b. The weighting formula of the attention layer is: G = softmax(tanh(IW + b)e), l = GI, where I is the 4 × 100 input feature matrix, W is a 100 × 100 weighting matrix, b and e are 100-dimensional weighting vectors, softmax is the normalization function, G is the attention weight vector, and l is the output sentence vector. W, b and e are initially random values and are optimized by the Adam stochastic optimization method during network training. The values of the 4-dimensional vector G obtained after softmax normalization sum to 1. Substituting the question feature matrix Ac and the answer feature matrix Qc into the formula outputs the 100-dimensional question sentence vector a and answer sentence vector b.
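The attention formula can be checked numerically with the shapes given above; random parameters stand in for the trained W, b, and e:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(I, W, b, e):
    """G = softmax(tanh(I W + b) e), l = G I, with the patent's shapes:
    I is 4 x 100, W is 100 x 100, b and e are 100-dimensional, so G is
    a 4-dimensional weight vector and l a 100-dimensional sentence vector."""
    G = softmax(np.tanh(I @ W + b) @ e)  # (4,) attention weights, sum to 1
    return G @ I, G                       # sentence vector l and weights G

rng = np.random.default_rng(0)
I = rng.standard_normal((4, 100))
W = rng.standard_normal((100, 100)) * 0.01  # stand-ins for trained values
b, e = rng.standard_normal(100), rng.standard_normal(100)
l, G = attention(I, W, b, e)
assert l.shape == (100,) and abs(G.sum() - 1.0) < 1e-9
```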
The third layer is a fully connected layer, which calculates the semantic matching degree between the question sentence vector a and the answer sentence vector b; the semantic matching degree is expressed as a value in [0, 1] representing the normalized cosine similarity. The semantic matching degree can be calculated from the distance between the question sentence vector a and the answer sentence vector b.
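The patent describes the score as a cosine similarity normalized into [0, 1]; one mapping consistent with that description, (cos + 1) / 2, is sketched below (the exact normalization is an assumption):

```python
import numpy as np

def matching_degree(a, b):
    """Semantic matching degree as normalized cosine similarity in [0, 1]:
    (cos(a, b) + 1) / 2. The specific mapping from cosine to [0, 1] is
    an assumption consistent with the patent's description."""
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return (c + 1.0) / 2.0

a = np.array([1.0, 0.0])
assert matching_degree(a, a) == 1.0   # identical vectors score 1
assert matching_degree(a, -a) == 0.0  # opposite vectors score 0
```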
Step 6, the answer selection process. According to step 5, 50 semantic matching degree values are calculated from the 50 word vector matrices M, and a matching answer is selected according to these values. Either the 50 semantic matching degree values are sorted and the alternative answer with the maximum value is selected as the matching answer; or the 50 values are sorted, a first threshold is set, and the set of alternative answers whose values exceed the first threshold is selected as the matching answers.
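The two selection strategies of step 6 can be sketched as follows (the function name and sample data are illustrative):

```python
def select_answer(answers, scores, threshold=None):
    """Answer selection as in step 6: either the single best-scoring
    alternative answer, or, when a threshold is given, every answer
    whose semantic matching degree exceeds the threshold."""
    ranked = sorted(zip(scores, answers), reverse=True)
    if threshold is None:
        return ranked[0][1]
    return [ans for s, ans in ranked if s > threshold]

answers = ["A1", "A2", "A3"]
scores = [0.2, 0.9, 0.6]
assert select_answer(answers, scores) == "A2"
assert select_answer(answers, scores, threshold=0.5) == ["A2", "A3"]
```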
Missing questions and documents can be remedied. If the system successfully matches the user's question with an answer, the round of question answering is considered successfully completed; otherwise, the background records the event as a basis for later improving the document library. The data to be recorded are: all questions, the document the system matched to each question, questions for which the system failed to match a document, and questions for which the system failed to match an answer within the document. On the basis of these data, the system operator can supplement the corresponding documents and answers.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical scope of the present invention shall be covered by the claims of the present invention.
Claims (4)
1. A method for constructing a document automatic question-answering system based on a convolutional neural network is characterized by comprising the following steps:
step 1, constructing a topic document library; establishing a topic document library according to different application scenarios, wherein the topic document library comprises k topic documents aimed at k types of questions; each topic document corresponds to one question type and is the candidate answer set of that question type;
step 2, constructing a word vector model; obtaining a corpus and training it with the word2vec tool to obtain a word vector model, wherein each word corresponds to one word vector of dimension L, the word vectors represent the distances between words in a multi-dimensional space, and the word vector model can accurately represent the semantic similarity between words;
step 3, topic matching; receiving a first question provided by a user, classifying the first question according to the topic document library constructed in step 1, and finding the first topic document corresponding to the first question, wherein the first topic document contains n1 alternative answers;
step 4, constructing a word vector matrix; dividing the first question into m words and constructing a first question matrix A according to the word vector model of step 2; for the first topic document obtained by topic matching in step 3, dividing each alternative answer in it into m words and constructing n1 first answer matrices Q according to the word vector model of step 2, each first answer matrix Q corresponding to one alternative answer; and constructing n1 word vector matrices M, where M = <A, Q>;
Step 5, semantic matching is carried out based on a semantic model of the convolutional neural network; the semantic model of the convolutional neural network is divided into three layers;
the first layer is a convolutional neural network layer; its input is a word vector matrix M, the width of its convolution kernels is the word vector dimension, and the number of convolution kernels is n2; inputting the word vector matrix M into the convolutional neural network layer yields an n2-dimensional question feature vector and an n2-dimensional answer feature vector; taking n3 convolution kernel heights yields n3 question feature vectors and n3 answer feature vectors; the n3 question feature vectors are combined into a question feature matrix, and the n3 answer feature vectors into an answer feature matrix;
the second layer is an attention layer, and the attention layer is used for weighting the question feature vector and the answer feature vector; the input of the attention layer is a question feature matrix and an answer feature matrix, and the output is a question sentence vector and an answer sentence vector;
the third layer is a full connection layer, the full connection layer is used for calculating the semantic matching degree between the question sentence vector and the answer sentence vector, and the semantic matching degree is expressed by a semantic matching degree value;
step 6, answer selection process; according to step 5, calculating n1 semantic matching degree values from the n1 word vector matrices M, and selecting a matching answer according to the semantic matching degree values;
the input of the attention layer is the question feature matrix Ac and the answer feature matrix Qc, and the outputs are a question sentence vector a and an answer sentence vector b; the weighting formula of the attention layer is: G = softmax(tanh(IW + b)e), l = GI; where I is the 4 × 100 input feature matrix, W is a 100 × 100 weighting matrix, b and e are 100-dimensional weighting vectors, softmax is the normalization function, G is the attention weight vector, and l is the output sentence vector; W, b and e are initially random values and are optimized by the Adam stochastic optimization method during network training; the values of the 4-dimensional vector G obtained by softmax normalization sum to 1; substituting the question feature matrix Ac and the answer feature matrix Qc into the formula outputs the 100-dimensional question sentence vector a and answer sentence vector b.
2. The method for constructing the convolutional neural network-based document automatic question-answering system according to claim 1, wherein a word segmentation method based on an N-gram model is adopted in a word segmentation process performed on the first question and the first answer.
3. The method for constructing a convolutional neural network-based document automatic question-answering system according to claim 1 or 2, wherein the step 6 comprises sorting the n1 semantic matching degree values and selecting the alternative answer corresponding to the maximum semantic matching degree value as the matching answer.
4. The method for constructing a convolutional neural network-based document automatic question-answering system according to claim 1 or 2, wherein the step 6 comprises sorting the n1 semantic matching degree values, setting a first threshold, and selecting as the matching answers the set of alternative answers whose semantic matching degree values exceed the first threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711309921.6A CN108052588B (en) | 2017-12-11 | 2017-12-11 | Method for constructing automatic document question-answering system based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108052588A CN108052588A (en) | 2018-05-18 |
CN108052588B true CN108052588B (en) | 2021-03-26 |
Family
ID=62123968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711309921.6A Active CN108052588B (en) | 2017-12-11 | 2017-12-11 | Method for constructing automatic document question-answering system based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108052588B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597966A (en) * | 2018-05-23 | 2019-12-20 | 北京国双科技有限公司 | Automatic question answering method and device |
CN108763535B (en) * | 2018-05-31 | 2020-02-07 | 科大讯飞股份有限公司 | Information acquisition method and device |
CN108875074B (en) * | 2018-07-09 | 2021-08-10 | 北京慧闻科技发展有限公司 | Answer selection method and device based on cross attention neural network and electronic equipment |
CN109145290B (en) * | 2018-07-25 | 2020-07-07 | 东北大学 | Semantic similarity calculation method based on word vector and self-attention mechanism |
CN109086386B (en) * | 2018-07-26 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium |
CN109522920B (en) * | 2018-09-18 | 2020-10-13 | 义语智能科技(上海)有限公司 | Training method and device of synonymy discriminant model based on combination of semantic features |
CN111177328B (en) * | 2018-11-12 | 2023-04-28 | 阿里巴巴集团控股有限公司 | Question-answer matching system and method, question-answer processing device and medium |
CN109582773B (en) * | 2018-11-29 | 2020-11-27 | 深圳爱问科技股份有限公司 | Intelligent question-answer matching method and device |
CN111382573A (en) * | 2018-12-12 | 2020-07-07 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and storage medium for answer quality assessment |
CN109815318A (en) * | 2018-12-24 | 2019-05-28 | 平安科技(深圳)有限公司 | The problems in question answering system answer querying method, system and computer equipment |
CN111382264B (en) * | 2018-12-27 | 2023-06-09 | 阿里巴巴集团控股有限公司 | Session quality evaluation method and device and electronic equipment |
CN109766423A (en) * | 2018-12-29 | 2019-05-17 | 上海智臻智能网络科技股份有限公司 | Answering method and device neural network based, storage medium, terminal |
CN109766424B (en) * | 2018-12-29 | 2021-11-19 | 安徽省泰岳祥升软件有限公司 | Filtering method and device for reading understanding model training data |
CN109783817B (en) * | 2019-01-15 | 2022-12-06 | 浙江大学城市学院 | Text semantic similarity calculation model based on deep reinforcement learning |
CN114298310A (en) * | 2019-01-29 | 2022-04-08 | 北京金山数字娱乐科技有限公司 | Length loss determination method and device |
CN111611355A (en) * | 2019-02-25 | 2020-09-01 | 北京嘀嘀无限科技发展有限公司 | Dialog reply method, device, server and storage medium |
CN110046244B (en) * | 2019-04-24 | 2021-06-08 | 中国人民解放军国防科技大学 | Answer selection method for question-answering system |
CN110298037B (en) * | 2019-06-13 | 2023-08-04 | 同济大学 | Convolutional neural network matching text recognition method based on enhanced attention mechanism |
CN110956039A (en) * | 2019-12-04 | 2020-04-03 | 中国太平洋保险(集团)股份有限公司 | Text similarity calculation method and device based on multi-dimensional vectorization coding |
CN111192682B (en) * | 2019-12-25 | 2024-04-09 | 上海联影智能医疗科技有限公司 | Image exercise data processing method, system and storage medium |
CN111488438B (en) * | 2020-02-21 | 2022-07-29 | 天津大学 | Question-answer matching attention processing method, computer equipment and storage medium |
CN111831789B (en) * | 2020-06-17 | 2023-10-24 | 广东工业大学 | Question-answering text matching method based on multi-layer semantic feature extraction structure |
CN112597311B (en) * | 2020-12-28 | 2023-07-11 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-orbit satellite communication |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
CN106776545A (en) * | 2016-11-29 | 2017-05-31 | 西安交通大学 | Method for computing similarity between short texts using a deep convolutional neural network |
CN106897268A (en) * | 2017-02-28 | 2017-06-27 | 科大讯飞股份有限公司 | Text semantic understanding method, device and system |
CN107368547A (en) * | 2017-06-28 | 2017-11-21 | 西安交通大学 | Intelligent medical automatic question-answering method based on deep learning |
CN107391623A (en) * | 2017-07-07 | 2017-11-24 | 中国人民大学 | Knowledge graph embedding method fusing multiple kinds of background knowledge |
- 2017-12-11: Chinese application CN201711309921.6A, granted as patent CN108052588B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN108052588A (en) | 2018-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052588B (en) | Method for constructing automatic document question-answering system based on convolutional neural network | |
US11176328B2 (en) | Non-factoid question-answering device | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN109783817B (en) | Text semantic similarity calculation model based on deep reinforcement learning | |
CN110110062B (en) | Machine intelligent question and answer method and device and electronic equipment | |
CN109670191B (en) | Calibration optimization method and device for machine translation and electronic equipment | |
CN111597314B (en) | Reasoning question-answering method, device and equipment | |
CN108549658B (en) | Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree | |
TW202009749A (en) | Human-machine dialog method, device, electronic apparatus and computer readable medium | |
WO2020062770A1 (en) | Method and apparatus for constructing domain dictionary, and device and storage medium | |
CN111259127B (en) | Long text answer selection method based on transfer learning sentence vector | |
CN110083693B (en) | Robot dialogue reply method and device | |
CN107870964B (en) | Statement ordering method and system applied to answer fusion system | |
CN108763535B (en) | Information acquisition method and device | |
CN110516245A (en) | Fine-grained sentiment analysis method, apparatus, computer equipment and storage medium |
CN110008327B (en) | Legal answer generation method and device | |
CN107480196B (en) | Multi-modal vocabulary representation method based on dynamic fusion mechanism | |
CN110516070B (en) | Chinese question classification method based on text error correction and neural network | |
CN111914067A (en) | Chinese text matching method and system | |
CN110162596B (en) | Training method and device for natural language processing, automatic question answering method and device | |
CN107832297B (en) | Feature word granularity-oriented domain emotion dictionary construction method | |
CN111460145A (en) | Learning resource recommendation method, device and storage medium | |
CN111507093A (en) | Text attack method and device based on similar dictionary and storage medium | |
CN112632253A (en) | Answer extraction method and device based on graph convolution network and related components | |
CN109977203A (en) | Sentence similarity determination method, apparatus, electronic device, and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2022-07-12
Address after: 310015 No. 51, Huzhou street, Hangzhou, Zhejiang
Patentee after: Zhejiang University City College
Address before: 310015 No. 51, Huzhou street, Gongshu District, Hangzhou, Zhejiang
Patentee before: Zhejiang University City College