CN108345585A

CN108345585A - Automatic question-answering method based on deep learning

Info

Publication number: CN108345585A
Application number: CN201810026979.8A
Authority: CN
Inventors: 张引; 张扬扬; 金哲
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-01-11
Filing date: 2018-01-11
Publication date: 2018-07-31

Abstract

The invention discloses an automatic question-answering method based on deep learning, which aims to provide users with an algorithm-based, fully automatic question-answering solution. The invention uses question-answer pairs crawled from websites as its data source and can therefore answer questions of complex form. Building on traditional similar-question retrieval, the invention represents the text of each question as a vector using a BOW model, a TFIDF model and a Word2Vec model, and re-ranks and filters the similar questions by computing the similarity between these vectors. This introduces semantic knowledge, addresses the semantic gap in traditional question retrieval, and improves the validity of the candidate answers. In addition, the invention uses a trained neural network model to score the match between a question and each candidate answer. The model automatically extracts high-level matching features between question and answer and automatically produces the answer to the question, which improves the accuracy of the automatic question-answering system while reducing manual intervention and system development cost.

Description

An automatic question-answering method based on deep learning

Technical field

The present invention relates to information retrieval, text representation methods, text similarity computation and automatic question answering in the field of natural language processing, and in particular to an automatic question-answering method based on deep learning.

Background art

With the rapid development of the Internet, a large number of electronic documents have appeared online. When a user looks for the answer to a question, traditional information retrieval cannot give the answer directly and can only return thousands of web links. Automatic question-answering technology, which can automatically provide the best answer, has therefore attracted growing attention. Research on automatic question answering falls broadly into template-matching methods, information-retrieval methods and deep-learning methods.

Template-matching methods require a large number of manually written templates, which is costly, and they adapt poorly to new data. Once a match succeeds, however, the answer quality is relatively high. Early expert systems mainly used template matching.

Information-retrieval methods mainly study text similarity computation. They convert question answering into a mathematical computation problem: features are designed to find the stored questions most similar to the user's question and obtain candidate answers, and answer features are then designed to select the best answer. For question retrieval, the most commonly used similarity measures, such as BM25 and TFIDF, are based on character overlap. They suffer from a semantic gap and cannot truly capture the meaning of the text.

Deep-learning methods automatically capture high-level semantic features of text from large amounts of text data. They have developed rapidly in recent years: many neural networks based on convolutional neural networks (CNN), long short-term memory networks (LSTM) and recurrent neural networks (RNN) have been proposed in succession, and model performance has improved greatly.

Summary of the invention

The purpose of the present invention is to re-rank the retrieved similar questions using a text representation method, take the answers of the selected similar questions as candidate answers, and use a neural network model to score the match between the user's question and each candidate answer, thereby obtaining the best answer.

To achieve the above purpose, the present invention adopts the following technical solution:

An automatic question-answering method based on deep learning, comprising the following steps:

1) Crawl question-answer data for the relevant field from the Internet, obtain question-answer pairs through surface-level cleaning, store them in a relational database, and build a full-text retrieval service.

2) Segment all question-answer data using a Chinese word segmentation tool, which includes adding a user-defined dictionary, segmenting the text and removing stop words.

3) Train a Word2Vec model on the segmented question-answer data from step 2), and use the Word2Vec model to obtain the word vector of each word and the related words of each word.

4) Build a BOW (bag-of-words) model on the segmented question-answer data from step 2) and use it to obtain a one-hot BOW vector for each word; also build a TFIDF model on the segmented question-answer data to obtain the TFIDF value of each word in each question text.

5) Build a text representation vector for every question using the TFIDF values, the BOW vectors and the Word2Vec related words.

6) Perform full-text retrieval on the user's question to obtain similar questions, compute the cosine similarity between the text representation vectors of step 5) for these similar questions, and after re-ranking obtain the candidate answers to the question.

7) Train a neural-network-based question-answer matching model that scores the degree of match between a question and a candidate answer.

8) When a user asks a question, segment it as in step 2), build its text representation vector as in step 5), obtain candidate answers as in step 6), and use the model of step 7) to select the candidate answer with the highest matching score as the final answer.
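
Step 4) does not state which TFIDF variant is used. As a minimal sketch, assuming the common formulation tf × log(N / df) over segmented question texts, the per-question TFIDF values could be computed as follows. The function name `tfidf_per_question` and the toy corpus are illustrative and do not come from the patent.

```python
import math
from collections import Counter

def tfidf_per_question(questions):
    """Return one {word: tfidf} dict per segmented question.

    Assumption: tf is the relative frequency of the word in the question
    and idf is log(N / df); the patent does not specify the variant.
    """
    n_docs = len(questions)
    # Document frequency: number of questions that contain each word.
    df = Counter(word for q in questions for word in set(q))
    result = []
    for q in questions:
        counts = Counter(q)
        total = len(q)
        result.append({
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return result

# Toy corpus of already-segmented questions (hypothetical data).
questions = [
    ["ginger", "treatment", "alopecia"],
    ["ginger", "tea", "benefit"],
    ["alopecia", "cause"],
]
weights = tfidf_per_question(questions)
```

Words that occur in fewer questions receive a larger weight, so in the third toy question "cause" outweighs "alopecia".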

More specifically, the training parameters of the Word2Vec model used to build word vectors for the question-answer data are set as follows: the Skip-gram algorithm is used, the output word vector dimension is 200, the training window size is 5, the minimum word frequency is 5, and the sampling threshold is 10^-4.
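
These settings map onto the Gensim `Word2Vec` constructor roughly as sketched below. The keyword names follow Gensim 4.x (`vector_size`; releases before 4.0 used `size`), which is an assumption, since the patent does not state the Gensim version. `W2V_PARAMS`, `train_word2vec` and `related_words` are hypothetical names introduced for illustration.

```python
# Word2Vec settings from the text, expressed as Gensim 4.x keyword arguments.
W2V_PARAMS = {
    "sg": 1,             # 1 selects Skip-gram (0 would select CBOW)
    "vector_size": 200,  # output word vector dimension
    "window": 5,         # training window size
    "min_count": 5,      # minimum word frequency
    "sample": 1e-4,      # sampling (downsampling) threshold
}

def train_word2vec(segmented_texts):
    """Train on an iterable of token lists; requires Gensim to be installed."""
    # Imported lazily so the configuration can be inspected without Gensim.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=segmented_texts, **W2V_PARAMS)

def related_words(model, word, topn=10):
    """Return the topn words most similar to `word` (names only)."""
    return [w for w, _score in model.wv.most_similar(word, topn=topn)]
```

With a trained model, `related_words(model, word)` would supply the 10 related words per word that the text representation step relies on.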

The specific method for building text representation vectors for all questions from the TFIDF values, BOW vectors and Word2Vec related words is as follows. First, for each word in each question, its BOW vector is multiplied by its TFIDF value to obtain the vector W1. At the same time, Word2Vec is used to obtain the 10 words most related to that word; the BOW vectors of these 10 related words are each multiplied by the introduction coefficient a = 0.1 and summed to obtain W2. W1 and W2 are summed to obtain the word vector W3. Then, for each question, the word vectors W3 of its words are summed to obtain the text representation vector.
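
The construction of W1, W2 and W3 can be sketched with sparse vectors stored as `{dimension index: weight}` dictionaries, where a one-hot BOW vector has a single non-zero entry. This is an illustrative sketch only: the function names, the toy dictionary and the choice to skip related words that are missing from the dictionary are assumptions, not details given in the patent.

```python
from collections import defaultdict

A = 0.1  # introduction coefficient a from the text

def word_vector(word, tfidf, related, vocab):
    """W3 = W1 + W2 as a sparse {dimension index: weight} dict."""
    w3 = defaultdict(float)
    # W1: one-hot BOW vector of the word scaled by its TFIDF value.
    w3[vocab[word]] += tfidf
    # W2: one-hot BOW vectors of the related words, each scaled by a, summed.
    for r in related:
        if r in vocab:  # assumption: out-of-dictionary related words are skipped
            w3[vocab[r]] += A
    return w3

def text_vector(tokens, tfidf_values, related_words, vocab):
    """Sum the W3 vectors of all in-dictionary words of one question."""
    vec = defaultdict(float)
    for tok in tokens:
        if tok not in vocab:
            continue
        w3 = word_vector(tok, tfidf_values.get(tok, 0.0),
                         related_words.get(tok, []), vocab)
        for dim, weight in w3.items():
            vec[dim] += weight
    return dict(vec)

# Hypothetical toy inputs; in the method, each word has 10 Word2Vec related words.
vocab = {"ginger": 0, "treatment": 1, "alopecia": 2, "hair": 3}
tfidf_values = {"ginger": 0.5, "alopecia": 0.8}
related_words = {"alopecia": ["hair", "baldness"], "ginger": ["treatment"]}
vec = text_vector(["ginger", "alopecia"], tfidf_values, related_words, vocab)
```

In the toy example the question vector gains small weights on "treatment" and "hair" even though neither word appears in the question, which is how related words carry semantic knowledge into the representation.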

For a question posed by the user, full-text retrieval is first used to obtain 500 similar questions; similarity computation based on the text representation then selects the answers of the top 100 questions as candidate answers; finally, the deep-learning-based question-answer matching model selects the answer with the highest matching score and returns it to the user.
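
The three-stage funnel described above (retrieve 500, re-rank to 100, pick the best match) can be sketched as follows. The three callables `retrieve`, `similarity` and `match_score` are hypothetical stand-ins for the full-text search, the text-representation similarity and the neural matching model; the toy components below exist only to make the sketch runnable.

```python
def answer_question(question, retrieve, similarity, match_score,
                    m=500, n=100):
    """Three-stage funnel: retrieve m, re-rank to n, pick the best match.

    retrieve(question, m)   -> list of (similar_question, answer) pairs
    similarity(q1, q2)      -> float, higher means more similar
    match_score(q, answer)  -> float, higher means a better match
    """
    hits = retrieve(question, m)
    # Re-rank the retrieved questions by similarity to the user question.
    ranked = sorted(hits, key=lambda qa: similarity(question, qa[0]),
                    reverse=True)
    candidates = [answer for _q, answer in ranked[:n]]
    if not candidates:
        return None
    # Final choice: the candidate answer with the highest matching score.
    return max(candidates, key=lambda a: match_score(question, a))

# Toy usage with trivial stand-in components (hypothetical data).
store = [("does ginger help hair", "A1"), ("is tea healthy", "A2"),
         ("ginger for hair loss", "A3")]

def overlap(a, b):
    """Number of shared words, used here as a stand-in similarity."""
    return len(set(a.split()) & set(b.split()))

best = answer_question(
    "ginger hair loss",
    retrieve=lambda q, m: store[:m],
    similarity=overlap,
    match_score=lambda q, a: {"A1": 0.2, "A2": 0.9, "A3": 0.7}[a],
    m=3, n=2,
)
```

In the toy run, answer "A2" has the highest matching score but is never scored, because its question is dropped at the re-ranking stage; narrowing the candidate set before the more expensive matching step is the point of the funnel.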

The Chinese word segmentation tool is the Jieba tool for Python; the Word2Vec model uses the Gensim tool for Python; the neural network model is built with the TensorFlow tool.

The neural network structure of the question-answer matching model is defined as shown in the following table.

Network depth	Name	Kernel size/stride	Number of kernels	Fully connected neurons
1	Embedding	/	/	/
2	BiLSTM-1	/	/	/
3	Conv-1	3x3/1	128	/
4	Conv-2	3x3/1	256	/
5	Conv-3	3x3/1	256	/
6	BiLSTM-2	/	/	/
7	FC-1	/	/	4096
8	FC-2	/	/	4096

Compared with the prior art, the present invention has the following beneficial effects:

1) When the text representation vector is built from the text of a question, related words are introduced for a joint representation, so the resulting text representation vector carries semantic knowledge and can address the semantic gap.

2) On top of traditional similar-question retrieval, a re-ranking step with semantic knowledge is added. This improves the relevance of the similar questions and narrows the candidate answer set, which reduces the computation required by subsequent processing and improves the accuracy and efficiency of the automatic question-answering system.

3) Because question-answer pairs are used as the data source, questions of complex form can be answered, and model performance can improve as the data set grows.

4) The question-answer matching model is trained with a deep-learning-based method, so matching features between a question and candidate answers are extracted automatically and the answer is given automatically, without manual intervention, which reduces system development cost.

5) The steps set forth in the present invention are general and applicable to building automatic question-answering systems for different fields.

Description of the drawings

Fig. 1 is an overall flow chart of the automatic question-answering method based on deep learning.

Fig. 2 is an example of the Word2Vec word vectors generated in the embodiment.

Fig. 3 shows the system output in the embodiment.

Detailed description

The invention is described in further detail below with reference to a specific example and the accompanying drawings.

As shown in Fig. 1, the solid-line part is the construction stage of the system, and the dashed-line part corresponds to the usage stage of the user.

The system construction stage is divided into two parts: similar-question retrieval based on text representation, and the question-answer matching model based on deep learning. The two parts can be built in either order.

The construction of similar-question retrieval based on text representation is described as follows:

1) Use the Requests and BeautifulSoup tools for Python to crawl and parse question-answer data for the relevant field from the Internet. First perform surface-level cleaning on the question-answer data, which includes deleting unusual characters, formatting information and additional metadata, limiting text length, and converting character encodings. Then use the SimHash tool for Python to deduplicate the data. Finally, store the resulting question-answer pairs in a MySQL database and build a full-text retrieval service using Elasticsearch.

2) Use the Jieba Chinese word segmentation tool, add the default user-defined dictionary, configure a stop-word dictionary and enable stop-word removal, and perform Chinese word segmentation on all question-answer data.

3) Use the Gensim tool for Python to train a Word2Vec model on the segmented question-answer data from step 2). Training uses the Skip-gram algorithm with an output word vector dimension of 200, a training window size of 5, a minimum word frequency of 5 and a sampling threshold of 10^-4. The resulting Word2Vec model is used to output the 200-dimensional word vector of each word and the 10 words most related to each word.

4) Build a dictionary from the segmented question-answer data from step 2), ignoring words that occur fewer than 50 times; use the dictionary to construct the BOW model, and create a one-hot BOW vector for each word.

5) Build a text representation vector for every question using the TFIDF values, the BOW vectors and the Word2Vec related words. Specifically: for each word in each question, its BOW vector is multiplied by its TFIDF value to obtain the vector W1. At the same time, Word2Vec is used to obtain the 10 words most related to that word; the BOW vectors of these 10 related words are each multiplied by the introduction coefficient a = 0.1 and summed to obtain W2. W1 and W2 are summed to obtain the word vector W3. For each question, the word vectors W3 of its words are summed to obtain the text representation vector of the question.

6) Use Elasticsearch to perform full-text retrieval on the user's question and obtain 500 similar questions. Convert the similar questions to text representation vectors as in step 5), compute the cosine similarity between the vector of each similar question and that of the user's question, rank by similarity, and take the answers corresponding to the top 100 questions as candidate answers.
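
The cosine re-ranking of step 6) can be sketched on sparse vectors stored as `{dimension index: weight}` dictionaries. The function names, the toy vectors and the handling of empty vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: weight}."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_u * norm_v)

def top_candidates(query_vec, similar, n=100):
    """similar: list of (question_vec, answer) pairs.

    Returns the answers of the n questions most similar to the query.
    """
    ranked = sorted(similar, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [answer for _vec, answer in ranked[:n]]

# Hypothetical toy data: the query and three retrieved similar questions.
query = {0: 0.5, 2: 0.8}
similar = [
    ({1: 1.0}, "answer about tea"),
    ({0: 0.5, 2: 0.8}, "answer about ginger and alopecia"),
    ({2: 1.0}, "answer about alopecia"),
]
candidates = top_candidates(query, similar, n=2)
```

Here the question sharing no dimension with the query scores 0 and is dropped, while the two questions that overlap with the query are kept in order of similarity.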

The construction of the question-answer matching model based on deep learning is described as follows:

1) Build a question-answer data set using the question-answer database and the full-text retrieval service: for each question, its corresponding answer is the positive sample, and 299 other answers are randomly selected using Elasticsearch as negative samples, forming the question-answer data set.

2) Use the question-answer data set to train the neural-network-based question-answer matching model, which scores the degree of match between a question and an answer.
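
The data set construction in step 1) pairs each question with one positive answer and 299 negative answers. A minimal sketch follows, with one clearly labelled substitution: negatives are drawn uniformly at random from an in-memory list of the other answers, whereas the text draws them via Elasticsearch. The function name `build_matching_dataset` and the toy pairs are hypothetical.

```python
import random

def build_matching_dataset(qa_pairs, n_negative=299, seed=0):
    """Return (question, answer, label) triples.

    Each question gets its own answer with label 1 and up to n_negative
    other answers with label 0.
    Assumption: negatives are sampled uniformly from the other answers in
    memory; the text selects them with Elasticsearch instead.
    """
    rng = random.Random(seed)
    answers = [a for _q, a in qa_pairs]
    dataset = []
    for idx, (question, answer) in enumerate(qa_pairs):
        dataset.append((question, answer, 1))
        others = answers[:idx] + answers[idx + 1:]
        k = min(n_negative, len(others))  # small corpora may have fewer answers
        for neg in rng.sample(others, k):
            dataset.append((question, neg, 0))
    return dataset

# Toy corpus of 5 question-answer pairs with 3 negatives per question.
pairs = [(f"q{i}", f"a{i}") for i in range(5)]
data = build_matching_dataset(pairs, n_negative=3)
```

With the default `n_negative=299`, each question contributes 300 labelled pairs, provided the corpus holds at least 300 answers.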

This completes the construction of the entire model, i.e. the solid-line flow in Fig. 1.

The usage stage, corresponding to the dashed-line flow in Fig. 1, consists mainly of the following steps:

1) For the question posed by the user, use Elasticsearch full-text retrieval to obtain a similar question set consisting of 500 questions.

2) Perform Chinese word segmentation on the user's question with the Jieba segmentation tool; Jieba must still be configured with the user-defined dictionary and stop-word removal.

3) Build the text representation vectors of the user's question and of the similar question set using the BOW vectors, TFIDF model and Word2Vec model obtained in the construction stage.

4) By computing the cosine similarity between vectors, re-rank the similar question set against the user's question, and take the answers corresponding to the top 100 questions as candidate answers.

5) Use the question-answer matching model obtained in the construction stage to score the match between the user's question and each candidate answer, and return the highest-scoring answer to the user.

Embodiment

When a user poses the question "Can ginger promote hair growth?", the processing flow of the system is as follows:

1) Use the Elasticsearch full-text retrieval service to select 500 similar questions from the system database; all 500 questions obtained share words with the user's question.

2) Perform Chinese word segmentation with the Jieba tool, with the user-defined dictionary set and stop-word removal enabled; the segmented question is "ginger | treatment | alopecia".

3) Compute the TFIDF value of each word in the user's question and in the similar questions.

4) Using the BOW vectors, TFIDF model and Word2Vec model from the construction stage, build 200-dimensional text representation vectors for the user's question and the similar questions. The word vectors obtained from the Word2Vec model are shown in Fig. 2; each word is converted into a 200-dimensional vector.

5) By computing the cosine similarity between text representation vectors, select from the 500 similar questions the 100 questions most similar to the user's question, and take the answers corresponding to these 100 questions to form the candidate answer set.

6) Use the deep-learning-based question-answer matching model generated in the construction stage to score the match between the question and each candidate answer, and return the answer with the highest matching score to the user. The answer is "Hello, ginger cannot cure seborrheic dermatitis. Oral vitamin B, cystine and zinc gluconate are recommended, together with external use of compound ketoconazole shampoo to clean the scalp.", as shown in Fig. 3.

Claims

1. An automatic question-answering method based on deep learning, characterized by comprising the following steps:

1) crawling question-answer data for the relevant field from the Internet, obtaining question-answer pairs through surface-level cleaning, storing them in a relational database, and building a full-text retrieval service;

2) segmenting all question-answer data using a Chinese word segmentation tool, including adding a user-defined dictionary, segmenting the text and removing stop words;

3) training a Word2Vec model on the segmented question-answer data from step 2), and using the Word2Vec model to obtain the word vector of each word and the related words of each word;

4) building a BOW (bag-of-words) model on the segmented question-answer data from step 2) and using it to obtain a one-hot BOW vector for each word, and building a TFIDF model on the segmented question-answer data to obtain the TFIDF value of each word in each question text;

5) building a text representation vector for every question using the TFIDF values, the BOW vectors and the Word2Vec related words;

6) performing full-text retrieval on the user's question to obtain similar questions, computing the cosine similarity between the text representation vectors of step 5) for these similar questions, and after re-ranking obtaining the candidate answers to the question;

7) training a neural-network-based question-answer matching model that scores the degree of match between a question and a candidate answer;

8) when a user asks a question, segmenting it as in step 2), building its text representation vector as in step 5), obtaining candidate answers as in step 6), and using the model of step 7) to select the candidate answer with the highest matching score as the final answer.

2. The automatic question-answering method based on deep learning according to claim 1, characterized in that the Chinese word segmentation tool is the Jieba tool for Python; the Word2Vec model uses the Gensim tool for Python; and the neural network model is built with the TensorFlow tool.

3. The automatic question-answering method based on deep learning according to claim 1, characterized in that the parameters for training the Word2Vec model in step 3) are set as follows:

the Skip-gram algorithm is used, the output word vector dimension is 200, the training window size is 5, the minimum word frequency is 5, and the sampling threshold is 10^-4.

4. The automatic question-answering method based on deep learning according to claim 1, characterized in that the specific method in step 5) for building text representation vectors for all questions from the TFIDF values, BOW vectors and Word2Vec related words is:

4.1) for each word in each question, multiplying its BOW vector by its TFIDF value to obtain the vector W1, while using Word2Vec to obtain the 10 words most related to that word, multiplying the BOW vectors of these 10 related words each by the introduction coefficient a = 0.1 and summing them to obtain W2, and summing W1 and W2 to obtain the word vector W3;

4.2) for each question, summing the word vectors W3 of its words obtained in step 4.1) to obtain the text representation vector.

5. The automatic question-answering method based on deep learning according to claim 1, characterized in that step 8) is specifically:

5.1) using full-text retrieval to obtain the m questions most similar to the user's question as similar questions;

5.2) generating text representation vectors for the user's question and the m similar questions as in step 5), computing similarity using cosine similarity and sorting, and taking the answers of the top n of the m questions as candidate answers;

5.3) using the neural network model of step 7) to obtain the matching score between the question and each candidate answer, and after sorting returning the top-ranked answer to the user.