CN108021555A - A kind of Question sentence parsing measure based on depth convolutional neural networks - Google Patents
- Publication number
- CN108021555A CN108021555A CN201711162561.1A CN201711162561A CN108021555A CN 108021555 A CN108021555 A CN 108021555A CN 201711162561 A CN201711162561 A CN 201711162561A CN 108021555 A CN108021555 A CN 108021555A
- Authority
- CN
- China
- Prior art keywords
- question
- neural network
- convolutional neural
- sentence
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Fuzzy Systems (AREA)
- Human Computer Interaction (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a question similarity measurement method based on a deep convolutional neural network, comprising the following steps: S1, generate a raw corpus from pages related to the knowledge domain, collect the Chinese characters occurring in the raw corpus, and generate a character vector for each Chinese character; S2, replace each Chinese character in a question with its character vector to obtain the character vector set corresponding to the question, and compute the corresponding sentence-meaning vector from the character vector set with a convolutional neural network; S3, combine the questions in pairs, and obtain the similarity between two questions by computing the absolute value of the cosine of their sentence-meaning vectors. By analyzing individual characters, the method avoids the influence of word-segmentation errors on subsequent analysis; by extracting whole-sentence features from the entire question with the convolutional neural network, it avoids the sentence-meaning fragmentation caused by word-similarity matrices.
Description
Technical Field
The invention relates to a question similarity measuring method, in particular to a question similarity measuring method based on a deep convolutional neural network.
Background
The main functions of the financial self-service robot are business consultation, business handling, cash access, user guidance and the like. The business consultation function can be understood as a Chinese question-answering system aiming at the bank field, and the key technology is to carry out similarity calculation on the questions asked by the user and the questions in a bank question bank and return answers corresponding to the most similar questions. Because natural languages, especially spoken languages, have a variety of different expression modes for questions with the same meaning, how to calculate the similarity between questions according to the real semantics of the questions becomes a problem to be solved urgently.
Traditional question similarity calculation methods generally fall into two types: keyword-matching-based methods and machine-learning-based methods. Keyword-matching methods mainly calculate the similarity between two questions by comparing the frequency, position, order and other information of the same keywords in the two questions. They are computationally simple, but often perform poorly on long sentences, and especially on synonymous expressions phrased differently. Machine-learning methods mainly analyze a domain knowledge base to build a model between questions and question semantics, and use it to calculate the similarity between different questions. They are computationally more complex but handle synonyms better, and have therefore gradually become mainstream.
In recent years, with the success of deep learning techniques in fields such as speech and images, they have also been introduced into similarity calculation. Chinese patent CN106776545A, "a method for calculating similarity between short texts through a deep convolutional neural network", discloses a typical process: first segment the question into words, then convert each word into a word vector, and finally input a similarity matrix formed from all word vectors of the two questions into a convolutional neural network to calculate the similarity.
The method mainly has the following problems:
First, Chinese word segmentation cannot be completely accurate, and its accuracy is closely tied to the specific domain. For example, in the banking domain, the many technical terms make segmentation accuracy generally lower, and this lower accuracy affects subsequent calculation.
Second, such methods often use a similarity matrix between word vectors as the measure of question similarity; this splits the similarity between questions into similarities between words and destroys the overall semantics of the questions.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a question similarity measurement method based on a deep convolutional neural network, which is used for calculating the similarity between questions according to the implicit semantics between the questions.
The purpose of the invention is realized by the following technical scheme: a question similarity measurement method based on a deep convolutional neural network comprises the following steps:
s1, generating a raw corpus through related pages in the knowledge field, crawling Chinese characters appearing in the raw corpus, and generating a corresponding character vector of each Chinese character;
s2, replacing each Chinese character in the question with the corresponding character vector to obtain a character vector set corresponding to the question; the word vector set obtains corresponding sentence meaning vectors through the calculation of a convolution neural network;
and S3, combining the question sentences in pairs, and calculating the cosine function absolute values of the sentence meaning vectors corresponding to the two question sentences to obtain the similarity between the two question sentences.
The technical scheme of the invention is further defined as follows: the method for generating the raw corpus through the knowledge domain related pages in the step S1 comprises the following steps:
s11, compiling a web crawler by using a python language, and crawling knowledge-related webpages;
s12, preprocessing the webpage, removing webpage marks, invalid characters, mathematical formulas, pictures and tables, combining all the webpages, and generating an original raw corpus;
and S13, segmenting the original raw corpus according to punctuations, segmenting each sentence into a plurality of clauses, wherein each clause occupies one line, and combining all the clauses to generate a final raw corpus.
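Steps S11-S13 above can be sketched as a short script. This is a minimal illustration, not the patent's implementation; the helper name `build_corpus` and the exact punctuation set are assumptions.

```python
import re

def build_corpus(pages):
    """Split crawled page text into clauses, one clause per line.

    `pages` is a list of already-cleaned page strings (markup, invalid
    characters, formulas, pictures and tables removed, per step S12).
    Hypothetical helper for illustration only.
    """
    clauses = []
    for text in pages:
        # Step S13: split on Chinese and ASCII sentence/clause punctuation.
        for clause in re.split(r"[。！？；，.!?;,\n]", text):
            clause = clause.strip()
            if clause:
                clauses.append(clause)
    # Merge all clauses into the final raw corpus, one clause per line.
    return "\n".join(clauses)

corpus = build_corpus(["我要转账，怎么办？谢谢"])
```

The resulting corpus is the plain-text training input for the word2vec step that follows.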
As a further improvement of the present invention, in step S1, character vectors are generated with the skip-gram algorithm of the word2vec tool; the window size of the skip-gram algorithm is set to 2, and the dictionary contains 3500 common characters plus a UNK token, which replaces uncommon characters outside the 3500.
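The 3500-common-characters-plus-UNK dictionary can be sketched as follows. This is an illustrative sketch (with a tiny vocabulary size for demonstration); the function names are assumptions, not the patent's code.

```python
from collections import Counter

def build_vocab(corpus, vocab_size=3500):
    """Keep the `vocab_size` most frequent characters; all others map to UNK."""
    counts = Counter(ch for ch in corpus if not ch.isspace())
    common = [ch for ch, _ in counts.most_common(vocab_size)]
    char2id = {ch: i for i, ch in enumerate(common)}
    char2id["UNK"] = len(char2id)  # dictionary size is vocab_size + 1
    return char2id

def encode(sentence, char2id):
    """Replace each character with its dictionary id; unknowns become UNK."""
    unk = char2id["UNK"]
    return [char2id.get(ch, unk) for ch in sentence]

vocab = build_vocab("我要转账我要取款", vocab_size=4)
ids = encode("我要存款", vocab)  # "存" is out of vocabulary
```

With the full 3500-character dictionary, this yields the 3501-entry vocabulary mentioned in the embodiment below.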
As a further improvement of the present invention, the convolutional neural network of step S2 includes a convolutional layer and a pooling layer. The convolutional layer uses a kernel size of 2 × 200, where 2 means that only the association between 2 adjacent characters is considered and 200 is the dimension of the character vector; the number of convolution kernels in the convolutional layer is 100-200. The pooling layer employs 1-max pooling, i.e., taking the maximum of each feature dimension over the convolved features.
As a further improvement of the present invention, the method by which the convolutional neural network in step S2 calculates the corresponding sentence-meaning vector is as follows:
1) The sentence S contains n Chinese characters, each corresponding to a d-dimensional character vector v_i; after replacement the sentence is represented as S' = {v_1, v_2, …, v_n};
2) S' is input into the convolutional layer of the convolutional neural network to obtain the convolved result m^k = {m_1^k, m_2^k, …, m_(n-1)^k}, calculated as:
m_i^k = f(W_k · c_i + b_k)
wherein c_i = [v_i, v_(i+1)] (0 < i < n) is the vector formed by concatenating two adjacent character vectors, W_k is the k-th convolution kernel matrix of the convolutional neural network, b_k is the bias vector corresponding to the k-th convolution kernel, and f is the activation function;
3) The convolved result m^k is input into the pooling layer to obtain the pooled result p_k; the pooling is 1-max pooling, calculated as:
p_k = max(m_1^k, m_2^k, …, m_(n-1)^k)
wherein max is the maximum function, taking the maximum over all inputs m_i^k from the previous layer;
4) Steps 2) and 3) are executed repeatedly, 1-3 times;
5) The output p_k of the pooling layer of the last repetition is the sentence-meaning vector of sentence S.
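A single convolution-plus-1-max-pooling pass can be sketched numerically as below. This is a toy illustration with tiny dimensions, not the patent's implementation: the kernel is flattened to a vector, and the activation f is assumed to be tanh (the patent does not name it).

```python
import math

def conv_1max(sent_vecs, kernels, biases):
    """One convolution + 1-max pooling pass over a character-vector sentence.

    sent_vecs: list of n character vectors, each of dimension d.
    kernels:   list of K weight vectors, each of dimension 2*d
               (a window of 2 adjacent characters, as in the 2 x d kernel).
    biases:    list of K scalar biases.
    Returns the K-dimensional sentence-meaning vector (p_1, ..., p_K).
    """
    n = len(sent_vecs)
    pooled = []
    for W, b in zip(kernels, biases):
        feats = []
        for i in range(n - 1):
            c = sent_vecs[i] + sent_vecs[i + 1]  # concatenate [v_i, v_{i+1}]
            feats.append(math.tanh(sum(w * x for w, x in zip(W, c)) + b))
        pooled.append(max(feats))  # 1-max pooling over positions
    return pooled

# Toy example: n = 3 characters, d = 2, one kernel.
p = conv_1max([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
              [[0.5, 0.5, 0.5, 0.5]], [0.0])
```

With K kernels the output has K dimensions, matching the 100-200 kernels described above.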
As a further improvement of the present invention, in step S3 the formula for calculating the absolute value of the cosine of the sentence-meaning vectors corresponding to the two questions is:
sim(x, y) = |x · y| / (‖x‖ ‖y‖)
wherein x and y respectively denote the sentence-meaning vectors corresponding to question 1 and question 2, and the range of sim(x, y) is [0, 1].
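The absolute-cosine formula above can be written directly; a minimal sketch:

```python
import math

def sim(x, y):
    """Absolute cosine similarity: sim(x, y) = |x . y| / (||x|| ||y||), in [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return abs(dot) / (nx * ny)

# Opposite vectors: plain cosine would give -1, the absolute value gives 1.
s_opposite = sim([1.0, 2.0], [-1.0, -2.0])
s_orthogonal = sim([1.0, 0.0], [0.0, 1.0])
```

The absolute value is what keeps the range in [0, 1], matching the range of the sigmoid activation as discussed in the effects section.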
As a further improvement of the invention, the training method for the convolutional neural network that calculates the sentence-meaning vectors is:
1) Cluster the questions by their answers so that questions with the same answer fall into the same cluster; combine questions within the same cluster in pairs to generate positive samples, combine questions from different clusters in pairs to generate negative samples, and merge all positive and negative samples into a training set;
2) Configure a convolutional neural network based on the TensorFlow framework, wherein the maximum number of training iterations is 1000, the loss function is the L2-regularized mean square error, the batch size is 400, the number of convolution-kernel features is 200, the convolution kernel size is 2 × 200, and the pooling layer uses 1-max pooling;
3) Take a sample from the training set, replace it with the corresponding character-vector sample, and compute the corresponding sentence-meaning vectors through the whole neural network;
4) Calculate the absolute cosine value between the sample sentence-meaning vectors to obtain the sample similarity, and adjust the weights of the neural network according to the error between this similarity and the sample label;
5) Repeat steps 2) to 4) until the maximum number of training iterations is reached.
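The pair-generation in training step 1) can be sketched as follows. This is an illustrative sketch; the function name and the labels 1/0 for positive/negative samples are assumptions.

```python
from itertools import combinations

def build_training_set(clusters):
    """Build (q1, q2, label) pairs from answer clusters.

    `clusters` maps an answer id to the list of questions sharing that
    answer. Questions in the same cluster form positive pairs (label 1);
    questions from different clusters form negative pairs (label 0).
    """
    samples = []
    for qs in clusters.values():
        samples += [(a, b, 1) for a, b in combinations(qs, 2)]
    for k1, k2 in combinations(list(clusters), 2):
        samples += [(a, b, 0) for a in clusters[k1] for b in clusters[k2]]
    return samples

data = build_training_set({"a1": ["q1", "q2"], "a2": ["q3"]})
```

In practice the negative pairs vastly outnumber the positives, so real training would likely subsample them; the patent does not specify this.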
The outstanding effects of the invention are:
(1) Individual-character analysis is adopted, avoiding the influence of word-segmentation errors on subsequent analysis.
(2) The convolutional neural network extracts whole-sentence features from the entire question, avoiding the sentence-meaning fragmentation caused by using a word similarity matrix.
(3) An absolute-value function is added to the original cosine similarity formula so that its range is [0, 1]. This resolves the mismatch between the ranges of the sigmoid function (a common neural-network activation) and the cosine function, and also avoids negative similarities.
(4) The data required for training the deep learning model is generated, establishing the association between questions and their semantics.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the convolutional neural network structure of the present invention.
FIG. 3 is a flowchart of the similarity analysis method according to an embodiment of the present invention.
Detailed Description
Example one
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1-3, the question similarity measurement method based on the deep convolutional neural network of the present invention includes the following steps:
s1, generating a raw corpus through related pages in the knowledge field, crawling Chinese characters appearing in the raw corpus, and generating a corresponding character vector of each Chinese character;
s2, replacing each Chinese character in the question with the corresponding character vector to obtain a character vector set corresponding to the question; the word vector set obtains corresponding sentence meaning vectors through the calculation of a convolution neural network;
and S3, combining the question sentences in pairs, and calculating the cosine function absolute values of the sentence meaning vectors corresponding to the two question sentences to obtain the similarity between the two question sentences.
The operation of steps S1 to S3 above is described in detail below. This embodiment applies the method of the invention to similarity detection for questions in the financial domain.
And S1, generating a financial field corpus. The specific implementation steps are as follows:
Step 1, write a web crawler in the Python language and crawl finance-related web pages; the sites crawled in this embodiment include major bank websites, the finance sections of major portal sites, and professional finance websites.
Step 2, preprocess the web pages by removing page markup, invalid characters, mathematical formulas, pictures and tables, then merge all pages to generate the original raw corpus.
Step 3, further process the raw corpus: split it by punctuation so that each sentence becomes individual clauses, one clause per line, and merge all clauses to generate the final financial corpus.
Then generate the financial-domain character vectors. The specific implementation steps are as follows:
step 1, configuring a word2vec program based on the tensoflow framework. The specific configuration is as follows, the algorithm adopts a skip-gram algorithm, the loss function adopts nce _ loss, the sliding window is 2, the feature size of the word vector is 200, the training times are 3000000, the mini-batch size is 128, the anti-sampling times are 10, and the dictionary size is 3501. The specific implementation process can be adjusted according to the situation.
Step 2, use this program to learn from the financial corpus generated above, producing a corresponding character vector for each character in the dictionary.
Step S2: the calculation by which the convolutional neural network obtains the corresponding sentence-meaning vector is as follows:
Step 1, the sentence S contains n Chinese characters, each corresponding to a d-dimensional character vector v_i; after replacement the sentence is represented as S' = {v_1, v_2, …, v_n};
Step 2, S' is input into the convolutional layer of the convolutional neural network to obtain the convolved result m^k = {m_1^k, m_2^k, …, m_(n-1)^k}, calculated as:
m_i^k = f(W_k · c_i + b_k)
wherein c_i = [v_i, v_(i+1)] (0 < i < n) is the vector formed by concatenating two adjacent character vectors, W_k is the k-th convolution kernel matrix of the convolutional neural network, b_k is the bias vector corresponding to the k-th convolution kernel, and f is the activation function;
Step 3, the convolved result m^k is input into the pooling layer to obtain the pooled result p_k; the pooling is 1-max pooling, calculated as:
p_k = max(m_1^k, m_2^k, …, m_(n-1)^k)
wherein max is the maximum function, taking the maximum over all inputs m_i^k from the previous layer;
Step 4, steps 2 and 3 are executed repeatedly, 1-3 times;
Step 5, the output p_k of the pooling layer of the last repetition is the sentence-meaning vector of sentence S.
In step S3, the formula for calculating the absolute value of the cosine of the sentence-meaning vectors corresponding to the two questions is:
sim(x, y) = |x · y| / (‖x‖ ‖y‖)
wherein x and y respectively denote the sentence-meaning vectors corresponding to question 1 and question 2, and the range of sim(x, y) is [0, 1].
Step S2: the convolutional neural network that generates sentence-meaning vectors is trained as follows:
Step 1, cluster the questions by their answers so that questions with the same answer fall into the same cluster; combine questions within the same cluster in pairs to generate positive samples, combine questions from different clusters in pairs to generate negative samples, and merge all positive and negative samples into a training set.
Step 2, configure a convolutional neural network program based on the TensorFlow framework. The specific configuration: the maximum number of training iterations is 1000, the loss function is the L2-regularized mean square error (MSE), the batch size is 400, the number of convolution-kernel features is 200, the convolution kernel size is 2 × 200, and the pooling layer uses 1-max pooling.
Step 3, use the training sample set T generated in step 1.
Step 4, randomly take a training sample from T, sample = (Sen_1, Sen_2, p), and replace it with two character-vector sets S_vec1, S_vec2. The specific method: the i-th Chinese character C_i of question Sen_1 is replaced by the character vector generated in step S1. For example, for the question "I want to transfer money", whose Chinese characters are "I", "want", "transfer" and "account", assume the vector for "I" is {0.5, 0.7, 0.6} and the vectors for "want", "transfer" and "account" are {0.1, 0.2, 0.5}, {0.2, 0.3, 0.7} and {0.9, 0.2, 0.7} respectively; the whole sentence is then represented as the set of all these vectors, i.e., {{0.5, 0.7, 0.6}, {0.1, 0.2, 0.5}, {0.2, 0.3, 0.7}, {0.9, 0.2, 0.7}}.
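The character-to-vector replacement in step 4 can be sketched directly with the example vectors above. The lookup-table name is an assumption; the vectors are the illustrative ones from the example.

```python
def to_vector_set(question, char_vecs):
    """Replace each character of a question with its character vector.

    `char_vecs` is the (hypothetical) character-to-vector table produced
    by word2vec in step S1.
    """
    return [char_vecs[ch] for ch in question]

# Example vectors from the text, for the question "我要转账" ("I want to transfer money").
char_vecs = {"我": [0.5, 0.7, 0.6], "要": [0.1, 0.2, 0.5],
             "转": [0.2, 0.3, 0.7], "账": [0.9, 0.2, 0.7]}
vecs = to_vector_set("我要转账", char_vecs)
```

A real implementation would fall back to the UNK vector for characters outside the 3500-character dictionary.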
Step 5, input S_vec1 and S_vec2 into the convolutional neural network to obtain the sentence-meaning vectors S_rep1 and S_rep2.
Step 6, use the absolute-cosine formula sim(x, y) = |x · y| / (‖x‖ ‖y‖) to calculate the similarity between S_rep1 and S_rep2, then adjust the weights of the neural network according to the error between this similarity and the sample label.
Step 7, repeat steps 2 to 6 until a termination condition is met (such as reaching the maximum number of training iterations or a specified error). This embodiment uses the maximum number of training iterations as the termination condition.
In addition to the above steps, this embodiment can use the trained convolutional neural network to measure question similarity. The specific steps are:
Step 1, load the trained convolutional neural network model.
Step 2, load the bank question-answer library, and convert each question S_req_i in the library into its sentence-meaning vector S_rep_i using the method of step S2.
Step 3, receive the user's question S_request and convert it into its sentence-meaning vector S_rep_request in the same way.
Step 4, use the improved cosine function to calculate in turn the similarity sim_i between S_rep_request and each S_rep_i, take the maximum similarity sim_max, and return the answer of the question corresponding to sim_max as the final answer.
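The retrieval loop of steps 2-4 can be sketched as a nearest-question lookup over precomputed sentence vectors. This is an illustrative sketch; the data layout (a list of (vector, answer) pairs) is an assumption.

```python
import math

def best_answer(request_vec, bank):
    """Return the answer of the stored question most similar to the request.

    `bank` is a list of (sentence_vector, answer) pairs precomputed for the
    question-answer library; similarity is the absolute cosine from above.
    """
    def sim(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return abs(dot) / (math.sqrt(sum(a * a for a in x)) *
                           math.sqrt(sum(b * b for b in y)))
    # Take the entry with maximum similarity and return its answer.
    return max(bank, key=lambda item: sim(request_vec, item[0]))[1]

ans = best_answer([1.0, 0.0],
                  [([0.9, 0.1], "answer A"), ([0.1, 0.9], "answer B")])
```

Because the library vectors are precomputed once in step 2, each incoming question costs only one CNN forward pass plus a linear scan of similarities.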
This embodiment adopts individual-character analysis, avoiding the influence of word-segmentation errors on subsequent analysis. The convolutional neural network extracts whole-sentence features from the entire question, avoiding the sentence-meaning fragmentation caused by using a word similarity matrix. An absolute-value function is added to the original cosine similarity formula so that its range is [0, 1], resolving the mismatch between the ranges of the sigmoid function (a common neural-network activation) and the cosine function and avoiding negative similarities. The data required for deep-learning model training is generated, establishing the association between questions and their semantics. In addition to the above embodiments, the present invention may have other embodiments; all technical solutions formed by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.
Claims (7)
1. A question similarity measurement method based on a deep convolutional neural network is characterized by comprising the following steps:
s1, generating a raw corpus through related pages in the knowledge field, crawling Chinese characters appearing in the raw corpus, and generating a corresponding character vector of each Chinese character;
s2, replacing each Chinese character in the question with the corresponding character vector to obtain a character vector set corresponding to the question; the word vector set obtains corresponding sentence meaning vectors through the calculation of a convolutional neural network;
and S3, combining the question sentences in pairs, and calculating the cosine function absolute value of the sentence meaning vector corresponding to the two question sentences to obtain the similarity between the two question sentences.
2. The method for measuring question similarity based on the deep convolutional neural network according to claim 1, wherein the method for generating the corpus by the knowledge domain related pages in the step S1 is as follows:
s11, compiling a web crawler by using a python language, and crawling knowledge-related webpages;
s12, preprocessing the webpage, removing webpage marks, invalid characters, mathematical formulas, pictures and tables, combining all the webpages, and generating an original raw corpus;
and S13, segmenting the original raw corpus according to punctuations, segmenting each sentence into a plurality of clauses, wherein each clause occupies one line, and combining all the clauses to generate a final raw corpus.
3. The question similarity measurement method based on the deep convolutional neural network according to claim 1, wherein in step S1 character vectors are generated using the skip-gram algorithm of the word2vec tool; the window size of the skip-gram algorithm is set to 2, and the dictionary contains 3500 common characters plus a UNK token, which replaces uncommon characters outside the 3500.
4. The question similarity measurement method based on the deep convolutional neural network according to claim 1, wherein the convolutional neural network of step S2 includes a convolutional layer and a pooling layer; the convolutional layer uses a kernel size of 2 × 200, where 2 means that only the association between 2 adjacent characters is considered and 200 is the dimension of the character vector, and the number of convolution kernels in the convolutional layer is 100-200; the pooling layer employs 1-max pooling, i.e., taking the maximum of each feature dimension over the convolved features.
5. The question similarity measurement method based on the deep convolutional neural network according to claim 1, wherein the calculation by which the convolutional neural network in step S2 obtains the corresponding sentence-meaning vector is:
1) The sentence S contains n Chinese characters, each corresponding to a d-dimensional character vector v_i; after replacement the sentence is represented as S' = {v_1, v_2, …, v_n};
2) S' is input into the convolutional layer of the convolutional neural network to obtain the convolved result m^k = {m_1^k, m_2^k, …, m_(n-1)^k}, calculated as:
m_i^k = f(W_k · c_i + b_k)
wherein c_i = [v_i, v_(i+1)] (0 < i < n) is the vector formed by concatenating two adjacent character vectors, W_k is the k-th convolution kernel matrix of the convolutional neural network, b_k is the bias vector corresponding to the k-th convolution kernel, and f is the activation function;
3) The convolved result m^k is input into the pooling layer to obtain the pooled result p_k; the pooling is 1-max pooling, calculated as:
p_k = max(m_1^k, m_2^k, …, m_(n-1)^k)
wherein max is the maximum function, taking the maximum over all inputs m_i^k from the previous layer;
4) Steps 2) and 3) are executed repeatedly, 1-3 times;
5) The output p_k of the pooling layer of the last repetition is the sentence-meaning vector of sentence S.
6. The question similarity measurement method based on the deep convolutional neural network according to claim 1, wherein the formula for calculating the absolute value of the cosine of the sentence-meaning vectors corresponding to the two questions in step S3 is:
sim(x, y) = |x · y| / (‖x‖ ‖y‖)
wherein x and y respectively denote the sentence-meaning vectors corresponding to question 1 and question 2, and the range of sim(x, y) is [0, 1].
7. The question similarity measurement method based on the deep convolutional neural network according to claim 1, wherein the training method for the convolutional neural network that calculates the sentence-meaning vectors is:
1) Cluster the questions by their answers so that questions with the same answer fall into the same cluster; combine questions within the same cluster in pairs to generate positive samples, combine questions from different clusters in pairs to generate negative samples, and merge all positive and negative samples into a training set;
2) Configure a convolutional neural network based on the TensorFlow framework, wherein the maximum number of training iterations is 1000, the loss function is the L2-regularized mean square error, the batch size is 400, the number of convolution-kernel features is 200, the convolution kernel size is 2 × 200, and the pooling layer uses 1-max pooling;
3) Take a sample from the training set, replace it with the corresponding character-vector sample, and compute the corresponding sentence-meaning vectors through the whole neural network;
4) Calculate the absolute cosine value between the sample sentence-meaning vectors to obtain the sample similarity, and adjust the weights of the neural network according to the error between this similarity and the sample label;
5) Repeat steps 2) to 4) until the maximum number of training iterations is reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711162561.1A CN108021555A (en) | 2017-11-21 | 2017-11-21 | A kind of Question sentence parsing measure based on depth convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711162561.1A CN108021555A (en) | 2017-11-21 | 2017-11-21 | A kind of Question sentence parsing measure based on depth convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108021555A true CN108021555A (en) | 2018-05-11 |
Family
ID=62080014
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711162561.1A Pending CN108021555A (en) | 2017-11-21 | 2017-11-21 | A kind of Question sentence parsing measure based on depth convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021555A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
CN106776545A (en) * | 2016-11-29 | 2017-05-31 | 西安交通大学 | Method for calculating similarity between short texts via a deep convolutional neural network |
CN106815311A (en) * | 2016-12-21 | 2017-06-09 | 杭州朗和科技有限公司 | Question matching method and device |
CN106844741A (en) * | 2017-02-13 | 2017-06-13 | 哈尔滨工业大学 | Domain-specific question answering method |
CN106897568A (en) * | 2017-02-28 | 2017-06-27 | 北京大数医达科技有限公司 | Method and apparatus for structuring medical records |
- 2017-11-21: Application CN201711162561.1A filed in China (published as CN108021555A); status: Pending
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109102809A (en) * | 2018-06-22 | 2018-12-28 | 北京光年无限科技有限公司 | Dialogue method and system for an intelligent robot |
CN108984694A (en) * | 2018-07-04 | 2018-12-11 | 龙马智芯(珠海横琴)科技有限公司 | Webpage processing method and device, storage medium, and electronic device |
CN109062892A (en) * | 2018-07-10 | 2018-12-21 | 东北大学 | Chinese sentence similarity calculation method based on Word2Vec |
CN109241249A (en) * | 2018-07-16 | 2019-01-18 | 阿里巴巴集团控股有限公司 | Method and device for determining a burst problem |
CN109241249B (en) * | 2018-07-16 | 2021-09-14 | 创新先进技术有限公司 | Method and device for determining burst problem |
CN109145290A (en) * | 2018-07-25 | 2019-01-04 | 东北大学 | Semantic similarity calculation method based on word vectors and a self-attention mechanism |
CN109145290B (en) * | 2018-07-25 | 2020-07-07 | 东北大学 | Semantic similarity calculation method based on word vector and self-attention mechanism |
CN109101494A (en) * | 2018-08-10 | 2018-12-28 | 哈尔滨工业大学(威海) | Method, device, and computer-readable storage medium for calculating Chinese sentence semantic similarity |
CN110969005A (en) * | 2018-09-29 | 2020-04-07 | 航天信息股份有限公司 | Method and device for determining similarity between entity corpora |
CN110969005B (en) * | 2018-09-29 | 2023-10-31 | 航天信息股份有限公司 | Method and device for determining similarity between entity corpora |
CN109543179A (en) * | 2018-11-05 | 2019-03-29 | 北京康夫子科技有限公司 | Method and system for normalizing colloquial symptom descriptions |
CN111666482B (en) * | 2019-03-06 | 2022-08-02 | 珠海格力电器股份有限公司 | Query method and device, storage medium and processor |
CN111666482A (en) * | 2019-03-06 | 2020-09-15 | 珠海格力电器股份有限公司 | Query method and device, storage medium and processor |
CN109918491B (en) * | 2019-03-12 | 2022-07-29 | 焦点科技股份有限公司 | Intelligent customer service question matching method based on knowledge base self-learning |
CN109918491A (en) * | 2019-03-12 | 2019-06-21 | 焦点科技股份有限公司 | Intelligent customer service question matching method based on knowledge base self-learning |
CN111753081A (en) * | 2019-03-28 | 2020-10-09 | 百度(美国)有限责任公司 | Text classification system and method based on deep SKIP-GRAM network |
CN111753081B (en) * | 2019-03-28 | 2023-06-09 | 百度(美国)有限责任公司 | System and method for text classification based on deep SKIP-GRAM network |
CN110032635B (en) * | 2019-04-22 | 2023-01-20 | 齐鲁工业大学 | Question pair matching method and device based on deep feature fusion neural network |
CN110032635A (en) * | 2019-04-22 | 2019-07-19 | 齐鲁工业大学 | Question pair matching method and device based on a deep feature fusion neural network |
CN110309503A (en) * | 2019-05-21 | 2019-10-08 | 昆明理工大学 | Subjective question scoring model and scoring method based on deep learning BERT-CNN |
CN110348024A (en) * | 2019-07-23 | 2019-10-18 | 天津汇智星源信息技术有限公司 | Intelligent identification system based on a legal knowledge graph |
CN111669410A (en) * | 2020-07-24 | 2020-09-15 | 中国航空油料集团有限公司 | Industrial control network negative sample data generation method, device, server and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108021555A (en) | Question sentence parsing and measurement method based on a deep convolutional neural network | |
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
CN107562792B (en) | Question-answer matching method based on deep learning | |
CN106997376B (en) | Question and answer sentence similarity calculation method based on multi-level features | |
CN106776545B (en) | Method for calculating similarity between short texts through deep convolutional neural network | |
CN107818164A (en) | Intelligent question answering method and system | |
CN111831789B (en) | Question-answering text matching method based on multi-layer semantic feature extraction structure | |
CN111563384B (en) | Evaluation object identification method and device for E-commerce products and storage medium | |
CN112035730B (en) | Semantic retrieval method and device and electronic equipment | |
CN111368049A (en) | Information acquisition method and device, electronic equipment and computer readable storage medium | |
CN104615767A (en) | Searching-ranking model training method and device and search processing method | |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords | |
CN112052319B (en) | Intelligent customer service method and system based on multi-feature fusion | |
CN111444704A (en) | Network security keyword extraction method based on deep neural network | |
CN113486645A (en) | Text similarity detection method based on deep learning | |
Rahman et al. | NLP-based automatic answer script evaluation | |
CN110334204B (en) | Exercise similarity calculation recommendation method based on user records | |
CN112434533A (en) | Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium | |
Mahmoodvand et al. | Semi-supervised approach for Persian word sense disambiguation | |
CN114169447B (en) | Event detection method based on self-attention convolution bidirectional gating cyclic unit network | |
CN116757188A (en) | Cross-language information retrieval training method based on alignment query entity pairs | |
CN110287396A (en) | Text matching method and device | |
CN111767388B (en) | Candidate pool generation method | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN110413956B (en) | Text similarity calculation method based on bootstrapping |
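The similar documents above cluster around one shared idea: encode each sentence with a convolutional network (embed words, slide filters over word windows, max-pool) and compare the resulting fixed-size vectors. As a rough illustration of that general technique only — not the patented method, and with vocabulary, dimensions, and random weights that are purely illustrative assumptions — a minimal sketch:

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative weights are reproducible
VOCAB = {"how": 0, "to": 1, "reset": 2, "change": 3, "my": 4, "password": 5}
DIM, WIN, NFILT = 8, 3, 4  # embedding size, filter window, number of filters
EMB = [[random.gauss(0, 1) for _ in range(DIM)] for _ in VOCAB]
FILTERS = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(WIN)]
           for _ in range(NFILT)]

def encode(sentence):
    """Embed the words, convolve each filter over word windows, max-pool."""
    words = [EMB[VOCAB[w]] for w in sentence.split()]
    feats = []
    for f in FILTERS:
        # response of this filter at each window position
        responses = [sum(words[i + j][d] * f[j][d]
                         for j in range(WIN) for d in range(DIM))
                     for i in range(len(words) - WIN + 1)]
        feats.append(max(responses))  # max-pooling over positions
    return feats

def similarity(a, b):
    """Cosine similarity between the two pooled sentence vectors."""
    va, vb = encode(a), encode(b)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb)
```

In a trained system the embeddings and filters would be learned from question pairs rather than sampled at random; the sketch only shows the data flow the cited documents have in common.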
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2018-05-11 |