CN106776545B - Method for calculating similarity between short texts through deep convolutional neural network

Method for calculating similarity between short texts through deep convolutional neural network

Info

Publication number
CN106776545B
CN106776545B (application CN201611076255.1A)
Authority
CN
China
Prior art keywords
similarity
matrix
matrixes
similar
vector
Prior art date
Legal status
Active
Application number
CN201611076255.1A
Other languages
Chinese (zh)
Other versions
CN106776545A (en)
Inventor
魏笔凡
郭朝彤
刘均
郑庆华
吴蓓
郑元浩
石磊
吴科炜
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN201611076255.1A
Publication of CN106776545A
Application granted
Publication of CN106776545B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a method for calculating the similarity between short texts with a deep convolutional neural network. Its aim is to use every word appearing in the short texts when calculating their similarity, so that the similarity value is computed more accurately. The technical scheme is as follows: 1) represent a number of short texts as matrices, replacing each word in a text, in order, by its corresponding word vector to obtain an ordered vector sequence that is regarded as a matrix; 2) generate a similarity matrix for the two matrices representing the target short texts, arranging the cosine similarities between their word vectors to obtain the similarity matrix; 3) tile the rows and columns of the similarity matrices to the same dimensions; 4) reduce each similarity matrix to a single value as the similarity: for all same-dimension similarity matrices, train a deep convolutional neural network to reduce their dimensionality, then compute the degree of similarity with a multi-layer perceptron to represent the similarity value.

Description

Method for calculating similarity between short texts through deep convolutional neural network
Technical Field
The invention relates to a method for calculating the similarity between texts, in particular to a method for calculating the similarity between short texts through a deep convolutional neural network.
Background
With the development of community question-answering websites, large numbers of questions and answers of different types are mixed together, making it hard for users to find useful or interesting content. One way to solve this problem is to classify the questions and answers of the community question-answering system so that users can conveniently search and browse the topics that interest them. Manually classifying these questions and answers requires a great deal of domain expertise and consumes considerable time and effort. Moreover, as community question-answering systems become widely used, new questions and answers appear ever faster, and manual labeling cannot keep up. It is therefore an urgent task to find an effective short-text representation and to compute the similarity between texts for the massive fragmented knowledge of community question-answering systems.
Chinese patent CN201310661778.2, "Semantic-based text similarity calculation method", discloses a method of three steps: (1) preprocess the text set and extract initial feature words; (2) represent the initial feature words as a vector model consisting of keywords and concepts; (3) calculate the semantic similarity of the keyword part and of the concept part separately, and sum the two parts to obtain the semantic similarity of the texts.
That patent computes the similarity between texts by calculating the semantic similarity of the keyword part and of the concept part separately, but keywords and concepts cannot stand in for the full texts. The basis of its text similarity calculation is therefore incomplete and cannot fully represent the similarity between two pieces of text.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for calculating the similarity between short texts through a deep convolutional neural network, which can calculate the similarity between the short texts by using each word appearing in the short texts, so that the calculation of the similarity value is more accurate.
To achieve the above purpose, the invention adopts the following technical scheme, which comprises the steps of:
1) representing a number of short texts as matrices: crawl the words appearing in pages related to all knowledge domains on Wikipedia as a word list, train the word list so that each word obtains a word vector, and replace each word in a short text, in order, by its corresponding word vector to obtain an ordered vector sequence that is regarded as a matrix;
2) combining the short texts pairwise and generating a similarity matrix for the two matrices in each pair: for the two short texts in each pair, take the two corresponding matrices, compute the cosine similarities between their word vectors in turn, and arrange these values to obtain the similarity matrix of the pair;
3) tiling the rows and columns of the similarity matrices to the same dimensions: count the numbers of rows and columns of all existing similarity matrices, find the maximum number of rows and the maximum number of columns, and tile all similarity matrices against these maxima so that they share the same numbers of rows and columns; such matrices are called same-dimension similarity matrices;
4) reducing each similarity matrix to a single value as the similarity: for all same-dimension similarity matrices, train a deep convolutional neural network to reduce their dimensionality, then compute the degree of similarity with a multi-layer perceptron to represent the similarity value, completing the similarity calculation between the short texts.
In step 1), the word list is trained with the open-source word2vec code published by Google on the internet.
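For illustration only (this sketch is not part of the patent), training such a word list could look as follows in Python, with the gensim library standing in for Google's original word2vec release; the toy corpus and the hyperparameters are assumptions:

```python
# Sketch: train word vectors on tokenized Wikipedia text, as in step 1).
# gensim stands in for Google's word2vec code; vector size, window and
# the toy corpus below are illustrative assumptions, not patent specifics.
from gensim.models import Word2Vec

corpus = [
    ["convolutional", "networks", "process", "text"],
    ["cosine", "similarity", "compares", "word", "vectors"],
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)
vec = model.wv["cosine"]  # the 100-dimensional word vector for "cosine"
```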
Also in step 1), when the words appearing in pages related to all knowledge domains on Wikipedia are crawled, repeated words and letter-number combinations are eliminated.
In step 1), before the words in a short text are sequentially replaced by word vectors, the short text is preprocessed: letter-number combinations and punctuation marks are removed first, then stop words are defined and eliminated.
The specific steps for generating the similarity matrix of the two matrices in each pair in step 2) are as follows:
2.1) take one word vector from each of the two matrices; with the two vectors denoted a and b, the cosine similarity is calculated as

cos(a, b) = (a · b) / (||a|| · ||b||)

where a · b denotes the dot product of vectors a and b, and ||a|| and ||b|| denote the moduli of vectors a and b, respectively;
2.2) for each pair of row vectors of the two matrices, compute the cosine similarity in turn and use it as the value at the corresponding position of the similarity matrix:

σ_ij = cos(Q_i, W_j)

where Q_i denotes the row vector of the i-th row of matrix Q, W_j denotes the row vector of the j-th row of matrix W, and σ_ij denotes the value at row i, column j of the similarity matrix; the resulting similarity matrix has as many rows as matrix Q and as many columns as matrix W has rows.
In step 2.1), since the dimension of the word vectors is fixed (say n dimensions), the cosine similarity of the two vectors a and b can be calculated as

cos(a, b) = (Σ_{k=1}^{n} a_k b_k) / (√(Σ_{k=1}^{n} a_k²) · √(Σ_{k=1}^{n} b_k²))

where x_k denotes the value of the k-th dimension of vector x.
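For a concrete illustration (an added example, not from the original text): with the 3-dimensional vectors a = (1, 0, 1) and b = (1, 1, 0), cos(a, b) = (1·1 + 0·1 + 1·0) / (√(1² + 0² + 1²) · √(1² + 1² + 0²)) = 1 / (√2 · √2) = 0.5.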
The specific process of tiling the rows and columns of the similarity matrices to the same dimensions in step 3) is as follows:
3.1) count the numbers of rows and columns of all similarity matrices and find the maximum number of rows row_max and the maximum number of columns col_max;
3.2) when tiling a matrix, arrange copies of it closely until the number of rows reaches row_max and the number of columns reaches col_max; if the required dimensions cannot be hit exactly, the redundant part is deleted.
The specific process of reducing a similarity matrix to a single value as the similarity in step 4) is as follows:
4.1) train all same-dimension similarity matrices with a deep convolutional neural network, reducing each similarity matrix to a vector after it passes, in turn, through two convolutional layers, two pooling layers, and a fully connected layer;
4.2) process the vectors produced by the deep convolutional neural network with a multi-layer perceptron, finally reducing each vector to two values, a degree of similarity and a degree of dissimilarity; the degree-of-similarity value represents the similarity between the short texts.
Compared with the prior art, the invention calculates the similarity between short texts with a deep convolutional neural network. Words are expressed as vectors through training, quantizing the text; a similarity matrix is constructed from the cosine similarities between word vectors; all similarity matrices are tiled to the same dimensions without loss of features; and finally the similarity is calculated with a deep convolutional neural network and a multi-layer perceptron. Representing the short texts as matrices turns unquantized short texts into a quantifiable matrix form, which makes it convenient to calculate the similarity between texts. Combining the short texts pairwise and generating the similarity matrix of each pair from the cosine similarities between word vectors is simple to compute, and the similarity matrix lays the foundation for calculating the similarity. Tiling the rows and columns of the similarity matrices unifies them to the same dimensions without causing loss of features. Reducing each similarity matrix to a single similarity value by training a deep convolutional neural network and a multi-layer perceptron yields a well-trained model for text similarity and improves the accuracy of the computed similarity value.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a diagram of a model for generating a similarity matrix according to the present invention;
FIG. 3 is a diagram of the tiling method for matrices of the present invention.
Detailed Description
The invention is further explained below with reference to specific embodiments and the accompanying drawings.
The invention comprises the following steps:
(1) Represent the short texts as matrices: first, select the words appearing in pages related to all knowledge domains on Wikipedia as the word list; then train the word list with the open-source word2vec code published by Google, so that each word is represented as a vector; finally, replace each word in a text, in order, by its word vector from the word list, each word vector occupying one row, giving an ordered vector sequence that can be regarded as a matrix whose number of rows is the number of words;
(2) combine the short texts pairwise and generate the similarity matrix of the two short texts in each pair: first, for two pieces of text, take the two corresponding matrices from step (1); taking one vector from each matrix, the cosine of the two vectors statistically measures the similarity of the corresponding pieces of text; next, compute the cosine similarity between each pair of row vectors of the two matrices and use it as the value at the corresponding position of the similarity matrix; finally, a completely filled similarity matrix is obtained;
(3) tile the rows and columns of the similarity matrices to the same dimensions: first, count the numbers of rows and columns of the similarity matrices from step (2) and find the maximum number of rows and the maximum number of columns; then tile all similarity matrices against these maxima so that they share the same dimensions;
(4) reduce each similarity matrix to a single value as the similarity: for all same-dimension similarity matrices obtained in step (3), train a deep convolutional neural network to reduce their dimensionality, then compute the degree of similarity with a multi-layer perceptron to represent the similarity value.
Referring to FIG. 1, the invention specifically comprises the following four processes:
(1) Representing a number of short texts as matrices comprises the following 4 steps:
Step 1: crawl the words appearing in pages related to all knowledge domains on Wikipedia; eliminate words that occur repeatedly and words such as "t1" that are combinations of letters and numbers;
Step 2: train the word list with the open-source code published by Google, so that each word is represented as a vector, finally obtaining the word list;
Step 3: preprocess the short-text data, i.e., remove letter-number combinations such as "t1" and the punctuation marks in the text, then define and remove stop words;
Step 4: replace each word in the short text, in order, by its word vector from the word list, each word vector occupying one row; this gives an ordered vector sequence that can be regarded as a matrix whose number of rows is the number of words (see the sketch below);
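As a hedged sketch of steps 3 and 4 (illustrative Python, not the patent's own code; the stop-word set and the trained `model` from step 2 are assumptions):

```python
# Sketch: preprocess a short text and represent it as a matrix of word
# vectors, one row per word. `model` is assumed to be the word2vec model
# trained in step 2 (e.g. gensim); the stop-word set is an assumption.
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "and", "is"}

def text_to_matrix(text, model):
    rows = []
    for raw in text.lower().split():
        tok = raw.strip(".,;:!?\"'()")            # remove punctuation
        if not tok or any(c.isdigit() for c in tok):
            continue                               # drop combos like "t1"
        if tok in STOP_WORDS or tok not in model.wv:
            continue                               # drop stop words / unknowns
        rows.append(model.wv[tok])
    return np.vstack(rows)                         # row count = word count
```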
(2) Generating the similarity matrix of the two matrices representing the target short texts comprises 2 steps:
Step 1: let the two vectors be a and b; the cosine similarity between them is calculated as

cos(a, b) = (a · b) / (||a|| · ||b||)    (1)

where a · b denotes the dot product of vectors a and b, and ||a|| and ||b|| denote the moduli of vectors a and b, respectively. Since the word vector dimension is fixed (say n dimensions), this can be written as

cos(a, b) = (Σ_{k=1}^{n} a_k b_k) / (√(Σ_{k=1}^{n} a_k²) · √(Σ_{k=1}^{n} b_k²))    (2)

where x_k denotes the value of the k-th dimension of vector x;
Step 2: for each pair of row vectors of the two matrices, compute the cosine similarity in turn and use it as the value at the corresponding position of the similarity matrix:

σ_ij = cos(Q_i, W_j)    (3)

where Q_i denotes the row vector of the i-th row of matrix Q, W_j denotes the row vector of the j-th row of matrix W, and σ_ij denotes the value at row i, column j of the similarity matrix; the resulting similarity matrix has as many rows as matrix Q and as many columns as matrix W has rows, as illustrated in FIG. 2 (a code sketch follows);
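A minimal numpy sketch of this step (an illustration; the function name is ours, and Q and W follow the notation above):

```python
# Sketch: similarity matrix of two word-vector matrices Q and W, where
# sigma[i, j] = cos(Q_i, W_j); the result has Q's row count as its rows
# and W's row count as its columns, matching formula (3).
import numpy as np

def similarity_matrix(Q, W):
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)  # unit-length rows
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Qn @ Wn.T  # dot products of unit vectors are cosines
```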
(3) Tiling the rows and columns of the similarity matrices to the same dimensions comprises 2 steps:
Step 1: count the numbers of rows and columns of all current similarity matrices and find the maximum number of rows row_max and the maximum number of columns col_max;
Step 2: when tiling a matrix, arrange copies of it closely until the number of rows reaches row_max and the number of columns reaches col_max; if the required dimensions cannot be hit exactly, the redundant part is deleted, as shown in FIG. 3 (see the sketch below);
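Under the reading above (copies of the matrix arranged edge to edge and the overhang trimmed), the tiling might be sketched as:

```python
# Sketch: tile a similarity matrix up to (row_max, col_max) by repeating
# it and deleting the redundant part, per FIG. 3 as described.
import numpy as np

def tile_to(matrix, row_max, col_max):
    rows, cols = matrix.shape
    reps = (-(-row_max // rows), -(-col_max // cols))  # ceiling division
    return np.tile(matrix, reps)[:row_max, :col_max]
```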
(4) Reducing each similarity matrix to a single value as the similarity comprises 2 steps:
Step 1: train all same-dimension similarity matrices with a deep convolutional neural network, reducing each similarity matrix to a vector after it passes through two convolutional layers, two pooling layers, and a fully connected layer;
Step 2: process the vectors produced by the deep convolutional neural network with a multi-layer perceptron, finally reducing each vector to two values, a degree of similarity and a degree of dissimilarity; the degree-of-similarity value represents the similarity between the short texts (a network sketch follows).
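A hedged PyTorch sketch of the network in process (4); the patent fixes only the layer sequence (two convolutional layers, two pooling layers, a fully connected layer, then a multi-layer perceptron ending in two values), so channel counts, kernel sizes, and the framework choice are assumptions:

```python
# Sketch: two conv layers, two pooling layers, a fully connected layer,
# then an MLP reducing each vector to two values: degree of similarity
# and degree of dissimilarity. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SimilarityNet(nn.Module):
    def __init__(self, row_max, col_max):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 16 * (row_max // 4) * (col_max // 4)
        self.fc = nn.Linear(flat, 128)   # reduces each matrix to a vector
        self.mlp = nn.Sequential(
            nn.ReLU(), nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),            # [similarity, dissimilarity]
        )

    def forward(self, x):                # x: (batch, 1, row_max, col_max)
        v = self.features(x).flatten(1)
        return self.mlp(self.fc(v))
```

Training against similar/dissimilar labels, for example with cross-entropy over the two outputs, would complete this step; the similarity output then represents the similarity value.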

Claims (7)

1. A method for calculating similarity between short texts through a deep convolutional neural network is characterized by comprising the following steps:
1) a number of short texts are represented as matrices: crawl the words appearing in pages related to all knowledge domains on Wikipedia as a word list, train the word list so that each word obtains a word vector, and replace each word in a short text, in order, by its corresponding word vector to obtain an ordered vector sequence that is regarded as a matrix;
2) the short texts are combined pairwise and a similarity matrix is generated for the two matrices in each pair: for the two short texts in each pair, take the two corresponding matrices, compute the cosine similarities between their word vectors in turn, and arrange these values to obtain the similarity matrix of the pair;
3) the rows and columns of the similarity matrices are tiled to the same dimensions: count the numbers of rows and columns of all existing similarity matrices, find the maximum number of rows and the maximum number of columns, and tile all similarity matrices against these maxima so that they share the same numbers of rows and columns; such matrices are called same-dimension similarity matrices;
4) each similarity matrix is reduced to a single value as the similarity: for all same-dimension similarity matrices, train a deep convolutional neural network to reduce their dimensionality, then compute the degree of similarity with a multi-layer perceptron to represent the similarity value, completing the similarity calculation between the short texts; the specific process is as follows:
4.1) train all same-dimension similarity matrices with a deep convolutional neural network, reducing each similarity matrix to a vector after it passes, in turn, through two convolutional layers, two pooling layers, and a fully connected layer;
4.2) process the vectors produced by the deep convolutional neural network with a multi-layer perceptron, finally reducing each vector to two values, a degree of similarity and a degree of dissimilarity; the degree-of-similarity value represents the similarity between the short texts.
2. The method for calculating similarity between short texts through a deep convolutional neural network of claim 1, wherein the word list in step 1) is trained with the open-source word2vec code published by Google on the internet.
3. The method for calculating similarity between short texts through a deep convolutional neural network of claim 2, wherein, when the words appearing in pages related to all knowledge domains on Wikipedia are crawled in step 1), repeated words and letter-number combinations are eliminated.
4. The method for calculating similarity between short texts through a deep convolutional neural network of claim 3, wherein the short texts are preprocessed before their words are sequentially replaced by word vectors in step 1): letter-number combinations and punctuation marks are removed first, then stop words are defined and eliminated.
5. The method for calculating similarity between short texts through a deep convolutional neural network of claim 1, wherein the specific steps for generating the similarity matrix of the two matrices in each pair in step 2) are as follows:
2.1) take one word vector from each of the two matrices; with the two vectors denoted a and b, the cosine similarity is calculated as

cos(a, b) = (a · b) / (||a|| · ||b||)

where a · b denotes the dot product of vectors a and b, and ||a|| and ||b|| denote the moduli of vectors a and b, respectively;
2.2) for each pair of row vectors of the two matrices, compute the cosine similarity in turn and use it as the value at the corresponding position of the similarity matrix:

σ_ij = cos(Q_i, W_j)

where Q_i denotes the row vector of the i-th row of matrix Q, W_j denotes the row vector of the j-th row of matrix W, and σ_ij denotes the value at row i, column j of the similarity matrix; the resulting similarity matrix has as many rows as matrix Q and as many columns as matrix W has rows.
6. The method for calculating similarity between short texts through a deep convolutional neural network of claim 5, wherein the dimension of the word vectors in step 2.1) is fixed (say n dimensions), and the cosine similarity of the two vectors a and b is calculated as

cos(a, b) = (Σ_{k=1}^{n} a_k b_k) / (√(Σ_{k=1}^{n} a_k²) · √(Σ_{k=1}^{n} b_k²))

where x_k denotes the value of the k-th dimension of vector x.
7. The method for calculating similarity between short texts through a deep convolutional neural network of claim 1, wherein the specific process of tiling the rows and columns of the similarity matrices to the same dimensions in step 3) is as follows:
3.1) count the numbers of rows and columns of all similarity matrices and find the maximum number of rows row_max and the maximum number of columns col_max;
3.2) when tiling a matrix, arrange copies of it closely until the number of rows reaches row_max and the number of columns reaches col_max; if the required dimensions cannot be hit exactly, the redundant part is deleted.
CN201611076255.1A 2016-11-29 2016-11-29 Method for calculating similarity between short texts through deep convolutional neural network Active CN106776545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611076255.1A CN106776545B (en) 2016-11-29 2016-11-29 Method for calculating similarity between short texts through deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN106776545A CN106776545A (en) 2017-05-31
CN106776545B true CN106776545B (en) 2019-12-24

Family

ID=58900932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611076255.1A Active CN106776545B (en) 2016-11-29 2016-11-29 Method for calculating similarity between short texts through deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN106776545B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Text method of calibration and device based on artificial intelligence
CN107797985B (en) * 2017-09-27 2022-02-25 百度在线网络技术(北京)有限公司 Method and device for establishing synonymous identification model and identifying synonymous text
CN107730002B (en) * 2017-10-13 2020-06-02 国网湖南省电力公司 Intelligent fuzzy comparison method for remote control parameters of communication gateway machine
CN107729509B (en) * 2017-10-23 2020-07-07 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN108021555A (en) * 2017-11-21 2018-05-11 浪潮金融信息技术有限公司 A kind of Question sentence parsing measure based on depth convolutional neural networks
CN108052588B (en) * 2017-12-11 2021-03-26 浙江大学城市学院 Method for constructing automatic document question-answering system based on convolutional neural network
CN109063744B (en) * 2018-07-06 2020-11-06 龙马智芯(珠海横琴)科技有限公司 Neural network model training method and business document similarity determining method and system
CN109145974B (en) * 2018-08-13 2022-06-24 广东工业大学 Multilevel image feature fusion method based on image-text matching
CN110991161B (en) * 2018-09-30 2023-04-18 北京国双科技有限公司 Similar text determination method, neural network model obtaining method and related device
CN111292741B (en) * 2019-12-31 2023-04-18 重庆和贯科技有限公司 Intelligent voice interaction robot
CN112749539B (en) * 2020-01-20 2023-09-15 腾讯科技(深圳)有限公司 Text matching method, text matching device, computer readable storage medium and computer equipment
CN111159416B (en) * 2020-04-02 2020-07-17 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN113626601A (en) * 2021-08-18 2021-11-09 西安理工大学 Cross-domain text classification method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
KR101623860B1 (en) * 2015-04-08 2016-05-24 서울시립대학교 산학협력단 Method for calculating similarity between document elements
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolution neutral network
CN105426354A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Sentence vector fusion method and apparatus

Also Published As

Publication number Publication date
CN106776545A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106776545B (en) Method for calculating similarity between short texts through deep convolutional neural network
CN107562792B (en) question-answer matching method based on deep learning
Gong et al. Efficient training of bert by progressively stacking
CN111538908B (en) Search ranking method and device, computer equipment and storage medium
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN111125334B (en) Search question-answering system based on pre-training
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN111831789B (en) Question-answering text matching method based on multi-layer semantic feature extraction structure
CN110765240B (en) Semantic matching evaluation method for multi-phase sentence pairs
CN104615767A (en) Searching-ranking model training method and device and search processing method
Wu et al. Learning of multimodal representations with random walks on the click graph
CN110232122A (en) A kind of Chinese Question Classification method based on text error correction and neural network
US20220277038A1 (en) Image search based on combined local and global information
CN111475617A (en) Event body extraction method and device and storage medium
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
CN110516070A (en) A kind of Chinese Question Classification method based on text error correction and neural network
CN112417119A (en) Open domain question-answer prediction method based on deep learning
Li et al. LSTM-based deep learning models for answer ranking
CN111079011A (en) Deep learning-based information recommendation method
Verma et al. HARENDRAKV at VQA-Med 2020: Sequential VQA with Attention for Medical Visual Question Answering.
CN110851584B (en) Legal provision accurate recommendation system and method
Nikhil et al. Content based document recommender using deep learning
CN113220864B (en) Intelligent question-answering data processing system
CN110287396A (en) Text matching technique and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant