CN111813927A - Sentence similarity calculation method based on topic model and LSTM

Info

Publication number
CN111813927A
Authority
CN
China
Prior art keywords
sentence
word
vector
topic
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910292541.9A
Other languages
Chinese (zh)
Inventor
曹秀亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Potevio Information Technology Co Ltd
Original Assignee
Potevio Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Potevio Information Technology Co Ltd filed Critical Potevio Information Technology Co Ltd
Priority to CN201910292541.9A priority Critical patent/CN111813927A/en
Publication of CN111813927A publication Critical patent/CN111813927A/en
Withdrawn legal-status Critical Current

Classifications

    • G06F16/35 Information retrieval of unstructured textual data; clustering; classification
    • G06F16/3329 Information retrieval; querying; natural language query formulation or dialogue systems
    • G06F16/3344 Information retrieval; query execution using natural language analysis
    • G06N3/045 Computing arrangements based on biological models; neural networks; combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a sentence similarity calculation method based on a topic model and LSTM, which comprises the following steps: each of the two sentences is processed as follows: a corresponding word vector and topic vector are generated from the sentence, the word vector and the topic vector are fused to obtain a fused vector, and the fused vector is taken as the input of an LSTM layer to obtain the corresponding LSTM output. The LSTM outputs of the two sentences are then taken as the input of a fully connected layer, and the similarity of the two sentences is obtained after Dropout and regularization. The application also discloses a corresponding system. Applying the disclosed technical scheme improves the accuracy of sentence similarity calculation.

Description

Sentence similarity calculation method based on topic model and LSTM
Technical Field
The application relates to the technical field of sentence matching, and in particular to a sentence similarity calculation method based on a topic model and a long short-term memory network (LSTM).
Background
An intelligent question-answering system mainly comprises three modules: question understanding, information retrieval, and answer extraction. In practice, to sidestep the difficulty of full natural language understanding, question matching can achieve good results when question-answer pairs are available. Question matching matches the natural language question input by a user against the questions of the question-answer pairs in the system; the matched question then yields the answer.
Question matching requires sentence similarity calculation, i.e., computing the similarity between two sentences with natural language processing techniques. Because questions are generally short, the task falls into the category of short-text similarity calculation, and methods from that area can be drawn upon.
In the prior art, the most widely applied methods take a "bag of words" as the basic unit, without considering the complete semantics expressed by the whole sentence. The vector space model is the most common question similarity model: it computes the weight of each word by TF-IDF, uses these weights as the components of sentence vectors, measures the distance between vectors by their cosine, and finally outputs the sentence similarity. The specific process is as follows:
First, a question is converted into individual feature words through the bag-of-words model, and the TF-IDF value of each feature word is obtained statistically. The TF (Term Frequency) value is the number of times the feature word appears in the question; if a feature word appears twice in question 1, its TF value is 2. The IDF (Inverse Document Frequency) value is determined by the number of questions in which the feature word occurs. For example, if the feature word occurs in 10 questions in total, then for question 1 the IDF value of the feature word is:
IDF = log(N / df_t)

where N is the number of questions in the question library, and df_t = 10 is the number of questions in which the feature word appears.
The TF-IDF value is the product of the TF value and the IDF value. TF-IDF is a statistically based weighting scheme, and practical tests show it to be an effective measure of feature-term weight when the global text collection contains sufficient corpus features.
Through TF-IDF, each sentence can thus be turned into a vector, and the cosine between two such vectors finally gives the similarity of the two sentences.
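For concreteness, the following is a minimal sketch of this TF-IDF baseline (the prior-art method, not the invention) using scikit-learn; the sample questions are illustrative placeholders.

```python
# Sketch of the TF-IDF baseline described above: each question becomes a
# TF-IDF weighted vector, and similarity is the cosine between the vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "how do I reset my account password",    # illustrative question 1
    "how can my account password be reset",  # illustrative question 2
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(questions)  # rows are question vectors

sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity = {sim:.3f}")
```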
However, this traditional bag-of-words approach to sentence similarity loses information: it ignores the semantic content of the questions, judges similarity simply by edit distance and vocabulary matching, and has low robustness. Performing similarity analysis purely with the word vectors of single words has a clear disadvantage: word order is not considered and the word-vector distinctions are blurred. Consider the following two sentences:
1. Computing, by the above method, the similarity between "got into high school after graduating from elementary school" and "got into elementary school after graduating from high school" yields 1, yet the two sentences clearly express opposite meanings.
2. Computing the similarity between "the work is hard" and "the work is easy" by this method likely gives a very high score, yet the two phrases mean exactly the opposite of each other.
Disclosure of Invention
The application provides a sentence similarity calculation method and system based on a topic model and an LSTM, so as to improve the accuracy of sentence similarity calculation.
The application discloses a sentence similarity calculation method based on a topic model and LSTM, which comprises the following steps:
the following processing is performed on the two sentences respectively: generating corresponding word vectors and topic vectors according to sentences, fusing the word vectors and the topic vectors to obtain fused vectors, and taking the fused vectors as the input of an LSTM layer to obtain corresponding LSTM output;
and taking the LSTM outputs of the two sentences as the input of a fully connected layer, and obtaining the similarity of the two sentences after Dropout (randomly discarding a portion of the neuron units) and regularization.
Preferably, generating the corresponding word vector from the sentence includes:
reading in a sentence, performing word segmentation, and numbering each word;
generating a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
saving the word numbers and the word vectors.
Preferably, generating the corresponding topic vector from the sentence includes:
inputting the word vectors of the sentence into a SentenceLDA model to obtain the probability distribution θ of the sentence over the topics;
converting the probability distribution θ of the sentence over the topics into a corresponding topic vector through a linear transformation.
Preferably, when the topic of the sentence is extracted, the number of word clusters is specified, clustering is performed in parallel, and the words in the sentence are expanded with similar words.
The application also discloses a sentence similarity calculation system based on a topic model and LSTM, comprising: a word vector processing module, a topic vector processing module, an LSTM layer, a fully connected layer, a Dropout module and a regularization module, wherein:
the word vector processing module, the topic vector processing module and the LSTM layer perform the following processing on each of the two sentences:
the word vector processing module generates the corresponding word vector from the sentence;
the topic vector processing module obtains the corresponding topic vector from the word vector;
the LSTM layer takes the vector obtained by fusing the word vector and the topic vector as its input and produces the corresponding LSTM output;
the fully connected layer processes the LSTM outputs of the two sentences, and the processing result passes through the Dropout module and the regularization module to yield the similarity of the two sentences.
Preferably, the word vector processing module is specifically configured to:
read in a sentence, perform word segmentation, and number each word;
generate a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
save the word numbers and the word vectors.
Preferably, the topic vector processing module is specifically configured to:
input the word vectors of the sentence into a SentenceLDA model to obtain the probability distribution θ of the sentence over the topics;
convert the probability distribution θ of the sentence over the topics into a corresponding topic vector through a linear transformation.
Preferably, when the topic of the sentence is extracted, the number of word clusters is specified, clustering is performed in parallel, and the words in the sentence are expanded with similar words.
According to the above technical scheme, a question is an irregular short text whose grammatical structure is not necessarily standard, and representing it as vectors effectively sidesteps this problem. Sentence similarity is calculated by combining the improved topic model with LSTM, so the system attends to semantic information while also matching sentences with the help of word vectors. Compared with the traditional TF-IDF algorithm, the invention can understand the subject of a question and the associations between contextual words.
The sentence similarity calculation method and system based on a topic model and LSTM provided by the invention build on the original SentenceLDA topic model to obtain a simple, efficient and extensible topic model. An improved clustering algorithm expands the words in a sentence, which is equivalent to supplying background knowledge for those words and substantially enriches the semantic information of the sentence. In addition, the method borrows the idea of convolutional neural networks to combine the topic vector with the word vectors of the sentence before feeding them into the LSTM network, effectively improving the accuracy of sentence similarity calculation.
Drawings
FIG. 1 is a schematic diagram of a network architecture of a question matching system based on a topic model and LSTM according to the present invention;
FIG. 2 is a schematic diagram of the SentenceLDA model;
FIG. 3 is a schematic diagram of a generation algorithm for word and topic collections;
FIG. 4 is a schematic diagram illustrating the principle of sentence similarity calculation based on topic model and LSTM according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below by referring to the accompanying drawings and examples.
The invention establishes a sentence matching system based on a topic model and LSTM; the network architecture is shown in FIG. 1:
first, the following processing is performed on two sentences, respectively: generating corresponding word vectors and topic vectors according to sentences, fusing the word vectors and the topic vectors to obtain fused vectors, and taking the fused vectors as the input of an LSTM layer to obtain corresponding LSTM output;
then, the LSTM outputs of the two sentences are used as the input of the fully connected layer, and the similarity of the two sentences is obtained after Dropout and regularization. Here, Dropout randomly discards a portion of the neuron units in order to prevent overfitting.
The technical scheme of the invention is further explained in detail in the following sections.
First, generating word vectors
To address the word-order problem of sentences, the invention re-represents sentences with an LSTM (Long Short-Term Memory network) and measures sentence similarity through several fully connected layers.
An LSTM is a recurrent neural network over time. Natural language cannot be fed to a neural network directly, so sentences must first be encoded into the corresponding word vectors; this process corresponds to embedding-layer-1 and embedding-layer-2 in FIG. 1.
The process of generating word vectors comprises the following steps:
1. reading in a sentence, performing word segmentation, and numbering each word;
2. generating a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
3. saving the word numbers to file and saving the word vectors, for convenient use at prediction time.
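As an illustration, here is a minimal sketch of steps 1-3; the whitespace tokenizer stands in for a real word segmenter, and MAX_LEN and the sample sentences are assumptions, not values from the patent.

```python
# Hedged sketch of steps 1-3: segment each sentence, number each word,
# and zero-pad the number vectors to a fixed length.
MAX_LEN = 10  # assumed fixed sentence length

def build_vocab(sentences):
    """Assign each word a number; 0 is reserved for padding."""
    vocab = {}
    for sent in sentences:
        for word in sent.split():  # stand-in for a real word segmenter
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a fixed-length number vector."""
    ids = [vocab.get(w, 0) for w in sentence.split()][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))  # zero-fill the remaining positions

sentences = ["I want you", "you want me"]
vocab = build_vocab(sentences)
print([encode(s, vocab) for s in sentences])
# The vocabulary and encoded vectors would then be saved for prediction time.
```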
Secondly, generating a topic vector
The invention adopts the SentenceLDA topic model to extract question topics. The same word appears with different probabilities under different topic backgrounds, and the same topic occurs with different probabilities in different sentences.
SentenceLDA is an extension of LDA (Latent Dirichlet Allocation); its goal is to overcome the limitation of data sparsity by incorporating text structure into the generation and inference process. LDA and SentenceLDA differ in that the latter assumes strong latent topic dependencies between the words of a sentence, while the former essentially assumes independence between the words of a sentence.
Deep-learning approaches to vectorizing sentences use the co-occurrence information of local context words: for example, the first n-1 words are used to predict the next word, essentially exploiting word co-occurrence within a window of n words. The main idea here, by contrast, is to use global topic information to predict the probability of a word occurring in a sentence.
Training the SentenceLDA yields the probability distribution θ of a sentence over the topics (the "topic distribution" for short); the topic distribution θ is then converted into the corresponding topic vector through a linear transformation.
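The sketch below shows this linear conversion with NumPy; the number of topics, the target dimension, and the random stand-ins for θ and the projection matrix are assumptions (in practice the projection would be learned or fixed alongside the rest of the model).

```python
# Hedged sketch: linearly transform the SentenceLDA topic distribution
# theta into a topic vector of the same dimension as the word vectors.
import numpy as np

K = 4        # number of topics (assumed)
EMB_DIM = 8  # word-vector dimension to match (assumed)

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(K))   # stand-in for the SentenceLDA output
W = rng.normal(size=(K, EMB_DIM))   # linear transformation matrix

topic_vector = theta @ W            # shape: (EMB_DIM,)
print(topic_vector.round(3))
```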
The SentenceLDA model is shown in FIG. 2:
In FIG. 2, K represents the number of topics, D the number of sentences in the corpus, N the number of words in a sentence, and S the number of words under a given topic.
The probability formula of the SentenceLDA model is as follows:
p(w, z | α, β) = p(w | z, β) p(z | α)
where w is a word, z is a topic, and α and β are the Dirichlet prior parameters governing, respectively, the topic distribution and the per-topic word distribution. The correspondence between topics and words is then obtained from the topic distribution under α and the word distribution under β.
Determining the topic distribution θ of a sentence relies on the correspondence between words and topics, which is obtained through training. The generation algorithm for the set of words and topics is shown in FIG. 3 and comprises the following steps:
step 1: the document collection is processed.
Step 2: and judging whether all the documents are selected, if not, continuing to execute the 3 rd step and the 4 th step, and if all the documents are selected, ending the flow.
And 3, step 3: the number of sample sentences in the selected document belongs to the execution 5 step of the poisson distribution.
And 4, step 4: the sample topic of the selection blend is the execution 5 th step belonging to the dirichlet distribution.
And 5, step 5: and judging whether all sentences are selected, if so, returning to the step 1, and otherwise, continuing the steps 6 and 7.
And 6, step 6: the selection of the number of sample words is step 8 of the execution belonging to the poisson distribution.
And 7, step 7: the select sample topic is step 8 of execution belonging to a polynomial distribution.
And 8, step 8: and judging whether all the words are selected, if so, returning to the step 3, and if not, continuing to execute the step 9.
Step 9: sample words belonging to a polynomial distribution are retained.
Step 10: and storing the corresponding relation between the words and the topics, and ending the process.
A topic model runs into the problem of sparse information when abstracting the topic of a sentence; therefore, the invention adopts an improved clustering algorithm to expand the words in the sentence, using similar words to alleviate the sparsity.
The clustering algorithm is an agglomerative hierarchical clustering algorithm, i.e., a bottom-up method. Existing agglomerative hierarchical clustering is inefficient, so the invention improves it with a distributed approach: the number of word clusters is specified in advance and clustering proceeds in parallel. In actual training, the word clusters are preferably divided into 6 top-level categories (person, place, number, time, entity, unknown), further refined into 30 subclasses.
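A hedged sketch of this expansion idea with scikit-learn's agglomerative (bottom-up) clustering follows; the random word vectors, the cluster count of 3, and the tiny word list are placeholders rather than the 6-category/30-subclass setup described above.

```python
# Hedged sketch: cluster word vectors bottom-up with a fixed number of
# clusters, then expand a word with the similar words in its cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
words = ["beijing", "shanghai", "monday", "tuesday", "two", "three"]
vectors = rng.normal(size=(len(words), 8))   # stand-in word vectors

labels = AgglomerativeClustering(n_clusters=3).fit_predict(vectors)

def expand(word):
    """Return the other words sharing this word's cluster."""
    cluster = labels[words.index(word)]
    return [w for w, l in zip(words, labels) if l == cluster and w != word]

print(expand("beijing"))
```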
Third, LSTM layer and subsequent processing
After a sentence is encoded into its word vectors, the word-vector mapping of the words in the sentence is prepared as the input of the LSTM layer. The neural network trains on the data with a simple single-layer LSTM and a fully connected layer; the structure of the network is shown in FIG. 4:
the section first encodes and maps the input sentences (as shown, "i want you" and "you want me") into a corresponding word vector list (as shown x1x2x3 in LSTM-a and x1x2x3 in LSTM _ b), and combines the word vectors with the topic vectors using the idea of convolutional neural networks, resulting in hidden vectors h1h2h 3. By innovatively adopting the combination mode, the invention can better fuse the relationship between the theme and the words, and finally, the LSTM layer outputs corresponding to two sentences are respectively obtained according to the hidden layer vector, such as y1 and y2 shown in FIG. 4. Then, the outputs of the two LSTMs are spliced and used as the input of the full connection layer, and the result is finally output after Dropout and Batchnormalization regularization. The sentence similarity calculation model can be obtained by training sentences in the training sentence set.
When computing sentence similarity, the method adds the topic vector, which is the probability distribution of the sentence over the topics. This implicitly re-weights words according to the sentence's topic and thereby improves the LSTM's judgment of sentence similarity: the semantic information of a sentence is fully considered and combined with its word-level information, so accuracy is greatly improved over traditional methods.
Based on the network architecture shown in FIG. 1, the sentence similarity calculation system based on the topic model and LSTM of the present invention comprises the following processing modules: a word vector processing module, a topic vector processing module, an LSTM layer, a fully connected layer, a Dropout module and a regularization module, wherein:
the word vector processing module, the topic vector processing module and the LSTM layer perform the following processing on each of the two sentences:
the word vector processing module generates the corresponding word vector from the sentence;
the topic vector processing module obtains the corresponding topic vector from the word vector;
the LSTM layer takes the vector obtained by fusing the word vector and the topic vector as its input and produces the corresponding LSTM output;
the fully connected layer processes the LSTM outputs of the two sentences, and the processing result passes through the Dropout module and the regularization module to yield the similarity of the two sentences.
Preferably, the word vector processing module is specifically configured to:
read in a sentence, perform word segmentation, and number each word;
generate a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
save the word numbers and the word vectors.
Preferably, the topic vector processing module is specifically configured to:
input the word vectors of the sentence into a SentenceLDA model to obtain the probability distribution θ of the sentence over the topics;
convert the probability distribution θ of the sentence over the topics into a corresponding topic vector through a linear transformation.
Preferably, when the topic of the sentence is extracted, the number of word clusters is specified, clustering is performed in parallel, and the words in the sentence are expanded with similar words.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (8)

1. A sentence similarity calculation method based on a topic model and LSTM is characterized by comprising the following steps:
the following processing is performed on the two sentences respectively: generating corresponding word vectors and topic vectors according to sentences, fusing the word vectors and the topic vectors to obtain fused vectors, and taking the fused vectors as the input of an LSTM layer to obtain corresponding LSTM output;
and taking the LSTM outputs of the two sentences as the input of a fully connected layer, and obtaining the similarity of the two sentences after Dropout (randomly discarding a portion of the neuron units) and regularization.
2. The method of claim 1, wherein generating a corresponding word vector from a sentence comprises:
reading in a sentence, performing word segmentation, and numbering each word;
generating a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
saving the word numbers and the word vectors.
3. The method of claim 2, wherein generating the corresponding topic vector from the sentence comprises:
inputting the word vectors of the sentence into a SentenceLDA model to obtain the probability distribution θ of the sentence over the topics;
converting the probability distribution θ of the sentence over the topics into a corresponding topic vector through a linear transformation.
4. A method according to any one of claims 1 to 3, characterized in that:
when the topic of a sentence is extracted, the number of word clusters is specified, clustering is performed in parallel, and the words in the sentence are expanded with similar words.
5. A system for calculating sentence similarity based on a topic model and LSTM, comprising: a word vector processing module, a topic vector processing module, an LSTM layer, a fully connected layer, a Dropout module and a regularization module, wherein:
the word vector processing module, the topic vector processing module and the LSTM layer perform the following processing on each of the two sentences:
the word vector processing module generates the corresponding word vector from the sentence;
the topic vector processing module obtains the corresponding topic vector from the word vector;
the LSTM layer takes the vector obtained by fusing the word vector and the topic vector as its input and produces the corresponding LSTM output;
the fully connected layer processes the LSTM outputs of the two sentences, and the processing result passes through the Dropout module and the regularization module to yield the similarity of the two sentences.
6. The system of claim 5, wherein the word vector processing module is specifically configured to:
read in a sentence, perform word segmentation, and number each word;
generate a number vector for each sentence from the word numbers, where each sentence takes a fixed length and the remaining positions are filled with zeros;
save the word numbers and the word vectors.
7. The system of claim 6, wherein the topic vector processing module is specifically configured to:
input the word vectors of the sentence into a SentenceLDA model to obtain the probability distribution θ of the sentence over the topics;
convert the probability distribution θ of the sentence over the topics into a corresponding topic vector through a linear transformation.
8. The system according to any one of claims 5 to 7, wherein:
when the topic of a sentence is extracted, the number of word clusters is specified, clustering is performed in parallel, and the words in the sentence are expanded with similar words.
CN201910292541.9A 2019-04-12 2019-04-12 Sentence similarity calculation method based on topic model and LSTM Withdrawn CN111813927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910292541.9A CN111813927A (en) 2019-04-12 2019-04-12 Sentence similarity calculation method based on topic model and LSTM


Publications (1)

Publication Number Publication Date
CN111813927A true CN111813927A (en) 2020-10-23

Family

ID=72844605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910292541.9A Withdrawn CN111813927A (en) 2019-04-12 2019-04-12 Sentence similarity calculation method based on topic model and LSTM

Country Status (1)

Country Link
CN (1) CN111813927A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650836A (en) * 2020-12-28 2021-04-13 成都网安科技发展有限公司 Text analysis method and device based on syntax structure element semantics and computing terminal
CN113806486A (en) * 2021-09-23 2021-12-17 深圳市北科瑞声科技股份有限公司 Long text similarity calculation method and device, storage medium and electronic device
CN113806486B (en) * 2021-09-23 2024-05-10 深圳市北科瑞声科技股份有限公司 Method and device for calculating long text similarity, storage medium and electronic device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201023