CN111241254A - Statement similarity calculation method - Google Patents

Statement similarity calculation method

Info

Publication number
CN111241254A
CN111241254A
Authority
CN
China
Prior art keywords
word
sentence
calculating
statement
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911357252.9A
Other languages
Chinese (zh)
Inventor
陈旋
王冲
崇传兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Aijia Household Products Co Ltd
Original Assignee
Jiangsu Aijia Household Products Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Aijia Household Products Co Ltd
Priority to CN201911357252.9A priority Critical patent/CN111241254A/en
Publication of CN111241254A publication Critical patent/CN111241254A/en
Withdrawn legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/33 — Querying
    • G06F16/332 — Query formulation
    • G06F16/3329 — Natural language query formulation or dialogue systems
    • G06F16/3331 — Query processing
    • G06F16/334 — Query execution
    • G06F16/3344 — Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for calculating sentence similarity, relating to the technical field of sentence analysis. The method specifically comprises the following steps: preparing a data set and preprocessing each sentence in it; training word vectors on the preprocessed sentence set; calculating the similarity of the word vectors of two sentences; performing hierarchical clustering on the sentence vectors to obtain a knowledge entry tree; and performing knowledge recommendation for a target sentence. Through clustering, the invention avoids linear similarity comparison, narrows the comparison range, and improves performance; it preserves the semantic features of sentences while improving performance, avoiding simple character-level retrieval.

Description

Statement similarity calculation method
Technical Field
The invention relates to the technical field of sentence analysis, and in particular to a sentence similarity calculation method.
Background
A question answering system (QA) is a high-level form of information retrieval system that can answer questions posed by users in natural language with accurate and concise natural language. In a simple implementation, a question answering system is built on a knowledge base: an answer is recommended by matching the most similar stored question. How can the most similar question be found among a massive number of knowledge entries? Traditionally this is done either by one-by-one linear comparison, which grows slower as the number of knowledge entries increases, or by first matching a subset of candidates with a search engine and then comparing linearly; but a search engine can only match sentences that share the same words, and often misses differently worded sentences with the same semantics.
Disclosure of Invention
The invention provides a method for calculating sentence similarity that narrows the range of knowledge entries considered for similarity calculation through hierarchical clustering; and because the clustering operates on sentence features, sentences with the same semantics but different wordings are not overlooked.
The invention adopts the following technical scheme for solving the technical problems:
A method for calculating sentence similarity, specifically comprising the following steps:
Step 1, prepare a data set Q = {Q_1, Q_2, Q_3, ..., Q_i}, where each Q_i is a sentence;
step2, preprocessing each statement of the data set;
step3, training the preprocessed sentence set to obtain word vectors;
step4, calculating the similarity of the word vectors of the two sentences;
Step 5, perform hierarchical clustering using the sentence vectors H_i of the sentences Q_i to obtain a knowledge entry tree;
Step 6, perform knowledge recommendation for the target sentence.
As a further preferred scheme of the sentence similarity calculation method, step 2, preprocessing each sentence of the data set, specifically comprises the following steps:
Step 2.1, word segmentation: perform word segmentation on each sentence Q_i;
Step 2.2, stop words: remove words that carry little meaning;
Step 2.3, remove digits, Chinese and English punctuation marks, and other meaningless non-Chinese symbols, such as %, ¥, and !;
Step 2.4, word filtering: filter the word segmentation result of each sentence Q_i to obtain a result set S = {S_1, S_2, S_3, ..., S_i}, where each S_i is a word;
For a word's occurrence count over all sentences Q_i, define a minimum occurrence count min and a maximum occurrence count max, and eliminate words occurring fewer than min or more than max times. Words occurring more than max times are too common to be representative, and words occurring fewer than min times are so rare that their overly strong features would distort decisions; only words whose occurrence count lies in (min, max) are retained;
Step 2.5, the filtering result may contain repeated words; removing the duplicates yields the de-duplicated word result set T of Q_i:
T = {T_1, T_2, T_3, ..., T_i}
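A minimal sketch of steps 2.1–2.5 follows. It assumes the jieba library for Chinese word segmentation; the stop-word list, symbol set, and min/max thresholds are illustrative placeholders, not values fixed by the patent.

```python
from collections import Counter

import jieba  # assumed segmentation library; any Chinese tokenizer would do

STOP_WORDS = {"的", "了", "吗", "呢"}   # hypothetical stop-word list (step 2.2)
SYMBOLS = set("%¥!，。！？、.,?")        # symbols to strip (step 2.3)

def preprocess(sentences, min_count=0, max_count=1000):
    # Step 2.1: segment each sentence Q_i into words.
    segmented = [jieba.lcut(q) for q in sentences]
    # Steps 2.2-2.3: drop stop words, digits, punctuation, and whitespace.
    segmented = [[w for w in words
                  if w not in STOP_WORDS and w not in SYMBOLS
                  and not w.isdigit() and not w.isspace()]
                 for words in segmented]
    # Step 2.4: keep only words whose corpus-wide count lies in (min, max).
    counts = Counter(w for words in segmented for w in words)
    filtered = [[w for w in words if min_count < counts[w] < max_count]
                for words in segmented]
    # Step 2.5: de-duplicate words within each sentence (result set T).
    return [list(dict.fromkeys(words)) for words in filtered]
```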
As a further preferred scheme of the sentence similarity calculation method, step 3 is specifically as follows:
Compute TF-IDF word vectors for all words T_i:
Step 3.1, the basic formula for the term frequency (TF) of word T_i in a single sentence Q_i is:
TF(T_i) = count(T_i) / count(S_i)
where count(T_i) is the number of occurrences of word T_i in the word segmentation result of sentence Q_i, and count(S_i) is the total number of words in the word segmentation result of sentence Q_i; calculate the term frequency of each word T_i in the word segmentation result of sentence Q_i;
Step 3.2, the basic formula for the inverse document frequency (IDF) of word T_i in a single sentence Q_i is:
IDF(T_i) = log(N / N(T_i))
where N is the total number of sentences Q_i and N(T_i) is the number of sentences Q_i containing word T_i; calculate the IDF value of each word T_i in the word segmentation result of sentence Q_i;
Step 3.3, calculate the TF-IDF value of a word by the formula:
TF-IDF(T_i) = TF(T_i) × IDF(T_i)
Step 3.4, calculate the TF-IDF value of each word T_i in the word segmentation result set of each sentence Q_i, and assemble these TF-IDF values into a one-dimensional vector, namely the sentence vector H_i of sentence Q_i.
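A sketch of steps 3.1–3.4, assuming the sentence vectors are laid out over a shared global vocabulary so that different H_i are comparable (the vocabulary ordering is an implementation choice the patent leaves open):

```python
import math

def tfidf_vectors(token_lists):
    # Shared vocabulary; its ordering fixes the vector dimensions.
    vocab = sorted({w for words in token_lists for w in words})
    n = len(token_lists)                                  # N: number of sentences
    # N(T_i): number of sentences containing each word.
    doc_freq = {w: sum(w in words for words in token_lists) for w in vocab}
    vectors = []
    for words in token_lists:
        total = len(words) or 1                           # count(S_i)
        vec = []
        for w in vocab:
            tf = words.count(w) / total                   # step 3.1
            idf = math.log(n / doc_freq[w])               # step 3.2
            vec.append(tf * idf)                          # step 3.3
        vectors.append(vec)                               # sentence vector H_i
    return vocab, vectors
```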
As a further preferred scheme of the sentence similarity calculation method, in step 4 the cosine similarity formula is used to calculate the similarity of the word vectors of two sentences; for two sentence vectors H_a and H_b:
cos(H_a, H_b) = (H_a · H_b) / (‖H_a‖ × ‖H_b‖)
The larger the cosine similarity value, the more similar the two sentences.
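A direct transcription of the cosine similarity formula above (the guard against zero-length vectors is a defensive addition, not part of the patent):

```python
import math

def cosine(a, b):
    # cos(H_a, H_b) = (H_a . H_b) / (|H_a| * |H_b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```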
As a further preferred scheme of the sentence similarity calculation method, step 5 is specifically as follows:
Step 5.1, treat the sentence vector H_i of each sentence Q_i as its own class;
Step 5.2, find the two classes with the greatest inter-class similarity and merge them into one class, reducing the total number of classes by one;
If a class contains only a single knowledge entry, calculate the cosine similarity directly, and take the average vector of the two merged entries as the sentence vector of the class node;
If a class contains subclasses, compute average sentence vectors upward from the smallest subclass in turn until the class is represented by a single average sentence vector, and then calculate the cosine similarity;
Step 5.3, repeat step 5.2 until only one class remains, yielding the hierarchical clustering tree.
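A sketch of step 5 as bottom-up (agglomerative) clustering. Each internal node stores the average of its two children's sentence vectors, matching the worked example later in the description; the dict-based node layout is an illustrative choice, not something the patent mandates.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_tree(vectors):
    # Step 5.1: every sentence vector starts as its own class (a leaf).
    nodes = [{"vector": v, "children": [], "index": i}
             for i, v in enumerate(vectors)]
    # Steps 5.2-5.3: repeatedly merge the two most similar classes until
    # one class (the root of the knowledge entry tree) remains.
    while len(nodes) > 1:
        _, i, j = max((cosine(a["vector"], b["vector"]), i, j)
                      for i, a in enumerate(nodes)
                      for j, b in enumerate(nodes) if i < j)
        merged = {"vector": [(x + y) / 2 for x, y in
                             zip(nodes[i]["vector"], nodes[j]["vector"])],
                  "children": [nodes[i], nodes[j]],
                  "index": None}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append(merged)
    return nodes[0]
```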
As a further preferred scheme of the sentence similarity calculation method, step 6 is specifically as follows:
Step 6.1, process the target sentence through steps 1 to 3 and assemble its word vectors into a sentence vector D;
Step 6.2, calculate the cosine similarity between D and each child node of depth 1 in the tree, and select the subtree with the maximum similarity;
Step 6.3, within that subtree, calculate the cosine similarity between D and each child node of depth 2, and again select the subtree with the maximum similarity;
Step 6.4, continue in this way until the most similar sentence is found.
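A sketch of the greedy top-down search of step 6, assuming the node layout and the cosine() helper from the clustering sketch above:

```python
def find_most_similar(root, d):
    # Steps 6.2-6.4: at each level, descend into the child subtree whose
    # average sentence vector is most similar to the query vector D.
    node = root
    while node["children"]:
        node = max(node["children"], key=lambda c: cosine(c["vector"], d))
    return node["index"]  # index of the most similar knowledge entry
```

Because each step discards all but one subtree, the number of comparisons grows with the tree depth rather than with the total number of knowledge entries, which is the source of the performance gain claimed below.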
Compared with the prior art, the invention has the following technical effects:
1. Clustering avoids linear similarity comparison, narrows the comparison range, and improves performance;
2. The semantic features of sentences are preserved while performance improves, avoiding simple character-level retrieval;
3. The amount of computation is greatly reduced, saving server resources and lowering costs;
4. The sentence-vector hierarchical clustering tree is constructed in advance, so the time-consuming work is done ahead of time and real-time retrieval performance improves.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of the hierarchical clustering operation of the present invention;
FIG. 3 is a schematic diagram of a hierarchical clustering tree in accordance with the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the attached drawings:
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of them; all other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of the present invention.
As shown in FIG. 1, a method for calculating sentence similarity specifically comprises the following steps:
Step 1, prepare a data set Q = {Q_1, Q_2, Q_3, ..., Q_i}, where each Q_i is a sentence;
Step 2, preprocess each sentence of the data set;
Step 3, train the preprocessed sentence set to obtain word vectors;
Step 4, calculate the similarity of the word vectors of two sentences;
Step 5, perform hierarchical clustering using the sentence vectors H_i of the sentences Q_i to obtain a knowledge entry tree;
Step 6, perform knowledge recommendation for the target sentence.
As a further preferred scheme of the sentence similarity calculation method, step 2, preprocessing each sentence of the data set, specifically comprises the following steps:
Step 2.1, word segmentation: perform word segmentation on each sentence Q_i;
Step 2.2, stop words: remove words that carry little meaning;
Step 2.3, remove digits, Chinese and English punctuation marks, and other meaningless non-Chinese symbols, such as %, ¥, and !;
Step 2.4, word filtering: filter the word segmentation result of each sentence Q_i to obtain a result set S = {S_1, S_2, S_3, ..., S_i}, where each S_i is a word;
For a word's occurrence count over all sentences Q_i, define a minimum occurrence count min and a maximum occurrence count max, and eliminate words occurring fewer than min or more than max times. Words occurring more than max times are too common to be representative, and words occurring fewer than min times are so rare that their overly strong features would distort decisions; only words whose occurrence count lies in (min, max) are retained;
Step 2.5, the filtering result may contain repeated words; removing the duplicates yields the de-duplicated word result set T of Q_i:
T = {T_1, T_2, T_3, ..., T_i}
As a further preferred scheme of the sentence similarity calculation method, step 3 is specifically as follows:
Compute TF-IDF word vectors for all words T_i:
Step 3.1, the basic formula for the term frequency (TF) of word T_i in a single sentence Q_i is:
TF(T_i) = count(T_i) / count(S_i)
where count(T_i) is the number of occurrences of word T_i in the word segmentation result of sentence Q_i, and count(S_i) is the total number of words in the word segmentation result of sentence Q_i; calculate the term frequency of each word T_i in the word segmentation result of sentence Q_i;
Step 3.2, the basic formula for the inverse document frequency (IDF) of word T_i in a single sentence Q_i is:
IDF(T_i) = log(N / N(T_i))
where N is the total number of sentences Q_i and N(T_i) is the number of sentences Q_i containing word T_i; calculate the IDF value of each word T_i in the word segmentation result of sentence Q_i;
Step 3.3, calculate the TF-IDF value of a word by the formula:
TF-IDF(T_i) = TF(T_i) × IDF(T_i)
Step 3.4, calculate the TF-IDF value of each word T_i in the word segmentation result set of each sentence Q_i, and assemble these TF-IDF values into a one-dimensional vector, namely the sentence vector H_i of sentence Q_i.
As a further preferred scheme of the sentence similarity calculation method, in step 4 the cosine similarity formula is used to calculate the similarity of the word vectors of two sentences; for two sentence vectors H_a and H_b:
cos(H_a, H_b) = (H_a · H_b) / (‖H_a‖ × ‖H_b‖)
The larger the cosine similarity value, the more similar the two sentences.
As a further preferred scheme of the sentence similarity calculation method, step 5 is specifically as follows:
Step 5.1, treat the sentence vector H_i of each sentence Q_i as its own class;
Step 5.2, find the two classes with the greatest inter-class similarity and merge them into one class, reducing the total number of classes by one;
If a class contains only a single knowledge entry, calculate the cosine similarity directly, and take the average vector of the two merged entries as the sentence vector of the class node;
If a class contains subclasses, compute average sentence vectors upward from the smallest subclass in turn until the class is represented by a single average sentence vector, and then calculate the cosine similarity;
Step 5.3, repeat step 5.2 until only one class remains, yielding the hierarchical clustering tree.
As a further preferred scheme of the sentence similarity calculation method, step 6 is specifically as follows:
Step 6.1, process the target sentence through steps 1 to 3 and assemble its word vectors into a sentence vector D;
Step 6.2, calculate the cosine similarity between D and each child node of depth 1 in the tree, and select the subtree with the maximum similarity;
Step 6.3, within that subtree, calculate the cosine similarity between D and each child node of depth 2, and again select the subtree with the maximum similarity;
Step 6.4, continue in this way until the most similar sentence is found.
as shown in fig. 2, a simple hierarchical clustering operation assumes A, B, C, D, E are 5 knowledge entries, corresponding to 5 sentence vectors.
Step 1: each knowledge entry is an independent class, 5 classes in total;
Step 2: the two most similar classes, A and B, are found and merged; 4 classes remain: {(A, B), C, D, E};
Step 3: the two most similar classes, D and E, are found and merged; 3 classes remain: {(A, B), C, (D, E)};
Step 4: the average sentence vector of (A, B) and of (D, E) are computed, then the pairwise similarities with C are calculated; finding that sim(C, (A, B)) > sim(C, (D, E)) > sim((A, B), (D, E)), the classes (A, B) and C are merged into a new class; 2 classes remain: {((A, B), C), (D, E)};
Step 5: only two classes remain, so they are merged directly to obtain the final class {(((A, B), C), (D, E))}.
The final hierarchical clustering tree is shown in FIG. 3:
Each cluster node (the black nodes C1–C4 in the figure, the root included) stores the average of its children's sentence vectors: C1 is the average of the sentence vectors of A and B; C2 is the average of the sentence vectors of C1 and C; and so on.
1) Perform knowledge recommendation for the target sentence:
a) process the target sentence through steps 1 to 3 to obtain a sentence vector D;
b) in the tree, calculate the cosine similarity between D and each child node of depth 1, and select the subtree with the maximum similarity;
c) within that subtree, calculate the cosine similarity between D and each child node of depth 2, and select the subtree with the maximum similarity;
d) and so on, until the most similar sentence is found.
The detailed implementation of the invention comprises the following steps:
1. Prepare the data set:
A. Where are you;
B. May I ask what the question is;
C. Beijing is very beautiful;
D. This leather boot's size is large. That size is appropriate;
E. The signal on your side is not very good; I can't hear clearly;
2. preprocessing a data set:
a) word segmentation;
a. [ where you are ]
B. [ asking questions, what, question, etc. ]
C. [ Beijing, very beautiful ]
D. [ this, only, leather boot, number, large, ., that, only, number, proper ]
E. [ you, that side, signal, not too much, good, listen, not too much, clear ]
b) Stop words:
a. [ where you are ]
B. [ asking questions, what, question, etc. ]
C. [ Beijing, beauty ]
D. [ leather boot, number, large, number, proper ]
E. [ signal, less, good, listen, less, clear ]
c) And (3) word filtering:
a. [ where you are ]
B. [ asking questions, what, question, etc. ]
C. [ Beijing, beauty ]
D. [ leather boot, number, large, number, proper ]
E. [ signal, less, good, listen, less, clear ]
3. Word vector training:
Train the data set from step 2 into sentence vectors according to the TF-IDF word vector training method described above, and construct the hierarchical clustering tree.
4. Find similar sentences for the target sentence:
For example, we need to find sentences similar to "This leather boot's size is not small; that one is more suitable";
Word segmentation result: [ this, only, leather boot, number, not small, that, only, more suitable ];
After stop-word removal: [ leather boot, number, small, suitable ];
Calculate its sentence vector QV;
Searching the hierarchical clustering tree then finds the most similar sentence to be "This leather boot's size is large. That size is appropriate".
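The worked example can be composed end to end from the sketches above (tfidf_vectors, build_tree, and find_most_similar as defined earlier). Segmentation is skipped here: the token lists below stand in for the filtered, de-duplicated word lists A–E, glossed in English.

```python
import math

corpus = [
    ["you", "where"],                               # A
    ["ask", "what", "question"],                    # B
    ["Beijing", "beautiful"],                       # C
    ["leather boot", "number", "large", "proper"],  # D
    ["signal", "good", "hear", "clear"],            # E
]
vocab, vectors = tfidf_vectors(corpus)   # step 3: sentence vectors H_i
root = build_tree(vectors)               # step 5: knowledge entry tree

# Step 6: build the query vector QV over the same vocabulary and search.
query = ["leather boot", "number", "small", "proper"]
n = len(corpus)
qv = [(query.count(w) / len(query)) *
      math.log(n / sum(w in doc for doc in corpus))
      for w in vocab]
print("Most similar entry:", find_most_similar(root, qv))  # expected: 3 (D)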
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention. While the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A method of calculating sentence similarity, characterized in that the method specifically comprises the following steps:
step 1, preparing a data set Q = {Q_1, Q_2, Q_3, ..., Q_i}, where each Q_i is a sentence;
step 2, preprocessing each sentence of the data set;
step 3, training the preprocessed sentence set to obtain word vectors;
step 4, calculating the similarity of the word vectors of two sentences;
step 5, performing hierarchical clustering using the sentence vectors H_i of the sentences Q_i to obtain a knowledge entry tree;
step 6, performing knowledge recommendation for the target sentence.
2. The method of claim 1, characterized in that step 2, preprocessing each sentence of the data set, specifically comprises the following steps:
step 2.1, word segmentation: performing word segmentation on each sentence Q_i;
step 2.2, stop words: removing words that carry little meaning;
step 2.3, removing digits, Chinese and English punctuation marks, and other meaningless non-Chinese symbols, such as %, ¥, and !;
step 2.4, word filtering: filtering the word segmentation result of each sentence Q_i to obtain a result set S = {S_1, S_2, S_3, ..., S_i}, where each S_i is a word;
for a word's occurrence count over all sentences Q_i, defining a minimum occurrence count min and a maximum occurrence count max, and eliminating words occurring fewer than min or more than max times, words occurring more than max times being too common to be representative; only words whose occurrence count lies in (min, max) are retained;
step 2.5, removing the repeated words in the filtering result to obtain the de-duplicated word result set T of Q_i:
T = {T_1, T_2, T_3, ..., T_i}
3. The method of claim 1, characterized in that step 3 is specifically as follows:
computing TF-IDF word vectors for all words T_i:
step 3.1, the basic formula for the term frequency (TF) of word T_i in a single sentence Q_i is:
TF(T_i) = count(T_i) / count(S_i)
where count(T_i) is the number of occurrences of word T_i in the word segmentation result of sentence Q_i, and count(S_i) is the total number of words in the word segmentation result of sentence Q_i; calculating the term frequency of each word T_i in the word segmentation result of sentence Q_i;
step 3.2, the basic formula for the inverse document frequency (IDF) of word T_i in a single sentence Q_i is:
IDF(T_i) = log(N / N(T_i))
where N is the total number of sentences Q_i and N(T_i) is the number of sentences Q_i containing word T_i; calculating the IDF value of each word T_i in the word segmentation result of sentence Q_i;
step 3.3, calculating the TF-IDF value of a word by the formula:
TF-IDF(T_i) = TF(T_i) × IDF(T_i)
step 3.4, calculating the TF-IDF value of each word T_i in the word segmentation result set of each sentence Q_i, and assembling these TF-IDF values into a one-dimensional vector, namely the sentence vector H_i of sentence Q_i.
4. The method of claim 1, characterized in that in step 4 the cosine similarity formula is used to calculate the similarity of the word vectors of two sentences; for two sentence vectors H_a and H_b:
cos(H_a, H_b) = (H_a · H_b) / (‖H_a‖ × ‖H_b‖)
the larger the cosine similarity value, the more similar the two sentences.
5. The method of claim 1, characterized in that step 5 is specifically as follows:
step 5.1, treating the sentence vector H_i of each sentence Q_i as its own class;
step 5.2, finding the two classes with the greatest inter-class similarity and merging them into one class, reducing the total number of classes by one;
if a class contains only a single knowledge entry, calculating the cosine similarity directly, and taking the average vector of the two merged entries as the sentence vector of the class node;
if a class contains subclasses, computing average sentence vectors upward from the smallest subclass in turn until the class is represented by a single average sentence vector, and then calculating the cosine similarity;
step 5.3, repeating step 5.2 until only one class remains, yielding the hierarchical clustering tree.
6. The method of claim 1, characterized in that step 6 is specifically as follows:
step 6.1, processing the target sentence through steps 1 to 3 and assembling its word vectors into a sentence vector D;
step 6.2, calculating the cosine similarity between D and each child node of depth 1 in the tree, and selecting the subtree with the maximum similarity;
step 6.3, within that subtree, calculating the cosine similarity between D and each child node of depth 2, and again selecting the subtree with the maximum similarity;
step 6.4, continuing until the most similar sentence is found.
CN201911357252.9A 2019-12-25 2019-12-25 Statement similarity calculation method Withdrawn CN111241254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911357252.9A CN111241254A (en) 2019-12-25 2019-12-25 Statement similarity calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911357252.9A CN111241254A (en) 2019-12-25 2019-12-25 Statement similarity calculation method

Publications (1)

Publication Number Publication Date
CN111241254A (en) 2020-06-05

Family

ID=70877560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911357252.9A Withdrawn CN111241254A (en) 2019-12-25 2019-12-25 Statement similarity calculation method

Country Status (1)

Country Link
CN (1) CN111241254A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673252A (en) * 2021-08-12 2021-11-19 之江实验室 Automatic join recommendation method for data table based on field semantics


Similar Documents

Publication Publication Date Title
CN110162593B (en) Search result processing and similarity model training method and device
CN107436864B (en) Chinese question-answer semantic similarity calculation method based on Word2Vec
CN111259653B (en) Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
CN109101479B (en) Clustering method and device for Chinese sentences
CN109960763B (en) Photography community personalized friend recommendation method based on user fine-grained photography preference
CN105183833B (en) Microblog text recommendation method and device based on user model
WO2017107566A1 (en) Retrieval method and system based on word vector similarity
CN106446148A (en) Cluster-based text duplicate checking method
CN110674252A (en) High-precision semantic search system for judicial domain
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN104408115B (en) The heterogeneous resource based on semantic interlink recommends method and apparatus on a kind of TV platform
CN111680488A (en) Cross-language entity alignment method based on knowledge graph multi-view information
CN110347796A (en) Short text similarity calculating method under vector semantic tensor space
CN112632250A (en) Question and answer method and system under multi-document scene
CN115470344A (en) Video barrage and comment theme fusion method based on text clustering
CN105956158A (en) Automatic extraction method of network neologism on the basis of mass microblog texts and use information
Mason et al. Domain-specific image captioning
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN111506726A (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN111241254A (en) Statement similarity calculation method
CN117743526A (en) Table question-answering method based on large language model and natural language processing
CN117112727A (en) Large language model fine tuning instruction set construction method suitable for cloud computing service
CN110019714A (en) More intent query method, apparatus, equipment and storage medium based on historical results
Shuai et al. Question answering system based on knowledge graph of film culture
KR20200047272A (en) Indexing system and method using variational recurrent autoencoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200605