CN111310478B - Similar sentence detection method based on TF-IDF and word vector - Google Patents
Similar sentence detection method based on TF-IDF and word vector
- Publication number
- CN111310478B (application CN202010193466.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- sentence
- sentences
- similarity
- idf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a similar sentence detection method based on TF-IDF and word vectors, belonging to the technical field of natural language processing. The invention uses TF-IDF to assign each word a weight that reflects the word's importance and its influence on the similarity of the whole sentence: the more important the word, the larger its TF-IDF value and the greater its influence on the similarity. At the same time, to address the problems that the co-occurrence-based Jaccard algorithm does not compare sentences at the semantic level and cannot handle synonyms and near-synonyms, the invention introduces word vectors so that similarity can be computed at the semantic level. Moreover, with the proposed method, replacing words in a sentence with synonyms does not affect the similarity of the two sentences, so the proposed sentence similarity detection method can be used for duplicate checking of Chinese articles, retrieval of similar documents, and the like.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a similar sentence detection method based on TF-IDF and word vectors.
Background
The basic idea of the co-occurrence-based Jaccard algorithm is: the larger the shared portion of two sentences, i.e., the more co-occurring words they have, the more similar the two sentences are. The proportion of co-occurring words relative to all words therefore provides a numerical measure of the similarity of the two sentences:
Sim(A, B) = |Inter(A, B)| / |Union(A, B)|
where Inter(A, B) denotes the word intersection of sentence A and sentence B, Union(A, B) denotes the word union of sentence A and sentence B, |·| denotes the number of elements of a set, and Sim(A, B) denotes the similarity of sentence A and sentence B.
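As a minimal illustration of this baseline (the helper function is hypothetical and not part of the patent), the co-occurrence-based Jaccard similarity of two segmented sentences can be computed as follows:

```python
def jaccard_similarity(words_a, words_b):
    """Co-occurrence-based Jaccard similarity of two segmented sentences."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Two sentences sharing 3 of 5 distinct words -> similarity 0.6.
print(jaccard_similarity(["机器", "学习", "很", "有趣"],
                         ["深度", "学习", "很", "有趣"]))
```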
The existing co-occurrence-based Jaccard algorithm has two main problems:
(1) The importance of each word is not considered.
(2) It does not compare sentences at the semantic level, so synonyms and near-synonyms cannot be handled.
When word2vec is used for natural language processing, each word is trained into a word vector by a three-layer neural network. This approach overcomes the problems that the traditional Bag-of-Words (BOW) model cannot represent contextual semantic information and suffers from the curse of dimensionality, so that semantically similar words obtain similar vector representations.
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used to evaluate the importance of a term to a document in a corpus. The TF-IDF value is obtained by multiplying the term frequency by the inverse document frequency, i.e., TF × IDF. The term frequency (TF) is the frequency with which a term t appears in a document d, and the inverse document frequency (IDF) reflects the category-discriminating power of the term t: the fewer documents contain t, the larger the IDF. TF is calculated as tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d), and IDF is calculated as idf_t = log(N / df_t), where f(t, d) denotes the number of occurrences of the term t in the document d, df_t denotes the number of documents in the corpus that contain the term t, and N denotes the total number of documents in the corpus. The TF-IDF weight of the term t is then tfidf_t = tf(t, d) × idf_t. The weight of a term t thus increases in proportion to the number of times it appears in the document, but at the same time decreases with the frequency of its occurrence across the corpus.
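A minimal sketch of these formulas, assuming the corpus is given as a list of already-segmented documents (function and variable names are illustrative, not from the patent):

```python
import math
from collections import Counter

def tf_idf_weights(document, corpus):
    """TF-IDF weight of every word in `document` (a list of words) w.r.t. `corpus`."""
    counts = Counter(document)
    total_terms = sum(counts.values())
    n_docs = len(corpus)
    weights = {}
    for term, freq in counts.items():
        tf = freq / total_terms                        # tf(t, d) = f(t, d) / sum of f(t', d)
        df = sum(1 for doc in corpus if term in doc)   # df_t: documents containing t
        idf = math.log(n_docs / df) if df else 0.0     # idf_t = log(N / df_t)
        weights[term] = tf * idf                       # tfidf_t = tf(t, d) * idf_t
    return weights

corpus = [["机器", "学习", "很", "有趣"], ["深度", "学习", "很", "有趣"], ["今天", "天气", "很", "好"]]
print(tf_idf_weights(corpus[0], corpus))  # words shared by all documents get weight 0
```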
Disclosure of Invention
The purpose of the invention is: to address the shortcomings of the co-occurrence-based Jaccard method, the way similarity between sentences is measured is improved, yielding a similar sentence detection method based on TF-IDF and word vectors, so that replacing words in a sentence with synonyms does not affect the similarity detection of the two sentences.
The invention discloses a similar sentence detection method based on TF-IDF and word vectors. For two sentences A and B to undergo similarity detection, the following steps are executed:
Step 1: generate the word vector of each word in sentence A and sentence B;
Step 2: perform word segmentation on sentences A and B and remove stop words, obtaining the segmentation results A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m);
where a_i, i = 1, 2, …, n, denotes the i-th word of sentence A, and n denotes the number of words in sentence A;
b_j, j = 1, 2, …, m, denotes the j-th word of sentence B, and m denotes the number of words in sentence B;
Step 3: calculate the TF-IDF weight of each word in sentences A and B using the term frequency–inverse document frequency (TF-IDF) method, where w_i denotes the TF-IDF weight of word a_i and w_j denotes the TF-IDF weight of word b_j;
Step 4: calculate the cosine value cos(A_i, B_j) between each word in sentence A and each word in sentence B, obtaining the word vector similarity matrix of sentences A and B;
where A_i denotes the word vector corresponding to word a_i, B_j denotes the word vector corresponding to word b_j, and the subscripts i = 1, 2, …, n and j = 1, 2, …, m;
Step 5: traverse the word vector similarity matrix of sentences A and B and compare each cosine value cos(A_i, B_j) with a preset threshold α;
if the current cosine value cos(A_i, B_j) is greater than or equal to α, words a_i and b_j are regarded as a similar part of the two sentences, and the similarity measure corresponding to cos(A_i, B_j) is calculated as w_i × w_j × cos(A_i, B_j);
if cos(A_i, B_j) is less than α, words a_i and b_j are regarded as a dissimilar part of the two sentences, and the dissimilarity measure corresponding to cos(A_i, B_j) is calculated as w_i × w_j × (1 − cos(A_i, B_j));
all similarity measures are accumulated, and the total is denoted Sum1;
all dissimilarity measures are accumulated, and the total is denoted Sum2;
Step 6: calculate the similarity of sentence A and sentence B as Sim(A, B) = Sum1 / (Sum1 + Sum2);
Step 7: compare the similarity Sim(A, B) of sentences A and B with a preset threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences; otherwise, sentences A and B are judged to be dissimilar sentences.
In summary, owing to the adoption of the above technical scheme, the beneficial effects of the invention are as follows:
the invention utilizes TF-IDF, adds weight (TF-IDF value) to each word, which is used for reflecting the importance degree of the word and the influence of the calculation of the similarity of the whole sentence, and the more important word, the larger the TF-IDF value, the larger the influence on the similarity; meanwhile, aiming at the problem that the similarity comparison of the semantic level is not related to the Jaccard algorithm based on the co-occurrence word and the synonym and the paraphrasing cannot be processed, the invention improves the word vector so that the similarity calculation can be performed on the semantic level. Meanwhile, the sentence similarity method provided by the invention can not influence the similarity of two sentences after the synonym replacement is carried out on the words in the sentences, so that the sentence similarity method provided by the invention can be used for searching the Chinese articles, searching the documents and the like.
Drawings
FIG. 1 is a general flow chart of a similar sentence detection method based on TF-IDF and word vectors.
FIG. 2 is a flowchart showing a method for calculating the similarity of two sentences according to the word vector similarity matrix.
FIG. 3 is a flowchart showing the duplicate checking process according to the similar sentence detecting method of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The invention uses TF-IDF to improve the existing co-occurrence-based Jaccard algorithm by assigning each word a weight (its TF-IDF value) that reflects the word's importance and its influence on the similarity of the whole sentence: the more important the word, the larger its TF-IDF value and the greater its influence on the similarity. To address the problems that the co-occurrence-based Jaccard algorithm does not compare sentences at the semantic level and cannot handle synonyms and near-synonyms, the invention uses word vectors so that similarity is computed at the semantic level, thereby obtaining a similar sentence detection method based on TF-IDF and word vectors. Because replacing words in a sentence with synonyms does not affect the similarity detection of the two sentences, the method can be used for retrieval of similar Chinese documents, duplicate checking of Chinese articles, and other applications involving sentence comparison.
Taking sentences A and B as examples, the similar sentence detection method based on TF-IDF and word vectors of the invention comprises the following specific implementation steps:
The first step: word2vec is used to train word vectors, obtaining the word vector of each word in the sentences.
The second step: a word segmentation tool is used to segment sentences A and B and remove stop words, so that the two sentences are represented as A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m), where n denotes the number of words in sentence A, m denotes the number of words in sentence B, a_i denotes the i-th word of sentence A, and b_j denotes the j-th word of sentence B.
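A minimal sketch of the first two steps, assuming gensim's Word2Vec implementation and the jieba segmenter are used (the patent does not prescribe particular tools; the toy corpus and stop-word list below are illustrative):

```python
import jieba
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large document collection.
raw_texts = ["机器学习很有趣", "深度学习很有趣", "今天天气很好"]
corpus = [list(jieba.cut(text)) for text in raw_texts]

# First step: train word vectors with word2vec.
model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1)

STOP_WORDS = {"的", "了", "是", "在"}  # illustrative stop-word list

def segment(sentence):
    """Second step: segment a sentence and remove stop words."""
    return [w for w in jieba.cut(sentence) if w.strip() and w not in STOP_WORDS]

words_a, words_b = segment("机器学习很有趣"), segment("深度学习很有趣")
vectors_a = [model.wv[w] for w in words_a]  # word vector of each word in sentence A
vectors_b = [model.wv[w] for w in words_b]
```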
The third step: the TF-IDF algorithm is used to obtain the TF-IDF value of each word in sentence A and sentence B.
The fourth step: the cosine of the angle between the word vector of each word in sentence A and the word vector of each word in sentence B is calculated, obtaining the word vector similarity matrix of sentences A and B, as shown in Table 1 below.
TABLE 1. Word vector similarity matrix of sentence A and sentence B
Word vector | B_1 | B_2 | …… | B_j | …… | B_m
A_1 | cos(A_1, B_1) | cos(A_1, B_2) | …… | cos(A_1, B_j) | …… | cos(A_1, B_m)
A_2 | cos(A_2, B_1) | cos(A_2, B_2) | …… | cos(A_2, B_j) | …… | cos(A_2, B_m)
…… | …… | …… | …… | …… | …… | ……
A_i | cos(A_i, B_1) | cos(A_i, B_2) | …… | cos(A_i, B_j) | …… | cos(A_i, B_m)
…… | …… | …… | …… | …… | …… | ……
A_n | cos(A_n, B_1) | cos(A_n, B_2) | …… | cos(A_n, B_j) | …… | cos(A_n, B_m)
where A_i denotes the word vector corresponding to word a_i, B_j denotes the word vector corresponding to word b_j, and cos(A_i, B_j) denotes the cosine of the angle between word vectors A_i and B_j, i.e., the similarity of word vectors A_i and B_j, with i = 1, 2, …, n and j = 1, 2, …, m.
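A short sketch of the fourth step, reusing vectors_a and vectors_b from the sketch above (numpy is assumed for the vector arithmetic):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# similarity_matrix[i][j] = cos(A_i, B_j), as in Table 1.
similarity_matrix = [[cosine(a, b) for b in vectors_b] for a in vectors_a]
```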
The fifth step: the word vector similarity matrix of sentences A and B is traversed, and each cosine value cos(A_i, B_j) is compared with the threshold α;
if the cosine value is greater than or equal to α, words a_i and b_j are regarded as a similar part of the two sentences, and this similar part is accumulated using the formula Sum1 = Sum1 + w_i * w_j * cos(A_i, B_j), where the initial value of Sum1 is 0;
if the cosine value is less than α, words a_i and b_j are regarded as a dissimilar part of the two sentences, and this dissimilar part is accumulated using the formula Sum2 = Sum2 + w_i * w_j * (1 − cos(A_i, B_j)), where the initial value of Sum2 is 0.
Here w_i and w_j denote the TF-IDF values of words a_i and b_j calculated in the third step, respectively. Preferably, the value range of α is set to [0.5, 0.9].
The sixth step: after the word vector similarity matrix has been traversed, the similarity of sentence A and sentence B is calculated from the similar part Sum1 and the dissimilar part Sum2 obtained in the fifth step, using the formula Sim(A, B) = Sum1 / (Sum1 + Sum2).
The seventh step: the calculated similarity Sim(A, B) of sentence A and sentence B is compared with the threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences, and if Sim(A, B) is less than β, sentences A and B are judged to be dissimilar sentences. Preferably, the value range of β is set to [0.7, 0.8].
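Putting the fifth to seventh steps together, a minimal sketch of the sentence-level decision could look as follows; it reuses the cosine helper from the previous sketch, assumes tfidf_weight is a dict mapping every word of both sentences to its TF-IDF weight from the third step, and uses thresholds drawn from the preferred ranges above:

```python
def sentence_similarity(words_a, words_b, vectors_a, vectors_b, tfidf_weight, alpha=0.7):
    """Fifth and sixth steps: accumulate weighted (dis)similar parts and return Sim(A, B)."""
    sum1, sum2 = 0.0, 0.0
    for a_word, a_vec in zip(words_a, vectors_a):
        for b_word, b_vec in zip(words_b, vectors_b):
            c = cosine(a_vec, b_vec)
            w = tfidf_weight[a_word] * tfidf_weight[b_word]
            if c >= alpha:
                sum1 += w * c          # Sum1 += w_i * w_j * cos(A_i, B_j)
            else:
                sum2 += w * (1.0 - c)  # Sum2 += w_i * w_j * (1 - cos(A_i, B_j))
    return sum1 / (sum1 + sum2) if (sum1 + sum2) > 0 else 0.0

# Seventh step: threshold the sentence similarity with beta.
beta = 0.75
sim = sentence_similarity(words_a, words_b, vectors_a, vectors_b, tfidf_weight, alpha=0.7)
is_similar = sim >= beta
```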
With the sentence similarity method provided by the invention, replacing words in a sentence with synonyms does not affect the similarity of the two sentences, so the similar sentence detection method of the invention can be used for duplicate checking of Chinese articles. Referring to FIG. 3, article duplicate checking based on the similar sentence detection method of the invention proceeds as follows:
(1) Split the article to be checked into sentences and record the total number of sentences S of the article to be checked.
(2) Split all articles in the article library into sentences.
(3) Compare the sentences of the article to be checked, one by one and in sequence, with the sentences of all articles in the article library.
(4) Use the similar sentence detection method of the invention to judge whether the two sentences being compared are similar; if so, increase the counter count by 1, where the initial value of count is 0.
(5) Repeat steps (3) and (4) until all sentences of the article to be checked have been compared.
(6) Calculate the repetition rate ζ of the article to be checked against the article library according to the formula ζ = count / S, and output the duplicate-check result; a compact sketch of this flow is given below.
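A compact sketch of this duplicate-checking flow; the sentence splitter and the is_similar_sentence callback (wrapping steps 1–7 of the detection method) are illustrative placeholders, and each sentence of the article is counted at most once, which is one reasonable reading of step (4):

```python
import re

def split_sentences(article):
    """Illustrative splitter on Chinese/Latin sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?]", article) if s.strip()]

def repetition_rate(article, article_library, is_similar_sentence):
    """Fraction of sentences in `article` that have a similar sentence in the library."""
    sentences = split_sentences(article)                    # S = len(sentences)
    library_sentences = [s for doc in article_library for s in split_sentences(doc)]
    count = sum(1 for s in sentences
                if any(is_similar_sentence(s, t) for t in library_sentences))
    return count / len(sentences) if sentences else 0.0     # zeta = count / S
```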
When the method is used for similar document retrieval, the document to be retrieved is first split into sentences, and each document in the document library is likewise split into sentences; the sentences of the document to be retrieved are compared with the sentences of each document in the library, and the ratio of the number of detected similar sentences to the total number of sentences of the document to be retrieved is taken as the document similarity rate; if the document similarity rate between the currently compared document and the document to be retrieved reaches a preset document similarity threshold, the currently compared document is regarded as a similar document of the document to be retrieved; finally, all similar documents of the document to be retrieved and the corresponding document similarity rates are output.
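A sketch of this retrieval variant under the same assumptions, reusing split_sentences from the previous sketch (doc_threshold stands for the preset document similarity threshold, whose value is not specified in the text):

```python
def similar_documents(query_doc, document_library, is_similar_sentence, doc_threshold=0.8):
    """Return (document, similarity rate) pairs whose rate reaches the threshold."""
    query_sentences = split_sentences(query_doc)
    results = []
    for doc in document_library:
        doc_sentences = split_sentences(doc)
        hits = sum(1 for s in query_sentences
                   if any(is_similar_sentence(s, t) for t in doc_sentences))
        rate = hits / len(query_sentences) if query_sentences else 0.0
        if rate >= doc_threshold:
            results.append((doc, rate))
    return results
```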
While the invention has been described in terms of specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the equivalent or similar purpose, unless expressly stated otherwise; all of the features disclosed, or all of the steps in a method or process, except for mutually exclusive features and/or steps, may be combined in any manner.
Claims (4)
1. A similar sentence detection method based on TF-IDF and word vectors, characterized in that the following steps are executed for sentences A and B to undergo similarity detection:
Step 1: generate the word vector of each word in sentence A and sentence B;
Step 2: perform word segmentation on sentences A and B and remove stop words, obtaining the segmentation results A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m);
where a_i, i = 1, 2, …, n, denotes the i-th word of sentence A, and n denotes the number of words in sentence A;
b_j, j = 1, 2, …, m, denotes the j-th word of sentence B, and m denotes the number of words in sentence B;
Step 3: calculate the TF-IDF weight of each word in sentences A and B using the TF-IDF method, where w_i denotes the TF-IDF weight of word a_i and w_j denotes the TF-IDF weight of word b_j;
Step 4: calculate the cosine value cos(A_i, B_j) between each word in sentence A and each word in sentence B, obtaining the word vector similarity matrix of sentences A and B; where A_i denotes the word vector corresponding to word a_i and B_j denotes the word vector corresponding to word b_j;
Step 5: traverse the word vector similarity matrix of sentences A and B and compare each cosine value cos(A_i, B_j) with a preset threshold α;
if the current cosine value cos(A_i, B_j) is greater than or equal to α, calculate the similarity measure w_i × w_j × cos(A_i, B_j) corresponding to cos(A_i, B_j);
if cos(A_i, B_j) is less than α, calculate the dissimilarity measure w_i × w_j × (1 − cos(A_i, B_j)) corresponding to cos(A_i, B_j);
accumulate all similarity measures, the total being denoted Sum1;
accumulate all dissimilarity measures, the total being denoted Sum2;
Step 6: calculate the similarity of sentence A and sentence B as Sim(A, B) = Sum1 / (Sum1 + Sum2);
Step 7: compare the similarity Sim(A, B) of sentences A and B with a preset threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences; otherwise, sentences A and B are judged to be dissimilar sentences.
2. The method of claim 1, wherein in step 1, the word vector generation tool word2vec is used to generate the word vector of each word in sentence A and sentence B.
3. The method of claim 1, wherein in step 5, the threshold α is set in a range of values: [0.5,0.9].
4. The method of claim 1, wherein in step 7, the threshold β has a value in the range of: [0.7,0.8].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010193466.3A CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010193466.3A CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111310478A CN111310478A (en) | 2020-06-19 |
CN111310478B true CN111310478B (en) | 2023-09-19 |
Family
ID=71149880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010193466.3A Active CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111310478B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6810376B1 (en) * | 2000-07-11 | 2004-10-26 | Nusuara Technologies Sdn Bhd | System and methods for determining semantic similarity of sentences |
JP2013191194A (en) * | 2012-02-15 | 2013-09-26 | Nippon Telegr & Teleph Corp <Ntt> | Document categorizing device, method thereof and program |
EP3499384A1 (en) * | 2017-12-18 | 2019-06-19 | Fortia Financial Solutions | Word and sentence embeddings for sentence classification |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN109947919A (en) * | 2019-03-12 | 2019-06-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating text matches model |
CN110334324A (en) * | 2019-06-18 | 2019-10-15 | 平安普惠企业管理有限公司 | A kind of Documents Similarity recognition methods and relevant device based on natural language processing |
CN110852096A (en) * | 2019-06-27 | 2020-02-28 | 暨南大学 | Method for automatically generating Chinese literature reviews |
Non-Patent Citations (2)
Title |
---|
Document Similarity Detection Using Indonesian Language Word2Vec Model; Nahda Rosa Ramadhanti et al.; 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS); 2020-02-06; full text *
Research on Keyword Extraction and Knowledge Association for Online Reviews; Han Jinbo; China Master's Theses Full-text Database (Electronic Journal); 2018-04-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111310478A (en) | 2020-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509474B (en) | Synonym expansion method and device for search information | |
Zheng et al. | Learning to reweight terms with distributed representations | |
CN107562717B (en) | Text keyword extraction method based on combination of Word2Vec and Word co-occurrence | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN105279252B (en) | Excavate method, searching method, the search system of related term | |
Hoffart et al. | KORE: keyphrase overlap relatedness for entity disambiguation | |
WO2017101342A1 (en) | Sentiment classification method and apparatus | |
CN104794169B (en) | A kind of subject terminology extraction method and system based on sequence labelling model | |
Al-Khawaldeh et al. | Lexical cohesion and entailment based segmentation for arabic text summarization (lceas) | |
Lan | Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF‐IDF Method | |
Ng et al. | Novelty detection for text documents using named entity recognition | |
Zhang et al. | Keywords extraction based on word2vec and textrank | |
CN114943220B (en) | Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking | |
CN111428031A (en) | Graph model filtering method fusing shallow semantic information | |
CN117763106A (en) | Document duplicate checking method and device, storage medium and electronic equipment | |
Juan | An effective similarity measurement for FAQ question answering system | |
CN111310478B (en) | Similar sentence detection method based on TF-IDF and word vector | |
Kashefi et al. | Optimizing Document Similarity Detection in Persian Information Retrieval. | |
CN114969324B (en) | Chinese news headline classification method based on subject word feature expansion | |
CN112632287B (en) | Electric power knowledge graph construction method and device | |
Al Ghamdi et al. | Assessment of performance of machine learning based similarities calculated for different English translations of Holy Quran | |
Gepalova et al. | CLEF 2024 JOKER task 1: exploring pun detection using the T5 transformer model | |
CN113590738A (en) | Method for detecting network sensitive information based on content and emotion | |
Agrawal et al. | A graph based ranking strategy for automated text summarization | |
CN113312908B (en) | Sentence similarity calculation method, sentence similarity calculation system and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |