CN111310478B - Similar sentence detection method based on TF-IDF and word vector - Google Patents
Similar sentence detection method based on TF-IDF and word vector
- Publication number
- CN111310478B (application CN202010193466.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- sentence
- sentences
- similarity
- idf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a similar sentence detection method based on TF-IDF and word vectors, belonging to the technical field of natural language processing. The invention uses TF-IDF to assign each word a weight that reflects the word's importance and its influence on the similarity of the whole sentence: the more important the word, the larger its TF-IDF value and the greater its influence on the similarity. At the same time, to address the problems that the co-occurrence-based Jaccard algorithm does not compare sentences at the semantic level and cannot handle synonyms and near-synonyms, the invention introduces word vectors so that similarity can be computed at the semantic level. Moreover, with the proposed method, replacing words in a sentence with synonyms does not affect the similarity of the two sentences, so the proposed sentence similarity detection method can be used for duplicate checking of Chinese articles, retrieval of similar documents, and the like.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a similar sentence detection method based on TF-IDF and word vectors.
Background
The basic idea of the co-occurrence-based Jaccard algorithm is: the larger the shared portion of two sentences, i.e., the more co-occurring words they have, the more similar the two sentences are. The proportion of co-occurring words relative to all words therefore provides a numerical measure of the similarity of the two sentences:
Sim(A, B) = |Inter(A, B)| / |Union(A, B)|
where Inter(A, B) denotes the word intersection of sentence A and sentence B, Union(A, B) denotes the word union of sentence A and sentence B, |·| denotes the number of elements of a set, and Sim(A, B) denotes the similarity of sentence A and sentence B.
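As a minimal illustration of this baseline (the helper function is hypothetical and not part of the patent), the co-occurrence-based Jaccard similarity of two segmented sentences can be computed as follows:

```python
def jaccard_similarity(words_a, words_b):
    """Co-occurrence-based Jaccard similarity of two segmented sentences."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Two sentences sharing 3 of 5 distinct words -> similarity 0.6.
print(jaccard_similarity(["机器", "学习", "很", "有趣"],
                         ["深度", "学习", "很", "有趣"]))
```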
The existing co-occurrence-based Jaccard algorithm has two main problems:
(1) The importance of each word is not considered.
(2) It does not compare sentences at the semantic level, so synonyms and near-synonyms cannot be handled.
When word2vec is used for natural language processing, each word is trained into a word vector by a three-layer neural network. This approach overcomes the problems that the traditional Bag-of-Words (BOW) model cannot represent contextual semantic information and suffers from the curse of dimensionality, so that semantically similar words obtain similar vector representations.
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used to evaluate the importance of a term to a document in a corpus. The TF-IDF value is obtained by multiplying the term frequency by the inverse document frequency, i.e., TF × IDF. The term frequency (TF) is the frequency with which a term t appears in a document d, and the inverse document frequency (IDF) reflects the category-discriminating power of the term t: the fewer documents contain t, the larger the IDF. TF is calculated as tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d), and IDF is calculated as idf_t = log(N / df_t), where f(t, d) denotes the number of occurrences of the term t in the document d, df_t denotes the number of documents in the corpus that contain the term t, and N denotes the total number of documents in the corpus. The TF-IDF weight of the term t is then tfidf_t = tf(t, d) × idf_t. The weight of a term t thus increases in proportion to the number of times it appears in the document, but at the same time decreases with the frequency of its occurrence across the corpus.
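A minimal sketch of these formulas, assuming the corpus is given as a list of already-segmented documents (function and variable names are illustrative, not from the patent):

```python
import math
from collections import Counter

def tf_idf_weights(document, corpus):
    """TF-IDF weight of every word in `document` (a list of words) w.r.t. `corpus`."""
    counts = Counter(document)
    total_terms = sum(counts.values())
    n_docs = len(corpus)
    weights = {}
    for term, freq in counts.items():
        tf = freq / total_terms                        # tf(t, d) = f(t, d) / sum of f(t', d)
        df = sum(1 for doc in corpus if term in doc)   # df_t: documents containing t
        idf = math.log(n_docs / df) if df else 0.0     # idf_t = log(N / df_t)
        weights[term] = tf * idf                       # tfidf_t = tf(t, d) * idf_t
    return weights

corpus = [["机器", "学习", "很", "有趣"], ["深度", "学习", "很", "有趣"], ["今天", "天气", "很", "好"]]
print(tf_idf_weights(corpus[0], corpus))  # words shared by all documents get weight 0
```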
Disclosure of Invention
The purpose of the invention is: to address the shortcomings of the co-occurrence-based Jaccard method, the way similarity between sentences is measured is improved, yielding a similar sentence detection method based on TF-IDF and word vectors, so that replacing words in a sentence with synonyms does not affect the similarity detection of the two sentences.
The invention discloses a similar sentence detection method based on TF-IDF and word vectors. For two sentences A and B to undergo similarity detection, the following steps are executed:
Step 1: generate the word vector of each word in sentence A and sentence B;
Step 2: perform word segmentation on sentences A and B and remove stop words, obtaining the segmentation results A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m);
where a_i, i = 1, 2, …, n, denotes the i-th word of sentence A, and n denotes the number of words in sentence A;
b_j, j = 1, 2, …, m, denotes the j-th word of sentence B, and m denotes the number of words in sentence B;
Step 3: calculate the TF-IDF weight of each word in sentences A and B using the term frequency–inverse document frequency (TF-IDF) method, where w_i denotes the TF-IDF weight of word a_i and w_j denotes the TF-IDF weight of word b_j;
Step 4: calculate the cosine value cos(A_i, B_j) between each word in sentence A and each word in sentence B, obtaining the word vector similarity matrix of sentences A and B;
where A_i denotes the word vector corresponding to word a_i, B_j denotes the word vector corresponding to word b_j, and the subscripts i = 1, 2, …, n and j = 1, 2, …, m;
Step 5: traverse the word vector similarity matrix of sentences A and B and compare each cosine value cos(A_i, B_j) with a preset threshold α;
if the current cosine value cos(A_i, B_j) is greater than or equal to α, words a_i and b_j are regarded as a similar part of the two sentences, and the similarity measure corresponding to cos(A_i, B_j) is calculated as w_i × w_j × cos(A_i, B_j);
if cos(A_i, B_j) is less than α, words a_i and b_j are regarded as a dissimilar part of the two sentences, and the dissimilarity measure corresponding to cos(A_i, B_j) is calculated as w_i × w_j × (1 − cos(A_i, B_j));
all similarity measures are accumulated, and the total is denoted Sum1;
all dissimilarity measures are accumulated, and the total is denoted Sum2;
Step 6: calculate the similarity of sentence A and sentence B as Sim(A, B) = Sum1 / (Sum1 + Sum2);
Step 7: compare the similarity Sim(A, B) of sentences A and B with a preset threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences; otherwise, sentences A and B are judged to be dissimilar sentences.
In summary, owing to the adoption of the above technical scheme, the beneficial effects of the invention are as follows:
the invention utilizes TF-IDF, adds weight (TF-IDF value) to each word, which is used for reflecting the importance degree of the word and the influence of the calculation of the similarity of the whole sentence, and the more important word, the larger the TF-IDF value, the larger the influence on the similarity; meanwhile, aiming at the problem that the similarity comparison of the semantic level is not related to the Jaccard algorithm based on the co-occurrence word and the synonym and the paraphrasing cannot be processed, the invention improves the word vector so that the similarity calculation can be performed on the semantic level. Meanwhile, the sentence similarity method provided by the invention can not influence the similarity of two sentences after the synonym replacement is carried out on the words in the sentences, so that the sentence similarity method provided by the invention can be used for searching the Chinese articles, searching the documents and the like.
Drawings
FIG. 1 is a general flow chart of a similar sentence detection method based on TF-IDF and word vectors.
FIG. 2 is a flowchart showing a method for calculating the similarity of two sentences according to the word vector similarity matrix.
FIG. 3 is a flowchart showing the duplicate checking process according to the similar sentence detecting method of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
The invention uses TF-IDF to improve the existing co-occurrence-based Jaccard algorithm by assigning each word a weight (its TF-IDF value) that reflects the word's importance and its influence on the similarity of the whole sentence: the more important the word, the larger its TF-IDF value and the greater its influence on the similarity. To address the problems that the co-occurrence-based Jaccard algorithm does not compare sentences at the semantic level and cannot handle synonyms and near-synonyms, the invention uses word vectors so that similarity is computed at the semantic level, thereby obtaining a similar sentence detection method based on TF-IDF and word vectors. Because replacing words in a sentence with synonyms does not affect the similarity detection of the two sentences, the method can be used for retrieval of similar Chinese documents, duplicate checking of Chinese articles, and other applications involving sentence comparison.
Taking sentences A and B as examples, the similar sentence detection method based on TF-IDF and word vectors of the invention comprises the following specific implementation steps:
The first step: word2vec is used to train word vectors, obtaining the word vector of each word in the sentences.
The second step: a word segmentation tool is used to segment sentences A and B and remove stop words, so that the two sentences are represented as A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m), where n denotes the number of words in sentence A, m denotes the number of words in sentence B, a_i denotes the i-th word of sentence A, and b_j denotes the j-th word of sentence B.
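A minimal sketch of the first two steps, assuming gensim's Word2Vec implementation and the jieba segmenter are used (the patent does not prescribe particular tools; the toy corpus and stop-word list below are illustrative):

```python
import jieba
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large document collection.
raw_texts = ["机器学习很有趣", "深度学习很有趣", "今天天气很好"]
corpus = [list(jieba.cut(text)) for text in raw_texts]

# First step: train word vectors with word2vec.
model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1)

STOP_WORDS = {"的", "了", "是", "在"}  # illustrative stop-word list

def segment(sentence):
    """Second step: segment a sentence and remove stop words."""
    return [w for w in jieba.cut(sentence) if w.strip() and w not in STOP_WORDS]

words_a, words_b = segment("机器学习很有趣"), segment("深度学习很有趣")
vectors_a = [model.wv[w] for w in words_a]  # word vector of each word in sentence A
vectors_b = [model.wv[w] for w in words_b]
```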
The third step: the TF-IDF algorithm is used to obtain the TF-IDF value of each word in sentence A and sentence B.
The fourth step: the cosine of the angle between the word vector of each word in sentence A and the word vector of each word in sentence B is calculated, obtaining the word vector similarity matrix of sentences A and B, as shown in Table 1 below.
TABLE 1. Word vector similarity matrix of sentence A and sentence B
Word vector | B_1 | B_2 | …… | B_j | …… | B_m
A_1 | cos(A_1, B_1) | cos(A_1, B_2) | …… | cos(A_1, B_j) | …… | cos(A_1, B_m)
A_2 | cos(A_2, B_1) | cos(A_2, B_2) | …… | cos(A_2, B_j) | …… | cos(A_2, B_m)
…… | …… | …… | …… | …… | …… | ……
A_i | cos(A_i, B_1) | cos(A_i, B_2) | …… | cos(A_i, B_j) | …… | cos(A_i, B_m)
…… | …… | …… | …… | …… | …… | ……
A_n | cos(A_n, B_1) | cos(A_n, B_2) | …… | cos(A_n, B_j) | …… | cos(A_n, B_m)
where A_i denotes the word vector corresponding to word a_i, B_j denotes the word vector corresponding to word b_j, and cos(A_i, B_j) denotes the cosine of the angle between word vectors A_i and B_j, i.e., the similarity of word vectors A_i and B_j, with i = 1, 2, …, n and j = 1, 2, …, m.
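A short sketch of the fourth step, reusing vectors_a and vectors_b from the sketch above (numpy is assumed for the vector arithmetic):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# similarity_matrix[i][j] = cos(A_i, B_j), as in Table 1.
similarity_matrix = [[cosine(a, b) for b in vectors_b] for a in vectors_a]
```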
The fifth step: the word vector similarity matrix of sentences A and B is traversed, and each cosine value cos(A_i, B_j) is compared with the threshold α;
if the cosine value is greater than or equal to α, words a_i and b_j are regarded as a similar part of the two sentences, and this similar part is accumulated using the formula Sum1 = Sum1 + w_i * w_j * cos(A_i, B_j), where the initial value of Sum1 is 0;
if the cosine value is less than α, words a_i and b_j are regarded as a dissimilar part of the two sentences, and this dissimilar part is accumulated using the formula Sum2 = Sum2 + w_i * w_j * (1 − cos(A_i, B_j)), where the initial value of Sum2 is 0.
Here w_i and w_j denote the TF-IDF values of words a_i and b_j calculated in the third step, respectively. Preferably, the value range of α is set to [0.5, 0.9].
The sixth step: after the word vector similarity matrix has been traversed, the similarity of sentence A and sentence B is calculated from the similar part Sum1 and the dissimilar part Sum2 obtained in the fifth step, using the formula Sim(A, B) = Sum1 / (Sum1 + Sum2).
The seventh step: the calculated similarity Sim(A, B) of sentence A and sentence B is compared with the threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences, and if Sim(A, B) is less than β, sentences A and B are judged to be dissimilar sentences. Preferably, the value range of β is set to [0.7, 0.8].
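Putting the fifth to seventh steps together, a minimal sketch of the sentence-level decision could look as follows; it reuses the cosine helper from the previous sketch, assumes tfidf_weight is a dict mapping every word of both sentences to its TF-IDF weight from the third step, and uses thresholds drawn from the preferred ranges above:

```python
def sentence_similarity(words_a, words_b, vectors_a, vectors_b, tfidf_weight, alpha=0.7):
    """Fifth and sixth steps: accumulate weighted (dis)similar parts and return Sim(A, B)."""
    sum1, sum2 = 0.0, 0.0
    for a_word, a_vec in zip(words_a, vectors_a):
        for b_word, b_vec in zip(words_b, vectors_b):
            c = cosine(a_vec, b_vec)
            w = tfidf_weight[a_word] * tfidf_weight[b_word]
            if c >= alpha:
                sum1 += w * c          # Sum1 += w_i * w_j * cos(A_i, B_j)
            else:
                sum2 += w * (1.0 - c)  # Sum2 += w_i * w_j * (1 - cos(A_i, B_j))
    return sum1 / (sum1 + sum2) if (sum1 + sum2) > 0 else 0.0

# Seventh step: threshold the sentence similarity with beta.
beta = 0.75
sim = sentence_similarity(words_a, words_b, vectors_a, vectors_b, tfidf_weight, alpha=0.7)
is_similar = sim >= beta
```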
With the sentence similarity method provided by the invention, replacing words in a sentence with synonyms does not affect the similarity of the two sentences, so the similar sentence detection method of the invention can be used for duplicate checking of Chinese articles. Referring to FIG. 3, article duplicate checking based on the similar sentence detection method of the invention proceeds as follows:
(1) Split the article to be checked into sentences and record the total number of sentences S of the article to be checked.
(2) Split all articles in the article library into sentences.
(3) Compare the sentences of the article to be checked, one by one and in sequence, with the sentences of all articles in the article library.
(4) Use the similar sentence detection method of the invention to judge whether the two sentences being compared are similar; if so, increase the counter count by 1, where the initial value of count is 0.
(5) Repeat steps (3) and (4) until all sentences of the article to be checked have been compared.
(6) Calculate the repetition rate ζ of the article to be checked against the article library according to the formula ζ = count / S, and output the duplicate-check result; a compact sketch of this flow is given below.
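A compact sketch of this duplicate-checking flow; the sentence splitter and the is_similar_sentence callback (wrapping steps 1–7 of the detection method) are illustrative placeholders, and each sentence of the article is counted at most once, which is one reasonable reading of step (4):

```python
import re

def split_sentences(article):
    """Illustrative splitter on Chinese/Latin sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?]", article) if s.strip()]

def repetition_rate(article, article_library, is_similar_sentence):
    """Fraction of sentences in `article` that have a similar sentence in the library."""
    sentences = split_sentences(article)                    # S = len(sentences)
    library_sentences = [s for doc in article_library for s in split_sentences(doc)]
    count = sum(1 for s in sentences
                if any(is_similar_sentence(s, t) for t in library_sentences))
    return count / len(sentences) if sentences else 0.0     # zeta = count / S
```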
When the method is used for similar document retrieval, the document to be retrieved is first split into sentences, and each document in the document library is likewise split into sentences; the sentences of the document to be retrieved are compared with the sentences of each document in the library, and the ratio of the number of detected similar sentences to the total number of sentences of the document to be retrieved is taken as the document similarity rate; if the document similarity rate between the currently compared document and the document to be retrieved reaches a preset document similarity threshold, the currently compared document is regarded as a similar document of the document to be retrieved; finally, all similar documents of the document to be retrieved and the corresponding document similarity rates are output.
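A sketch of this retrieval variant under the same assumptions, reusing split_sentences from the previous sketch (doc_threshold stands for the preset document similarity threshold, whose value is not specified in the text):

```python
def similar_documents(query_doc, document_library, is_similar_sentence, doc_threshold=0.8):
    """Return (document, similarity rate) pairs whose rate reaches the threshold."""
    query_sentences = split_sentences(query_doc)
    results = []
    for doc in document_library:
        doc_sentences = split_sentences(doc)
        hits = sum(1 for s in query_sentences
                   if any(is_similar_sentence(s, t) for t in doc_sentences))
        rate = hits / len(query_sentences) if query_sentences else 0.0
        if rate >= doc_threshold:
            results.append((doc, rate))
    return results
```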
While the invention has been described in terms of specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the equivalent or similar purpose, unless expressly stated otherwise; all of the features disclosed, or all of the steps in a method or process, except for mutually exclusive features and/or steps, may be combined in any manner.
Claims (4)
1. A similar sentence detection method based on TF-IDF and word vectors, characterized in that the following steps are executed for sentences A and B to undergo similarity detection:
Step 1: generate the word vector of each word in sentence A and sentence B;
Step 2: perform word segmentation on sentences A and B and remove stop words, obtaining the segmentation results A(a_1, a_2, …, a_i, …, a_n) and B(b_1, b_2, …, b_j, …, b_m);
where a_i, i = 1, 2, …, n, denotes the i-th word of sentence A, and n denotes the number of words in sentence A;
b_j, j = 1, 2, …, m, denotes the j-th word of sentence B, and m denotes the number of words in sentence B;
Step 3: calculate the TF-IDF weight of each word in sentences A and B using the TF-IDF method, where w_i denotes the TF-IDF weight of word a_i and w_j denotes the TF-IDF weight of word b_j;
Step 4: calculate the cosine value cos(A_i, B_j) between each word in sentence A and each word in sentence B, obtaining the word vector similarity matrix of sentences A and B; where A_i denotes the word vector corresponding to word a_i and B_j denotes the word vector corresponding to word b_j;
Step 5: traverse the word vector similarity matrix of sentences A and B and compare each cosine value cos(A_i, B_j) with a preset threshold α;
if the current cosine value cos(A_i, B_j) is greater than or equal to α, calculate the similarity measure w_i × w_j × cos(A_i, B_j) corresponding to cos(A_i, B_j);
if cos(A_i, B_j) is less than α, calculate the dissimilarity measure w_i × w_j × (1 − cos(A_i, B_j)) corresponding to cos(A_i, B_j);
accumulate all similarity measures, the total being denoted Sum1;
accumulate all dissimilarity measures, the total being denoted Sum2;
Step 6: calculate the similarity of sentence A and sentence B as Sim(A, B) = Sum1 / (Sum1 + Sum2);
Step 7: compare the similarity Sim(A, B) of sentences A and B with a preset threshold β; if Sim(A, B) is greater than or equal to β, sentences A and B are judged to be similar sentences; otherwise, sentences A and B are judged to be dissimilar sentences.
2. The method of claim 1, wherein in step 1, the word vector generation tool word2vec is used to generate the word vector of each word in sentence A and sentence B.
3. The method of claim 1, wherein in step 5, the threshold α is set in a range of values: [0.5,0.9].
4. The method of claim 1, wherein in step 7, the threshold β has a value in the range of: [0.7,0.8].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010193466.3A CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010193466.3A CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111310478A CN111310478A (en) | 2020-06-19 |
CN111310478B true CN111310478B (en) | 2023-09-19 |
Family
ID=71149880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010193466.3A Active CN111310478B (en) | 2020-03-18 | 2020-03-18 | Similar sentence detection method based on TF-IDF and word vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111310478B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6810376B1 (en) * | 2000-07-11 | 2004-10-26 | Nusuara Technologies Sdn Bhd | System and methods for determining semantic similarity of sentences |
JP2013191194A (en) * | 2012-02-15 | 2013-09-26 | Nippon Telegr & Teleph Corp <Ntt> | Document categorizing device, method thereof and program |
EP3499384A1 (en) * | 2017-12-18 | 2019-06-19 | Fortia Financial Solutions | Word and sentence embeddings for sentence classification |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN109947919A (en) * | 2019-03-12 | 2019-06-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating text matches model |
CN110334324A (en) * | 2019-06-18 | 2019-10-15 | 平安普惠企业管理有限公司 | A kind of Documents Similarity recognition methods and relevant device based on natural language processing |
CN110852096A (en) * | 2019-06-27 | 2020-02-28 | 暨南大学 | Method for automatically generating Chinese literature reviews |
Non-Patent Citations (2)
Title |
---|
Document Similarity Detection Using Indonesian Language Word2Vec Model; Nahda Rosa Ramadhanti et al.; 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS); 2020-02-06; full text *
Research on Keyword Extraction and Knowledge Association for Online Reviews; Han Jinbo; China Master's Theses Full-text Database (Electronic Journal); 2018-04-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111310478A (en) | 2020-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509474B (en) | Synonym expansion method and device for search information | |
Zheng et al. | Learning to reweight terms with distributed representations | |
CN107562717B (en) | Text keyword extraction method based on combination of Word2Vec and Word co-occurrence | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN105279252B (en) | Excavate method, searching method, the search system of related term | |
Hoffart et al. | KORE: keyphrase overlap relatedness for entity disambiguation | |
WO2017101342A1 (en) | Sentiment classification method and apparatus | |
CN104794169B (en) | A kind of subject terminology extraction method and system based on sequence labelling model | |
Al-Khawaldeh et al. | Lexical cohesion and entailment based segmentation for arabic text summarization (lceas) | |
Lan | Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF‐IDF Method | |
Ng et al. | Novelty detection for text documents using named entity recognition | |
Zhang et al. | Keywords extraction based on word2vec and textrank | |
CN114943220B (en) | Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking | |
CN111428031A (en) | Graph model filtering method fusing shallow semantic information | |
CN117763106A (en) | Document duplicate checking method and device, storage medium and electronic equipment | |
Juan | An effective similarity measurement for FAQ question answering system | |
CN111310478B (en) | Similar sentence detection method based on TF-IDF and word vector | |
Kashefi et al. | Optimizing Document Similarity Detection in Persian Information Retrieval. | |
CN114969324B (en) | Chinese news headline classification method based on subject word feature expansion | |
CN112632287B (en) | Electric power knowledge graph construction method and device | |
Al Ghamdi et al. | Assessment of performance of machine learning based similarities calculated for different English translations of Holy Quran | |
Gepalova et al. | CLEF 2024 JOKER task 1: exploring pun detection using the T5 transformer model | |
CN113590738A (en) | Method for detecting network sensitive information based on content and emotion | |
Agrawal et al. | A graph based ranking strategy for automated text summarization | |
CN113312908B (en) | Sentence similarity calculation method, sentence similarity calculation system and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |