CN107562824B - Text similarity detection method - Google Patents


Info

Publication number
CN107562824B
Authority
CN
China
Prior art keywords
text
similarity
calculating
length
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710716710.8A
Other languages
Chinese (zh)
Other versions
CN107562824A (en)
Inventor
龙华
祁俊辉
杜庆治
邵玉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201710716710.8A
Publication of CN107562824A
Application granted
Publication of CN107562824B
Active legal status
Anticipated expiration legal status

Abstract

The invention relates to a text similarity detection method and belongs to the technical field of natural language processing. First, the similarity of the texts is calculated with the conventional Simhash algorithm. Next, an N-Gram language model is introduced to combine the text keywords so that they carry contextual linkage, and the similarity is calculated again with the Simhash algorithm. Then the longest common substring is introduced as a further criterion for judging similarity, and a third similarity is calculated. Finally, each of the similarities obtained above is given a corresponding weight, and the weighted values are summed to give the final similarity. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.

Description

Text similarity detection method
Technical Field
The invention relates to a text similarity detection method, and belongs to the technical field of natural language processing.
Background
Currently, many learning materials are stored in large-scale data centers. However, these data centers are filled with large numbers of duplicate or similar files, which burdens their storage space and hampers data retrieval by search engines.
Simhash is currently a mainstream algorithm for near-duplicate text detection, but text similarity detection using Simhash still has many problems. For example, its accuracy on short texts is poor, and fingerprint generation involves several rounds of dimensionality reduction, which may cause some effective information to be lost.
Disclosure of Invention
The invention provides a text similarity detection method that addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and that increases the accuracy and reliability of text similarity detection.
The technical scheme of the invention is as follows: a text similarity detection method comprising the following specific steps:
Step1, inputting text A and text B;
Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and of text B as that word's weight; using the Simhash algorithm with these weights to generate, from the content words of text A and of text B respectively, fingerprints of length l_1, and calculating the Hamming distance h_1 between the two fingerprints; calculating, from the Hamming distance h_1 and the fingerprint length l_1, the similarity I(A, B) of text A and text B based on the Simhash algorithm;
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B by using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as that compound word's weight; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and of text B respectively, fingerprints of length l_2, and calculating the Hamming distance h_2 between the two fingerprints; calculating, from the Hamming distance h_2 and the fingerprint length l_2, the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm;
Step4, finding the longest common substring of text A and text B; calculating, from the length l_3 of the longest common substring, the length l_A of text A and the length l_B of text B, the similarity Z(A, B) of text A and text B based on the longest common substring;
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z.
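The fingerprint generation used in Step2 and Step3 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `simhash`, the use of MD5 as the per-feature hash, and the rule that a bit is set when its weighted total is positive are all assumptions, since the source only names "a Simhash algorithm" driven by TF-IDF weights.

```python
import hashlib


def simhash(weights, bits=128):
    """Build a fingerprint string of `bits` characters from {feature: weight}.

    Assumption: MD5 (128 bits) is the per-feature hash; the source does not
    say which hash function is used.
    """
    if not 1 <= bits <= 128:
        raise ValueError("this sketch supports 1 to 128 bits (MD5 digest size)")
    totals = [0.0] * bits
    for feature, weight in weights.items():
        digest = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
        for pos in range(bits):
            # Read the digest from its most significant bit downwards.
            bit = (digest >> (127 - pos)) & 1
            # A 1 bit adds the feature's weight, a 0 bit subtracts it.
            totals[pos] += weight if bit else -weight
    # Collapse the weighted totals back to one bit per position.
    return "".join("1" if total > 0 else "0" for total in totals)
```

Because each feature votes with its weight, heavily weighted content words dominate the fingerprint, which is why the choice of TF-IDF reference set matters for the result.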
In Step1, the input text A and text B are short texts.
The preprocessing of text A and text B in Step2 and Step3 comprises word segmentation, synonym replacement and stop-word removal, performed using a word segmentation package, a synonym library and a stop-word library respectively.
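The synonym replacement and stop-word removal stages can be sketched as below. The sketch assumes the text has already been segmented into tokens by some segmentation package; the function name `preprocess`, the English glosses, the synonym table and the stop-word set are hypothetical stand-ins for the libraries the source mentions.

```python
def preprocess(tokens, synonyms, stop_words):
    """Map each token to its canonical synonym, then drop stop words."""
    normalized = [synonyms.get(token, token) for token in tokens]
    return [token for token in normalized if token not in stop_words]


# Hypothetical resources, for illustration only.
SYNONYMS = {"call": "yell", "stadium": "playground"}
STOP_WORDS = {",", "!"}

tokens_b = ["Xiaoming", ",", "you", "buddy", "call", "you", "go", "playground", "!"]
content_words_b = preprocess(tokens_b, SYNONYMS, STOP_WORDS)
```

Replacing synonyms before fingerprinting means that two texts which differ only in word choice produce the same content words, so the later Simhash steps are not penalised for that difference.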
In Step2, the formula for calculating the similarity I(A, B) between text A and text B is:
$I(A,B) = \frac{l_1 - h_1}{l_1}$
In Step3, the formula for calculating the similarity J(A, B) between text A and text B is:
$J(A,B) = \frac{l_2 - h_2}{l_2}$
In Step4, the formula for calculating the similarity Z(A, B) between text A and text B is as follows:
Figure BDA0001383916360000023
The beneficial effects of the invention are as follows: the invention introduces an N-Gram language model and the longest common substring to improve the Simhash algorithm. First, the similarity of the texts is calculated with the conventional Simhash algorithm. Next, an N-Gram language model is introduced to combine the text keywords so that they carry contextual linkage, and the similarity is calculated again with the Simhash algorithm. Then the longest common substring is introduced as a further criterion for judging similarity, and a third similarity is calculated. Finally, each of the similarities obtained above is given a corresponding weight, and the weighted values are summed to give the final similarity. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a detailed flowchart of Step2 according to the present invention;
FIG. 3 is a detailed flowchart of Step3 according to the present invention;
FIG. 4 is a detailed flowchart of Step4 according to the present invention;
FIG. 5 is a detailed flowchart of Step5 according to the present invention.
Detailed Description
Example 1: as shown in FIG. 1 to FIG. 5, a text similarity detection method includes the following specific steps:
Step1, inputting text A and text B;
The content of text A is "Xiaoming, your buddy yells for you to go to the stadium to play basketball, and then have dinner together on the way!". The content of text B is "Xiaoming, your buddy calls you to go to the playground to play football, and then eat dinner together!".
Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and of text B as that word's weight; using the Simhash algorithm with these weights to generate, from the content words of text A and of text B respectively, fingerprints of length l_1, and calculating the Hamming distance h_1 between the two fingerprints; calculating, from the Hamming distance h_1 and the fingerprint length l_1, the similarity of text A and text B based on the Simhash algorithm:
$I(A,B) = \frac{l_1 - h_1}{l_1}$
Specifically, the method comprises the following steps:
After preprocessing, the content words of text A are "Xiaoming/you/buddy/yell/you/go/playground/basketball/after/by the way/together/dinner/", and the content words of text B are "Xiaoming/you/buddy/yell/you/go/playground/football/after/together/dinner/".
Calculating TF-IDF values requires a text set as a reference. Specifically, 100 domestic modern novels are used as the text set for calculating the TF-IDF values of the content words of text A and text B. Simhash fingerprints are then generated from these TF-IDF values with a 128-bit Simhash algorithm. The Simhash fingerprint generated from the content words of text A is:
01011110111100111000010001111011011000100100111110111011000011010100100100110110000101001011100011010110100110010101100110111101
The Simhash fingerprint generated from the content words of text B is:
01011010101000011100101110111010101010001101100111111111101011111100110001110111000111011000000011110100110101011110101000111110
The Hamming distance between the two fingerprints is h_1 = 48; then, by the formula
$I(A,B) = \frac{l_1 - h_1}{l_1}$
the similarity of text A and text B is calculated:
$I(A,B) = \frac{128 - 48}{128} = 62.5\%$
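The Hamming distance and the resulting similarity can be checked directly against the two fingerprints listed above. The sketch below is illustrative: the function names are hypothetical, and the similarity is computed as 1 - h/l, which matches the 62.5% the source reports for h_1 = 48 and l_1 = 128.

```python
# The two 128-bit fingerprints from the example, split in halves for readability.
FP_A = (
    "0101111011110011100001000111101101100010010011111011101100001101"
    "0100100100110110000101001011100011010110100110010101100110111101"
)
FP_B = (
    "0101101010100001110010111011101010101000110110011111111110101111"
    "1100110001110111000111011000000011110100110101011110101000111110"
)


def hamming_distance(fp_a, fp_b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(fp_a, fp_b))


def simhash_similarity(fp_a, fp_b):
    """Similarity as the fraction of matching bits: 1 - h / l."""
    return 1 - hamming_distance(fp_a, fp_b) / len(fp_a)


h1 = hamming_distance(FP_A, FP_B)
similarity_i = simhash_similarity(FP_A, FP_B)
```

Here `h1` evaluates to 48 and `similarity_i` to 0.625, in agreement with the figures in the example.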
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B by using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as that compound word's weight; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and of text B respectively, fingerprints of length l_2, and calculating the Hamming distance h_2 between the two fingerprints; calculating, from the Hamming distance h_2 and the fingerprint length l_2, the similarity of text A and text B based on the N-Gram language model and the Simhash algorithm:
$J(A,B) = \frac{l_2 - h_2}{l_2}$
Specifically, the method comprises the following steps:
Applying the N-Gram language model to the preprocessed content words yields the 2-Gram sets of text A and text B, which are, respectively, "Mingming/your buddy/buddy yell/you go/go to playground/playground basketball/after/by the way/together with the evening/" and "Mingming/your buddy/yell/you go/playground/football/after/again/together/after/together with the evening/".
Similarly, 100 domestic modern novels are used as the text set for calculating the TF-IDF values of the 2-Gram sets of text A and text B, and Simhash fingerprints are generated from these TF-IDF values with a 128-bit Simhash algorithm. The Simhash fingerprint generated from the 2-Gram set of text A is:
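Forming the 2-Gram set amounts to pairing each content word with its successor. A minimal sketch follows, assuming the compound words are built from adjacent content words in order; the function name `bigrams` and the English glosses of text A's content words are illustrative only.

```python
def bigrams(words):
    """Join each word with the word that follows it to form compound words."""
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


# English glosses of text A's content words after preprocessing.
content_words_a = [
    "Xiaoming", "you", "buddy", "yell", "you", "go",
    "playground", "basketball", "after", "by the way", "together", "dinner",
]
two_gram_a = bigrams(content_words_a)
```

A sequence of n content words gives n - 1 compound words, so the 12 content words of text A give 11. Each compound word then receives its own TF-IDF weight before fingerprinting, which is how word order enters the Simhash comparison.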
00101111011011010011110100010111110010100110010000110010011010110001001010110011111010100001010001001101110110011100000111101100
The Simhash fingerprint generated from the 2-Gram set of text B is:
10100111011010111001110100010111110000100110010001001011010001111101001010110011101111110101010011001101110010011100010111001100
The Hamming distance between the two fingerprints is h_2 = 25; then, by the formula
$J(A,B) = \frac{l_2 - h_2}{l_2}$
the similarity of text A and text B is calculated:
$J(A,B) = \frac{128 - 25}{128} = 80.47\%$
Step4, finding the longest common substring of text A and text B; calculating, from the length l_3 of the longest common substring, the length l_A of text A and the length l_B of text B, the similarity Z(A, B) of text A and text B based on the longest common substring. Specifically:
The longest common substring of text A and text B is found to be "Xiaoming, your buddy yells for you to go to the playground"; then, by the formula
Figure BDA0001383916360000044
the similarity of text A and text B is calculated:
$Z(A,B) = 52.17\%$
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:
Assuming that the weights corresponding to the similarities I(A, B), J(A, B) and Z(A, B) are i = 0.3, j = 0.6 and z = 0.1 respectively, the final similarity between text A and text B is calculated by the formula R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:
R(A,B)=I(A,B)×i+J(A,B)×j+Z(A,B)×z
=62.5%×0.3+80.47%×0.6+52.17%×0.1
=72.24%
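The weighted sum in Step5 is simple enough to check by hand, and a short sketch makes the constraint on the weights explicit. The function name `final_similarity` and the tolerance used for the weight check are assumptions; the inputs are the rounded percentages from the example.

```python
def final_similarity(sim_i, sim_j, sim_z, i, j, z):
    """Combine three similarities with weights that must sum to 1."""
    if abs(i + j + z - 1.0) > 1e-9:
        raise ValueError("weights i, j and z must sum to 1")
    return sim_i * i + sim_j * j + sim_z * z


# Similarities and weights from the example.
r = final_similarity(0.625, 0.8047, 0.5217, i=0.3, j=0.6, z=0.1)
```

With these inputs `r` comes to roughly 0.722, which agrees with the example's result up to rounding in the last digit.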
The above results show that the final similarity is 72.24%, which is an improvement over the 62.5% obtained by the conventional Simhash algorithm, especially for short texts (fewer than 200 words). In addition, because the text set used for calculating the TF-IDF values strongly affects the final result, in practical applications the text set should be as rich in content and as broad in type as possible in order to improve detection accuracy. Furthermore, the values of the weights i, j and z corresponding to the similarities I(A, B), J(A, B) and Z(A, B) should be determined by testing different types of texts repeatedly and adjusting the weights accordingly.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to those embodiments, and those skilled in the art can make various changes within their knowledge without departing from the spirit of the present invention.
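The dependence of the weights on the reference text set can be seen in a small TF-IDF sketch. The source does not state which TF-IDF variant is used, so the smoothed inverse document frequency below, log((1 + N) / (1 + df)) + 1, is an assumption chosen to keep every weight positive; the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Weight each token of a document against a reference corpus.

    `corpus` is a list of token lists. The smoothed IDF is an assumed
    variant, not necessarily the one used in the source.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        # Number of reference documents that contain the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    return weights


# Toy reference set: "you" appears everywhere, "basketball" only once.
corpus = [["you", "basketball"], ["you", "dinner"], ["you", "go"]]
weights = tf_idf(["you", "basketball"], corpus)
```

A word that appears in every reference document, such as "you" here, receives a lower weight than a rarer word such as "basketball". A narrow or unrepresentative text set therefore skews which words dominate the fingerprint.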

Claims (3)

1. A text similarity detection method, characterized in that the method comprises the following specific steps:
Step1, inputting text A and text B;
Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and of text B as that word's weight; using the Simhash algorithm with these weights to generate, from the content words of text A and of text B respectively, fingerprints of length l_1, and calculating the Hamming distance h_1 between the two fingerprints; calculating, from the Hamming distance h_1 and the fingerprint length l_1, the similarity I(A, B) of text A and text B based on the Simhash algorithm;
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B by using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as that compound word's weight; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and of text B respectively, fingerprints of length l_2, and calculating the Hamming distance h_2 between the two fingerprints; calculating, from the Hamming distance h_2 and the fingerprint length l_2, the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm;
Step4, finding the longest common substring of text A and text B; calculating, from the length l_3 of the longest common substring, the length l_A of text A and the length l_B of text B, the similarity Z(A, B) of text A and text B based on the longest common substring;
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z;
in Step2, the formula for calculating the similarity I(A, B) between text A and text B is:
$I(A,B) = \frac{l_1 - h_1}{l_1}$
in Step3, the formula for calculating the similarity J(A, B) between text A and text B is:
$J(A,B) = \frac{l_2 - h_2}{l_2}$
in Step4, the formula for calculating the similarity Z(A, B) between text A and text B is as follows:
Figure FDA0002483371520000021
2. The text similarity detection method according to claim 1, characterized in that: in Step1, the input text A and text B are short texts.
3. The text similarity detection method according to claim 1, characterized in that: the preprocessing of text A and text B in Step2 and Step3 comprises word segmentation, synonym replacement and stop-word removal, performed using a word segmentation package, a synonym library and a stop-word library respectively.
CN201710716710.8A 2017-08-21 2017-08-21 Text similarity detection method Active CN107562824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710716710.8A CN107562824B (en) 2017-08-21 2017-08-21 Text similarity detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710716710.8A CN107562824B (en) 2017-08-21 2017-08-21 Text similarity detection method

Publications (2)

Publication Number Publication Date
CN107562824A (en) 2018-01-09
CN107562824B (en) 2020-10-27

Family

ID=60976506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710716710.8A Active CN107562824B (en) 2017-08-21 2017-08-21 Text similarity detection method

Country Status (1)

Country Link
CN (1) CN107562824B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595517B (en) * 2018-03-26 2021-03-09 南京邮电大学 Large-scale document similarity detection method
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN108681535B (en) * 2018-04-11 2022-07-08 广州视源电子科技股份有限公司 Candidate word evaluation method and device, computer equipment and storage medium
CN108846117A (en) * 2018-06-26 2018-11-20 北京金堤科技有限公司 The duplicate removal screening technique and device of business news flash
CN108920633B (en) * 2018-07-01 2021-12-03 湖北通远格知科技有限公司 Paper similarity detection method
CN109189913B (en) * 2018-08-01 2021-10-22 昆明理工大学 Novel recommendation method based on content
CN111859063B (en) * 2019-04-30 2023-11-03 北京智慧星光信息技术有限公司 Control method and device for monitoring transfer seal information in Internet
CN110334324A (en) * 2019-06-18 2019-10-15 平安普惠企业管理有限公司 A kind of Documents Similarity recognition methods and relevant device based on natural language processing
CN110414251B (en) * 2019-07-31 2021-01-05 北京明朝万达科技股份有限公司 Data monitoring method and device
CN110837555A (en) * 2019-11-11 2020-02-25 苏州朗动网络科技有限公司 Method, equipment and storage medium for removing duplicate and screening of massive texts
CN111813930B (en) * 2020-06-15 2024-02-20 语联网(武汉)信息技术有限公司 Similar document retrieval method and device
CN111753547B (en) * 2020-06-30 2024-02-27 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN112882997B (en) * 2021-02-19 2022-06-07 武汉大学 System log analysis method based on N-gram and frequent pattern mining
CN114596182B (en) * 2022-03-09 2023-05-16 王淑娟 Government affair management method and system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000079426A1 (en) * 1999-06-18 2000-12-28 The Trustees Of Columbia University In The City Of New York System and method for detecting text similarity over short passages
CN103207864A (en) * 2012-01-13 2013-07-17 北京中文在线数字出版股份有限公司 Online novel content similarity comparison method
CN103294671A (en) * 2012-02-22 2013-09-11 腾讯科技(深圳)有限公司 Document detection method and system
CN106649222A (en) * 2016-12-13 2017-05-10 浙江网新恒天软件有限公司 Text approximately duplicated detection method based on semantic analysis and multiple Simhash

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《中文文本复制检测技术研究》;卢小康;《万方学位论文》;20110328;第14-25页 *
《文档复制检测方法研究与系统实现》;廖兴伟;《万方学位论文》;20140331;第10-25页 *

Also Published As

Publication number Publication date
CN107562824A (en) 2018-01-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant