CN107562824B - Text similarity detection method - Google Patents
- Publication number: CN107562824B (application number CN201710716710.8A)
- Authority: CN (China)
- Prior art keywords: text, similarity, calculating, length, weight
- Prior art date
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to a text similarity detection method and belongs to the technical field of natural language processing. First, the similarity of two texts is calculated with the conventional Simhash algorithm. Second, an N-Gram language model is introduced to combine adjacent text keywords so that the keywords carry contextual linkage, and the similarity is calculated again with the Simhash algorithm on these combinations. Third, the longest common substring is introduced as a further similarity criterion, and a similarity based on it is calculated. Finally, each of the similarities so obtained is assigned a weight, and the final similarity is computed as their weighted sum. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.
Description
Technical Field
The invention relates to a text similarity detection method, and belongs to the technical field of natural language processing.
Background
Currently, many learning materials are stored in large-scale data centers. However, these data centers hold a large number of duplicate or near-duplicate files, which adversely affects both the storage space of the data center and data retrieval by search engines.
Simhash is currently a mainstream near-duplicate text detection algorithm, but text similarity detection with Simhash still has several problems: its accuracy on short texts is poor, and fingerprint generation involves repeated dimensionality reduction, which may discard some effective information.
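For readers unfamiliar with Simhash, the following is a minimal sketch of weighted fingerprint generation. It is an illustration under stated assumptions, not the implementation used in the invention: the function name `simhash` is hypothetical, and MD5 is chosen as the per-feature hash only because it yields 128 bits; the source does not name a hash function.

```python
import hashlib


def simhash(weighted_features, bits=128):
    """Return a Simhash fingerprint as a string of '0'/'1' characters.

    weighted_features: dict mapping a feature (e.g. a word) to its weight.
    MD5 is an assumption made for this sketch because it yields 128 bits.
    """
    if not 0 < bits <= 128:
        raise ValueError("bits must be between 1 and 128 when using MD5")
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        digest = int.from_bytes(
            hashlib.md5(feature.encode("utf-8")).digest(), "big"
        )
        for pos in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (digest >> (127 - pos)) & 1:
                totals[pos] += weight
            else:
                totals[pos] -= weight
    # Dimensionality reduction: only the sign of each total survives.
    return "".join("1" if total > 0 else "0" for total in totals)


fingerprint = simhash({"basketball": 2.0, "dinner": 1.0})
print(len(fingerprint), fingerprint[:16])
```

Because only the sign of each accumulated total is kept, a single heavily weighted feature can determine every bit of the fingerprint. With few features, as in short texts, this is one way in which effective information can be lost.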
Disclosure of Invention
The invention provides a text similarity detection method, which is used for solving the problems of poor support of a Simhash algorithm on short texts, loss of effective information in a fingerprint generation process and the like and increasing the accuracy and reliability of text similarity detection.
The technical scheme of the invention is as follows: a text similarity detection method comprises the following specific steps:
Step1, inputting text A and text B;
Step2, preprocessing text A and text B to obtain their content words; computing the TF-IDF value of each content word of text A and of text B as the weight of that content word; using the Simhash algorithm with these weights to generate, from the content words of text A and text B respectively, fingerprints of length $l_1$; calculating the Hamming distance $h_1$ between the two fingerprints; and calculating, from the Hamming distance $h_1$ and the fingerprint length $l_1$, the similarity I(A, B) of text A and text B based on the Simhash algorithm;
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B with an N-Gram language model; computing the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and text B respectively, fingerprints of length $l_2$; calculating the Hamming distance $h_2$ between the two fingerprints; and calculating, from the Hamming distance $h_2$ and the fingerprint length $l_2$, the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm;
Step4, finding the longest common substring of text A and text B; and calculating, from the length $l_3$ of the longest common substring, the length $l_A$ of text A and the length $l_B$ of text B, the similarity Z(A, B) of text A and text B based on the longest common substring;
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z.
In Step1, the input text A and text B are short texts.
In Step2 and Step3, preprocessing text A and text B comprises word segmentation, synonym replacement and stop-word removal, performed with a word-segmentation package, a synonym dictionary and a stop-word list respectively.
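The three preprocessing operations can be sketched as follows. This is a minimal illustration only: the regular-expression tokenizer stands in for a real word-segmentation package, and the `SYNONYMS` and `STOP_WORDS` tables, like the function name `preprocess`, are hypothetical placeholders rather than resources named in the source.

```python
import re

# Hypothetical resources; a real system would load a synonym dictionary
# and a stop-word list, and use a proper word-segmentation package.
SYNONYMS = {"soccer": "football", "pal": "buddy"}
STOP_WORDS = {"to", "the", "and", "a", "is"}


def preprocess(text, synonyms=SYNONYMS, stop_words=STOP_WORDS):
    """Segment text, map synonyms to one canonical form, drop stop words."""
    # Stand-in segmentation: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Synonym replacement so that equivalent words hash to the same feature.
    tokens = [synonyms.get(token, token) for token in tokens]
    # Stop-word removal leaves only the content words.
    return [token for token in tokens if token not in stop_words]


print(preprocess("Your pal is going to the playground to play soccer!"))
```

Replacing synonyms before hashing matters because Simhash treats each distinct word as an unrelated feature; two texts that differ only in word choice would otherwise produce diverging fingerprints.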
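The formula for calculating the similarity J(A, B) of text A and text B in Step3 is as follows (consistent with Example 1, where $h_2 = 25$ and $l_2 = 128$ give 80.47%):
$$J(A,B) = \frac{l_2 - h_2}{l_2}$$
The formula for calculating the similarity Z(A, B) of text A and text B in Step4, in terms of $l_3$, $l_A$ and $l_B$, is as follows: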
The beneficial effects of the invention are as follows: the invention improves the Simhash algorithm by introducing an N-Gram language model and the longest common substring. First, the similarity of two texts is calculated with the conventional Simhash algorithm. Second, an N-Gram language model is introduced to combine adjacent text keywords so that the keywords carry contextual linkage, and the similarity is calculated again with the Simhash algorithm. Third, the longest common substring is introduced as a further similarity criterion, and a similarity based on it is calculated. Finally, each of the similarities so obtained is assigned a weight, and the final similarity is computed as their weighted sum. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a detailed flowchart of Step2 according to the present invention;
FIG. 3 is a detailed flowchart of Step3 according to the present invention;
FIG. 4 is a detailed flowchart of Step4 according to the present invention;
FIG. 5 is a detailed flowchart of Step5 according to the present invention.
Detailed Description
Example 1: as shown in FIGS. 1 to 5, a text similarity detection method includes the following specific steps:
Step1, inputting text A and text B;
The content of text A is "Xiaoming, your buddy is yelling for you to go to the playground to play basketball, and then, by the way, have dinner together!" The content of text B is "Xiaoming, your buddy is yelling for you to go to the playground to play football, and then have dinner together!"
Step2, preprocessing text A and text B to obtain their content words; computing the TF-IDF value of each content word of text A and of text B as the weight of that content word; using the Simhash algorithm with these weights to generate, from the content words of text A and text B respectively, fingerprints of length $l_1$; calculating the Hamming distance $h_1$ between the two fingerprints; and calculating, from $h_1$ and $l_1$, the similarity I(A, B) of text A and text B based on the Simhash algorithm. Specifically:
After preprocessing, the content words of text A are "Xiaoming/your/buddy/yell/you/go/playground/basketball/then/by the way/together/dinner/", and the content words of text B are "Xiaoming/your/buddy/yell/you/go/playground/football/then/together/dinner/".
Calculating TF-IDF values requires a text set as a reference. Specifically, 100 local modern novels are used as the text set for calculating the TF-IDF values of the content words of text A and text B. Simhash fingerprints are then generated from these TF-IDF values with a 128-bit Simhash algorithm. The Simhash fingerprint generated from the content words of text A is:
01011110111100111000010001111011011000100100111110111011000011010100100100110110000101001011100011010110100110010101100110111101
The Simhash fingerprint generated from the content words of text B is:
01011010101000011100101110111010101010001101100111111111101011111100110001110111000111011000000011110100110101011110101000111110
The Hamming distance between the two fingerprints is $h_1 = 48$, and the similarity of text A and text B is then calculated by the formula:
$$I(A,B) = \frac{l_1 - h_1}{l_1} = \frac{128 - 48}{128} = 62.5\%$$
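The Step2 arithmetic can be checked with a short sketch. The two fingerprints below are copied from this example and split into 16-bit groups for readability. The function names are hypothetical, and the similarity is taken as one minus the Hamming distance divided by the fingerprint length, which is the relation the example's figures (48, 128, 62.5%) are consistent with.

```python
def hamming_distance(fp_a, fp_b):
    """Count the bit positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(fp_a, fp_b))


def simhash_similarity(fp_a, fp_b):
    """Similarity as the fraction of matching bits: 1 - h / l."""
    return 1 - hamming_distance(fp_a, fp_b) / len(fp_a)


# Fingerprints of text A and text B from Step2 of Example 1.
FP_A = (
    "0101111011110011" "1000010001111011" "0110001001001111" "1011101100001101"
    "0100100100110110" "0001010010111000" "1101011010011001" "0101100110111101"
)
FP_B = (
    "0101101010100001" "1100101110111010" "1010100011011001" "1111111110101111"
    "1100110001110111" "0001110110000000" "1111010011010101" "1110101000111110"
)

h1 = hamming_distance(FP_A, FP_B)
print(h1, simhash_similarity(FP_A, FP_B))
```

Comparing bit strings character by character keeps the sketch readable; a production implementation would more likely store fingerprints as integers and count the set bits of their XOR.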
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B with an N-Gram language model; computing the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and text B respectively, fingerprints of length $l_2$; calculating the Hamming distance $h_2$ between the two fingerprints; and calculating, from $h_2$ and $l_2$, the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm. Specifically:
Applying the N-Gram language model to the preprocessed content words gives the 2-Gram sets of text A and text B, namely "Xiaoming your/your buddy/buddy yell/yell you/you go/go playground/playground basketball/basketball then/then by the way/by the way together/together dinner/" and "Xiaoming your/your buddy/buddy yell/yell you/you go/go playground/playground football/football then/then together/together dinner/" respectively.
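Forming the 2-Gram set amounts to sliding a window of two words over the content-word list. A minimal sketch follows; the function name `two_grams` and the shortened token list are illustrative assumptions.

```python
def two_grams(tokens):
    """Join each pair of adjacent tokens into one compound feature."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


# A shortened, illustrative content-word list based on text A.
tokens_a = ["Xiaoming", "your", "buddy", "yell", "you", "go",
            "playground", "basketball"]
for compound in two_grams(tokens_a):
    print(compound)
```

Each compound word then takes the place of a single word as a Simhash feature, so a changed word alters the two compounds that contain it and word order starts to influence the fingerprint.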
Similarly, 100 local modern novels are used as the text set for calculating the TF-IDF values of the 2-Gram sets of text A and text B, and Simhash fingerprints are generated from these TF-IDF values with a 128-bit Simhash algorithm. The Simhash fingerprint generated from the 2-Gram set of text A is:
00101111011011010011110100010111110010100110010000110010011010110001001010110011111010100001010001001101110110011100000111101100
the Simhash fingerprint generated by the 2-Gram set of text B is:
10100111011010111001110100010111110000100110010001001011010001111101001010110011101111110101010011001101110010011100010111001100
The Hamming distance between the two fingerprints is $h_2 = 25$, and the similarity of text A and text B is then calculated by the formula:
$$J(A,B) = \frac{l_2 - h_2}{l_2} = \frac{128 - 25}{128} = 80.47\%$$
Step4, finding the longest common substring of text A and text B; and calculating, from the length $l_3$ of the longest common substring, the length $l_A$ of text A and the length $l_B$ of text B, the similarity Z(A, B) of text A and text B based on the longest common substring. Specifically:
The longest common substring of text A and text B is found to be "Xiaoming, your buddy is yelling for you to go to the playground"; the similarity of text A and text B calculated from it is Z(A, B) = 52.17%.
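Finding the longest common substring is a standard dynamic-programming exercise, sketched below with a hypothetical function name. The sketch stops at the substring and its length $l_3$; it does not compute Z(A, B), because the formula that combines $l_3$, $l_A$ and $l_B$ is not reproduced in this text.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


shared = longest_common_substring(
    "go to the playground to play basketball",
    "go to the playground to play football",
)
print(repr(shared), len(shared))
```

The table is filled row by row, so the running time is proportional to the product of the two text lengths while only two rows are kept in memory, which is adequate for the short texts the method targets.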
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:
Assuming that the weights corresponding to the similarities I(A, B), J(A, B) and Z(A, B) are i = 0.3, j = 0.6 and z = 0.1 respectively, the final similarity of text A and text B is calculated by the formula R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:
R(A,B)=I(A,B)×i+J(A,B)×j+Z(A,B)×z
=62.5%×0.3+80.47%×0.6+52.17%×0.1
=72.24%
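The weighted sum above can be reproduced with a few lines. The function name `final_similarity` is hypothetical; the inputs are the three similarities and the assumed weights from this example.

```python
def final_similarity(sim_i, sim_j, sim_z, i, j, z):
    """Combine the three similarities with weights that must sum to 1."""
    if abs(i + j + z - 1.0) > 1e-9:
        raise ValueError("weights i, j and z must sum to 1")
    return sim_i * i + sim_j * j + sim_z * z


# Similarities from Step2, Step3 and Step4, with the example weights.
r = final_similarity(0.625, 0.8047, 0.5217, i=0.3, j=0.6, z=0.1)
print(f"{r:.5f}")
```

With these inputs the sum is about 0.72249, so the 72.24% stated above corresponds to truncating rather than rounding the last digit.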
The above results show that the final similarity is 72.24%, higher than the 62.5% obtained by the conventional Simhash algorithm alone; the improvement is most relevant for short texts (fewer than 200 words). Because the text set used to calculate the TF-IDF values strongly influences the final result, in practical applications the text set should be as rich in content and as broad in type as possible to improve detection accuracy. In addition, the values of the weights i, j and z corresponding to the similarities I(A, B), J(A, B) and Z(A, B) should be determined by testing on many texts of different types and adjusting accordingly.
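To illustrate why the reference text set matters, the sketch below computes TF-IDF weights against a small corpus. It is an assumption-laden illustration: the function name `tf_idf`, the toy corpus and the add-one smoothing of the inverse document frequency are all choices made here, since the source does not state which TF-IDF variant is used.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """Weight each distinct token by term frequency times smoothed IDF.

    corpus: list of token lists acting as the reference text set.
    The +1 smoothing is an assumption; the source gives no variant.
    """
    counts = Counter(tokens)
    doc_sets = [set(doc) for doc in corpus]
    weights = {}
    for token, count in counts.items():
        # Document frequency: how many reference texts contain the token.
        doc_freq = sum(1 for doc in doc_sets if token in doc)
        idf = math.log((len(corpus) + 1) / (doc_freq + 1)) + 1
        weights[token] = (count / len(tokens)) * idf
    return weights


# Toy reference set; the example in the text uses 100 novels instead.
corpus = [["you", "go"], ["you", "dinner"], ["you", "basketball"]]
print(tf_idf(["you", "basketball"], corpus))
```

A word that appears in every reference text, such as "you" here, receives the lowest weight, while a rarer word such as "basketball" is weighted more heavily. A narrow reference set can therefore misjudge which words are distinctive, and that misjudgment carries through to the fingerprints.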
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (3)
1. A text similarity detection method, characterized in that the method comprises the following specific steps:
Step1, inputting text A and text B;
Step2, preprocessing text A and text B to obtain their content words; computing the TF-IDF value of each content word of text A and of text B as the weight of that content word; using the Simhash algorithm with these weights to generate, from the content words of text A and text B respectively, fingerprints of length $l_1$; calculating the Hamming distance $h_1$ between the two fingerprints; and calculating, from the Hamming distance $h_1$ and the fingerprint length $l_1$, the similarity I(A, B) of text A and text B based on the Simhash algorithm;
Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B with an N-Gram language model; computing the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; using the Simhash algorithm with these weights to generate, from the 2-Gram sets of text A and text B respectively, fingerprints of length $l_2$; calculating the Hamming distance $h_2$ between the two fingerprints; and calculating, from the Hamming distance $h_2$ and the fingerprint length $l_2$, the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm;
Step4, finding the longest common substring of text A and text B; and calculating, from the length $l_3$ of the longest common substring, the length $l_A$ of text A and the length $l_B$ of text B, the similarity Z(A, B) of text A and text B based on the longest common substring;
Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z;
the formula for calculating the similarity J(A, B) of text A and text B in Step3 is as follows (consistent with Example 1, where $h_2 = 25$ and $l_2 = 128$ give 80.47%):
$$J(A,B) = \frac{l_2 - h_2}{l_2}$$
the formula for calculating the similarity Z(A, B) of text A and text B in Step4, in terms of $l_3$, $l_A$ and $l_B$, is as follows:
2. The text similarity detection method according to claim 1, characterized in that: in Step1, the input text A and text B are short texts.
3. The text similarity detection method according to claim 1, characterized in that: in Step2 and Step3, preprocessing text A and text B comprises word segmentation, synonym replacement and stop-word removal, performed with a word-segmentation package, a synonym dictionary and a stop-word list respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710716710.8A CN107562824B (en) | 2017-08-21 | 2017-08-21 | Text similarity detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710716710.8A CN107562824B (en) | 2017-08-21 | 2017-08-21 | Text similarity detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107562824A (en) | 2018-01-09 |
CN107562824B (en) | 2020-10-27 |
Family
ID=60976506
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710716710.8A (Active) CN107562824B (en) | 2017-08-21 | 2017-08-21 | Text similarity detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107562824B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595517B (en) * | 2018-03-26 | 2021-03-09 | 南京邮电大学 | Large-scale document similarity detection method |
CN108563636A (en) * | 2018-04-04 | 2018-09-21 | 广州杰赛科技股份有限公司 | Extract method, apparatus, equipment and the storage medium of text key word |
CN108681535B (en) * | 2018-04-11 | 2022-07-08 | 广州视源电子科技股份有限公司 | Candidate word evaluation method and device, computer equipment and storage medium |
CN108846117A (en) * | 2018-06-26 | 2018-11-20 | 北京金堤科技有限公司 | The duplicate removal screening technique and device of business news flash |
CN108920633B (en) * | 2018-07-01 | 2021-12-03 | 湖北通远格知科技有限公司 | Paper similarity detection method |
CN109189913B (en) * | 2018-08-01 | 2021-10-22 | 昆明理工大学 | Novel recommendation method based on content |
CN111859063B (en) * | 2019-04-30 | 2023-11-03 | 北京智慧星光信息技术有限公司 | Control method and device for monitoring transfer seal information in Internet |
CN110334324A (en) * | 2019-06-18 | 2019-10-15 | 平安普惠企业管理有限公司 | A kind of Documents Similarity recognition methods and relevant device based on natural language processing |
CN110414251B (en) * | 2019-07-31 | 2021-01-05 | 北京明朝万达科技股份有限公司 | Data monitoring method and device |
CN110837555A (en) * | 2019-11-11 | 2020-02-25 | 苏州朗动网络科技有限公司 | Method, equipment and storage medium for removing duplicate and screening of massive texts |
CN111813930B (en) * | 2020-06-15 | 2024-02-20 | 语联网(武汉)信息技术有限公司 | Similar document retrieval method and device |
CN111753547B (en) * | 2020-06-30 | 2024-02-27 | 上海观安信息技术股份有限公司 | Keyword extraction method and system for sensitive data leakage detection |
CN112882997B (en) * | 2021-02-19 | 2022-06-07 | 武汉大学 | System log analysis method based on N-gram and frequent pattern mining |
CN114596182B (en) * | 2022-03-09 | 2023-05-16 | 王淑娟 | Government affair management method and system based on big data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000079426A1 (en) * | 1999-06-18 | 2000-12-28 | The Trustees Of Columbia University In The City Of New York | System and method for detecting text similarity over short passages |
CN103207864A (en) * | 2012-01-13 | 2013-07-17 | 北京中文在线数字出版股份有限公司 | Online novel content similarity comparison method |
CN103294671A (en) * | 2012-02-22 | 2013-09-11 | 腾讯科技(深圳)有限公司 | Document detection method and system |
CN106649222A (en) * | 2016-12-13 | 2017-05-10 | 浙江网新恒天软件有限公司 | Text approximately duplicated detection method based on semantic analysis and multiple Simhash |
Non-Patent Citations (2)
Title |
---|
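Research on Chinese Text Copy Detection Technology (《中文文本复制检测技术研究》); Lu Xiaokang (卢小康); Wanfang Dissertations (《万方学位论文》); 2011-03-28; pp. 14-25 * |
Research and System Implementation of Document Copy Detection Methods (《文档复制检测方法研究与系统实现》); Liao Xingwei (廖兴伟); Wanfang Dissertations (《万方学位论文》); 2014-03-31; pp. 10-25 * |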
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||