CN107562824B - Text similarity detection method

Info

Publication number: CN107562824B
Application number: CN201710716710.8A
Authority: CN
Inventors: 龙华; 祁俊辉; 杜庆治; 邵玉斌
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2020-10-27
Anticipated expiration: 2037-08-21
Also published as: CN107562824A

Abstract

The invention relates to a text similarity detection method and belongs to the technical field of natural language processing. First, the similarity of the texts is calculated using the conventional Simhash algorithm. Next, an N-Gram language model is introduced to combine the text keywords so that the keywords carry contextual relationships, and the similarity of the texts is again calculated using the Simhash algorithm. Then, the longest common substring is introduced as a further criterion for judging similarity, and a similarity is calculated from it. Finally, each of the similarities obtained above is given a corresponding weight, and the final similarity is calculated as their weighted sum. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.

Description

Text similarity detection method

Technical Field

The invention relates to a text similarity detection method, and belongs to the technical field of natural language processing.

Background

Currently, many learning materials are stored in large-scale data centers. However, these data centers are filled with large numbers of duplicate or similar files, which adversely affects both the storage space of the data center and data retrieval by search engines.

Simhash is currently a mainstream near-duplicate text detection algorithm, but text similarity detection using Simhash still has several problems: detection accuracy on short texts is poor, and fingerprint generation involves multiple dimensionality reductions, which may cause some effective information to be lost.

Disclosure of Invention

The invention provides a text similarity detection method that addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, thereby improving the accuracy and reliability of text similarity detection.

The technical scheme of the invention is as follows: a text similarity detection method comprises the following specific steps:

Step1, inputting text A and text B;

Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and text B as the weight of that content word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₁ from the content words of text A and of text B respectively, and calculating the Hamming distance h₁ between the two fingerprints; calculating the similarity I(A, B) of text A and text B based on the Simhash algorithm from the Hamming distance h₁ and the fingerprint length l₁;

Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₂ from the 2-Gram set of text A and of text B respectively, and calculating the Hamming distance h₂ between the two fingerprints; calculating the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm from the Hamming distance h₂ and the fingerprint length l₂;

Step4, finding the longest common substring of text A and text B; calculating the similarity Z(A, B) of text A and text B based on the longest common substring from the length l₃ of the longest common substring, the length l_A of text A and the length l_B of text B;

Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where the weights satisfy i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z.

In Step1, the input text A and text B are short texts.

The preprocessing of text A and text B in Step2 and Step3 comprises word segmentation, synonym replacement and stop-word removal, performed using a word segmentation package, a synonym library and a stop-word library respectively.

In Step2, the formula for calculating the similarity I(A, B) of text A and text B is:

I(A, B) = (l₁ - h₁) / l₁ × 100%

In Step3, the formula for calculating the similarity J(A, B) of text A and text B is:

J(A, B) = (l₂ - h₂) / l₂ × 100%

In Step4, the formula for calculating the similarity Z(A, B) of text A and text B from l₃, l_A and l_B is:

[formula not reproduced in this text]

The beneficial effects of the invention are as follows: the invention improves the Simhash algorithm by introducing an N-Gram language model and the longest common substring. First, the similarity of the texts is calculated using the conventional Simhash algorithm. Next, an N-Gram language model is introduced to combine the text keywords so that the keywords carry contextual relationships, and the similarity of the texts is again calculated using the Simhash algorithm. Then, the longest common substring is introduced as a further criterion for judging similarity, and a similarity is calculated from it. Finally, each of the similarities obtained above is given a corresponding weight, and the final similarity is calculated as their weighted sum. Compared with the prior art, the method mainly addresses the Simhash algorithm's poor support for short texts and the loss of effective information during fingerprint generation, and improves the accuracy and reliability of text similarity detection.

Drawings

FIG. 1 is a general flow chart of the present invention;

FIG. 2 is a detailed flowchart of Step2 according to the present invention;

FIG. 3 is a detailed flowchart of Step3 according to the present invention;

FIG. 4 is a detailed flowchart of Step4 according to the present invention;

FIG. 5 is a detailed flowchart of Step5 according to the present invention.

Detailed Description

Example 1: As shown in FIGS. 1 to 5, a text similarity detection method comprises the following specific steps:

Step1, inputting text A and text B;

The content of text A is "Xiaoming, your buddy yells for you to go to the playground to play basketball, and then have dinner together by the way!" The content of text B is "Xiaoming, your buddy yells for you to go to the playground to play football, and then have dinner together!"

Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and text B as the weight of that content word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₁ from the content words of text A and of text B respectively, and calculating the Hamming distance h₁ between the two fingerprints; calculating the similarity I(A, B) of text A and text B based on the Simhash algorithm from the Hamming distance h₁ and the fingerprint length l₁.

Specifically:

After preprocessing, the content words of text A are "Xiaoming/you/buddy/yell/you/go/playground/basketball/after/by the way/together/dinner", and the content words of text B are "Xiaoming/you/buddy/yell/you/go/playground/football/after/together/dinner".
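The synonym replacement and stop-word removal stages of preprocessing can be sketched as follows. This is only a minimal illustration: it assumes the text has already been segmented into tokens (the source does not name the segmentation package), and the function name `preprocess`, the synonym mapping and the stop-word list are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def preprocess(tokens: Iterable[str],
               synonyms: Dict[str, str],
               stop_words: Set[str]) -> List[str]:
    """Replace each token by its canonical synonym, then drop stop words."""
    result = []
    for token in tokens:
        canonical = synonyms.get(token, token)  # unchanged if no synonym is listed
        if canonical not in stop_words:
            result.append(canonical)
    return result


# Hypothetical libraries, not taken from the source.
SYNONYMS = {"calls": "yells"}
STOP_WORDS = {"to", "the", ",", "!"}

tokens = ["Xiaoming", ",", "your", "buddy", "calls", "you",
          "to", "go", "to", "the", "playground", "!"]
content_words = preprocess(tokens, SYNONYMS, STOP_WORDS)
```

Mapping synonyms before filtering means a stop-word list only needs to contain canonical forms.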

Calculating TF-IDF values requires a text set as a reference. Here, 100 local modern novels are used as the text set for calculating the TF-IDF values of the content words of text A and text B. Simhash fingerprints are then generated from these TF-IDF values using a 128-bit Simhash algorithm. The Simhash fingerprint generated from the content words of text A is:

01011110111100111000010001111011011000100100111110111011000011010100100100110110000101001011100011010110100110010101100110111101

The Simhash fingerprint generated from the content words of text B is:

01011010101000011100101110111010101010001101100111111111101011111100110001110111000111011000000011110100110101011110101000111110

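The Hamming distance between the two fingerprints is h₁ = 48. With a fingerprint length of l₁ = 128, the formula I(A, B) = (l₁ - h₁) / l₁ × 100% gives the similarity of text A and text B:

I(A, B) = (128 - 48) / 128 × 100% = 62.5%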
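The comparison of two fingerprints can be sketched in a few lines. The sketch assumes the similarity is (l - h) / l, which reproduces the 62.5% reported for l = 128 and h = 48; the function names are illustrative, and fingerprints are represented as strings of '0' and '1' characters.

```python
def hamming_distance(fp_a: str, fp_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(fp_a, fp_b))


def similarity_from_distance(length: int, distance: int) -> float:
    """Assumed formula: the fraction of fingerprint bits that agree."""
    return (length - distance) / length


def fingerprint_similarity(fp_a: str, fp_b: str) -> float:
    """Similarity of two fingerprints, as a value between 0 and 1."""
    return similarity_from_distance(len(fp_a), hamming_distance(fp_a, fp_b))


# Values from the worked example: 128-bit fingerprints, Hamming distance 48.
example = similarity_from_distance(128, 48)
```

The 128-character fingerprints listed in the text can be passed to `fingerprint_similarity` directly.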

Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₂ from the 2-Gram set of text A and of text B respectively, and calculating the Hamming distance h₂ between the two fingerprints; calculating the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm from the Hamming distance h₂ and the fingerprint length l₂.

Specifically:

Applying the N-Gram language model to the preprocessed content words gives the 2-Gram sets of text A and text B, which are respectively "Xiaoming/your buddy/buddy yell/you go/go to playground/playground basketball/after/by the way/together with the evening/" and "Xiaoming/your buddy/yell/you go/playground/football/after/again/together/after/together with the evening/".
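Forming the 2-Gram set amounts to joining each pair of adjacent preprocessed words into one compound word. A minimal sketch follows; the function name `bigrams` and the space used as a joiner are assumptions made for readability in English.

```python
from typing import List, Sequence


def bigrams(words: Sequence[str]) -> List[str]:
    """Join every pair of adjacent words into a single compound word."""
    return [f"{words[i]} {words[i + 1]}" for i in range(len(words) - 1)]


# The first few preprocessed words of the example texts.
compound_words = bigrams(["Xiaoming", "you", "buddy", "yell"])
```

A list is returned rather than a set so that repeated compound words remain available for term-frequency counting.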

Similarly, 100 local modern novels are used as the text set for calculating the TF-IDF values of the 2-Gram sets of text A and text B, and Simhash fingerprints are generated from these TF-IDF values using a 128-bit Simhash algorithm. The Simhash fingerprint generated from the 2-Gram set of text A is:

00101111011011010011110100010111110010100110010000110010011010110001001010110011111010100001010001001101110110011100000111101100

The Simhash fingerprint generated from the 2-Gram set of text B is:

10100111011010111001110100010111110000100110010001001011010001111101001010110011101111110101010011001101110010011100010111001100

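The Hamming distance between the two fingerprints is h₂ = 25. With a fingerprint length of l₂ = 128, the formula J(A, B) = (l₂ - h₂) / l₂ × 100% gives the similarity of text A and text B:

J(A, B) = (128 - 25) / 128 × 100% = 80.47%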
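The weighted fingerprint generation used in Step2 and Step3 can be sketched as below. The source does not say which hash function maps each word to a bit pattern, so BLAKE2b from Python's `hashlib` is used here purely as a stand-in; the function name `simhash` and the example weights are likewise hypothetical.

```python
import hashlib
from typing import Dict


def simhash(weights: Dict[str, float], bits: int = 128) -> str:
    """Return a Simhash fingerprint as a string of '0' and '1' characters."""
    if bits % 8 != 0 or not 8 <= bits <= 512:
        raise ValueError("bits must be a multiple of 8 between 8 and 512")
    totals = [0.0] * bits
    for feature, weight in weights.items():
        # Assumption: BLAKE2b stands in for the unnamed per-feature hash.
        digest = hashlib.blake2b(feature.encode("utf-8"),
                                 digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for position in range(bits):
            bit = (value >> (bits - 1 - position)) & 1
            # A 1 bit adds the feature's weight, a 0 bit subtracts it.
            totals[position] += weight if bit else -weight
    # Positive totals become 1, all other totals become 0.
    return "".join("1" if total > 0 else "0" for total in totals)


# Hypothetical TF-IDF weights, not values from the source.
fingerprint = simhash({"your buddy": 0.8,
                       "go playground": 0.5,
                       "playground basketball": 1.2})
```

Because each bit is decided by a weighted vote, compound words with larger TF-IDF values have more influence on the fingerprint.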

Step4, finding the longest common substring of text A and text B; calculating the similarity Z(A, B) of text A and text B based on the longest common substring from the length l₃ of the longest common substring, the length l_A of text A and the length l_B of text B. Specifically:

The longest common substring of text A and text B is found to be "Xiaoming, your buddy yells for you to go to the playground".

Calculating the similarity of text A and text B gives Z(A, B) = 52.17%.
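Finding the longest common substring is a standard dynamic-programming problem, sketched below. The formula that turns l₃, l_A and l_B into Z(A, B) is not reproduced in this text, so the sketch stops at the substring itself; the function name is illustrative.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears contiguously in both strings."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match within a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


common = longest_common_substring("xabcdy", "zabcdw")
l3 = len(common)
```

Only two rows of the table are kept at a time, so memory use grows with the length of one text rather than with the product of both lengths.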

Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where the weights satisfy i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:

Assuming that the weights corresponding to the similarities I(A, B), J(A, B) and Z(A, B) are i = 0.3, j = 0.6 and z = 0.1 respectively, the final similarity of text A and text B is calculated by the formula R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z:

R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z

= 62.5% × 0.3 + 80.47% × 0.6 + 52.17% × 0.1

= 72.24%
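The weighted combination can be checked with a short sketch. The function name `final_similarity` is illustrative; the inputs are the rounded percentages and weights from the example. With these rounded inputs the expression evaluates to about 0.7225, which the text reports as 72.24%.

```python
import math


def final_similarity(sim_i: float, sim_j: float, sim_z: float,
                     w_i: float, w_j: float, w_z: float) -> float:
    """Weighted sum of the three similarities; the weights must sum to 1."""
    if not math.isclose(w_i + w_j + w_z, 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return sim_i * w_i + sim_j * w_j + sim_z * w_z


# Similarities and weights from the worked example, as fractions.
r = final_similarity(0.625, 0.8047, 0.5217, 0.3, 0.6, 0.1)
```

Checking the weight sum with a tolerance avoids rejecting valid weights such as 0.3, 0.6 and 0.1, whose floating-point sum is not exactly 1.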

The above result shows that the final similarity is 72.24%, an improvement over the 62.5% obtained with the conventional Simhash algorithm; the improvement is especially relevant for short texts (fewer than 200 words). Because the text set used for calculating TF-IDF values strongly influences the final result, in practical applications the text set should be as rich in content and as broad in type as possible in order to improve detection accuracy. The values of the weights i, j and z corresponding to the similarities I(A, B), J(A, B) and Z(A, B) should be determined through repeated testing on different types of text, with appropriate adjustment.
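Since the TF-IDF weights depend on the reference text set, it may help to see how they could be computed. The sketch below is one common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1; the source does not state which TF-IDF variant is used, so the formula, the function name `tf_idf` and the tiny reference set are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(document: Sequence[str],
           corpus: List[Sequence[str]]) -> Dict[str, float]:
    """Weight each term of a document against a reference text set."""
    corpus_sets = [set(text) for text in corpus]
    counts = Counter(document)
    weights = {}
    for term, count in counts.items():
        tf = count / len(document)
        # Number of reference texts that contain the term.
        df = sum(1 for words in corpus_sets if term in words)
        # Assumed smoothed IDF; it stays positive even for very common terms.
        idf = math.log((1 + len(corpus_sets)) / (1 + df)) + 1.0
        weights[term] = tf * idf
    return weights


# Hypothetical three-text reference set standing in for the 100 novels.
reference = [["you", "go", "playground"], ["you", "buddy"], ["you", "dinner"]]
weights = tf_idf(["you", "playground", "playground"], reference)
```

A term that occurs in every reference text, such as "you" here, receives a lower weight than a rarer term, which is why a broader text set changes the resulting fingerprints.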

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A text similarity detection method, characterized in that the method comprises the following specific steps:

Step1, inputting text A and text B;

Step2, preprocessing text A and text B to obtain their content words; calculating the TF-IDF value of each content word of text A and text B as the weight of that content word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₁ from the content words of text A and of text B respectively, and calculating the Hamming distance h₁ between the two fingerprints; calculating the similarity I(A, B) of text A and text B based on the Simhash algorithm from the Hamming distance h₁ and the fingerprint length l₁;

Step3, preprocessing text A and text B to obtain their content words; obtaining the 2-Gram sets of text A and text B using an N-Gram language model; calculating the TF-IDF value of each compound word in the 2-Gram sets as the weight of that compound word; generating, by the Simhash algorithm and according to the weights, a fingerprint of length l₂ from the 2-Gram set of text A and of text B respectively, and calculating the Hamming distance h₂ between the two fingerprints; calculating the similarity J(A, B) of text A and text B based on the N-Gram language model and the Simhash algorithm from the Hamming distance h₂ and the fingerprint length l₂;

Step5, setting the weights corresponding to the similarities calculated in Step2, Step3 and Step4 to i, j and z respectively, where the weights satisfy i + j + z = 1, and calculating the final similarity of text A and text B as R(A, B) = I(A, B) × i + J(A, B) × j + Z(A, B) × z;

in Step2, the formula for calculating the similarity I(A, B) of text A and text B is:

I(A, B) = (l₁ - h₁) / l₁ × 100%

2. The text similarity detection method according to claim 1, characterized in that: in Step1, the input text A and text B are short texts.

3. The text similarity detection method according to claim 1, characterized in that: the preprocessing of text A and text B in Step2 and Step3 comprises word segmentation, synonym replacement and stop-word removal, performed using a word segmentation package, a synonym library and a stop-word library respectively.