CN109948125B - Method and system for improved Simhash algorithm in text deduplication - Google Patents


Info

Publication number
CN109948125B
CN109948125B (application CN201910225442.9A)
Authority
CN
China
Prior art keywords
weight
information
word
simhash
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910225442.9A
Other languages
Chinese (zh)
Other versions
CN109948125A (en)
Inventor
张仕斌
张航
盛志伟
万国根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haikou Lingjie Information Technology Co.,Ltd.
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN201910225442.9A priority Critical patent/CN109948125B/en
Publication of CN109948125A publication Critical patent/CN109948125A/en
Application granted granted Critical
Publication of CN109948125B publication Critical patent/CN109948125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information processing, and discloses a method and a system for an improved Simhash algorithm in text deduplication. Weights are obtained by weighting based on the TF-IDF algorithm and information entropy, the feature words are ordered according to their distribution in the document, and the hash generated for each feature word is XORed with the word's position. After the improved weight calculation, a weight threshold W_t is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information, and the relevance between the fingerprint information and the weights is analyzed. Simulation experiments show that the optimized weight calculation of the invention effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm is superior to the traditional Simhash algorithm in deduplication rate, recall rate, F value and other aspects, and achieves good results in text deduplication.

Description

Method and system for improved Simhash algorithm in text deduplication
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a method and a system for an improved Simhash algorithm in text deduplication.
Background
Currently, the current state of the art commonly used in the industry is such that:
in terms of removing redundant data, the Simhash algorithm is currently recognized as the best deduplication algorithm. It is a locality-sensitive hash algorithm that performs probabilistic dimensionality reduction on high-dimensional data, mapping it into fixed, low-bit-count fingerprints whose similarity can then be compared to reflect the degree of similarity between the data. Similarity comparison typically uses the Hamming distance or the edit distance. The Simhash algorithm has the advantages of high processing speed and high result accuracy.
Nowadays, the Simhash algorithm is widely applied in fields such as near-duplicate text detection, redundant data deduplication and anomaly detection. Zhengqinghua et al. proposed a fingerprint algorithm based on multiple Simhash values, performing similarity calculation on multiple fingerprint values through a k-dimensional polytope, effectively alleviating the problems of a single fingerprint and serious information loss. Chen, Pan et al. added a reduction operation to the Simhash algorithm, subtracting a threshold T from the finally merged result sequence string, improving the accuracy of the Simhash algorithm. Ni S, Qian Q et al. combined the Simhash algorithm with a CNN for malware detection, improving malware recognition rate and performance by converting Simhash fingerprints into grayscale images.
In summary, the problems of the prior art are as follows:
(1) Existing data processing methods are fast but insufficiently accurate.
(2) The existing Simhash algorithm is deficient in weight calculation, so the proportion of key feature items is not reflected in the generated hash fingerprint.
(3) The prior art fails to embody the distribution information of the document's feature words.
The difficulty of solving the technical problems is as follows:
in order to improve the text deduplication effect and accuracy of the Simhash algorithm, and to remedy its inability to embody distribution information, the concept of information entropy is introduced: keywords in a document are weighted by entropy weighting, the weight calculation formula is optimized, and keyword distribution information is added to the hash calculation, thereby optimizing the traditional Simhash algorithm. Finally, the feasibility and reasonableness of the algorithm are demonstrated through simulation experiments.
The significance of solving the technical problems is as follows:
the algorithm introduces TF-IDF and information entropy, and adds text distribution information by optimizing the weight and threshold calculation in the Simhash algorithm, so that the finally generated fingerprint reflects the proportion of key information; the relevance between the fingerprint information and the weights is also analyzed. Simulation experiments show that the optimized weight calculation effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm is superior to the traditional Simhash algorithm in deduplication rate, recall rate, F value and other aspects, achieving good results in text deduplication.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an improved method and system for a Simhash algorithm in text deduplication.
The invention is realized in such a way that a method for an improved Simhash algorithm in text deduplication comprises the following steps: weights are obtained by weighting based on the TF-IDF algorithm and information entropy, the feature words are ordered according to their distribution in the document, and the hash generated for each feature word is XORed with the word's position;
after the improved weight calculation, a weight threshold W_t is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information, and the relevance between the fingerprint information and the weights is analyzed.
Further, the method of the improved Simhash algorithm in text deduplication specifically comprises the following steps:
step 1, initialization:
determining the Simhash bit count and the f-dimensional vector space according to the size of the data set and the storage cost, and initializing the f-bit binary number s to 0;
step 2, document preprocessing:
performing word segmentation and stop-word removal on the document to form a plurality of feature word items M{ };
and step 3, weight calculation:
calculating the TF-IDF value and the left and right information entropy of each feature item after word segmentation, using the square mean of the TF-IDF value and the entropy factor as the final weight of the feature item, and introducing a threshold W_t to prevent document feature distortion;
step 4, hash calculation:
performing hash calculation on the feature items from step 2, and introducing an exclusive-or operation between a position factor and the hash as the final hash value of each feature item, so that the hash value contains the position information of the feature item; the result is recorded as H{ };
and step 5, accumulation: combining the weight of each feature item generated in step 3 with the f-bit hash value generated in step 4:

V_i = Σ_k w_k · (2a_{ki} − 1), where a_{ki} is the i-th bit of the hash of the k-th feature item (a set bit adds the weight, a clear bit subtracts it);
step 6, compression transformation:
converting each bit of the finally generated secondary fingerprint vector V to produce the f-bit hash fingerprint S of the document:

S_i = 1 if V_i > 0, otherwise S_i = 0;
Further, n keywords are extracted from the document, denoted {p_1, p_2, p_3, …, p_n}, with weights W = {w_1, w_2, w_3, …, w_n}; hash values are generated for the n keywords, giving H = {h_1, h_2, h_3, …, h_n}; after superposition a secondary fingerprint F = {f_1, f_2, f_3, …, f_m} is generated, where m is the number of fingerprint bits, and the Simhash fingerprint S is finally obtained from whether each f_i in F is greater than 0;
if there exists some feature word p_k whose weight satisfies
w_k ≫ w_j, j ∈ [1, n] ∩ j ≠ k,
then S is determined by p_k.
Further, after introducing the weight threshold, the weight is calculated as follows:
Figure BDA0002005025960000041
further, the information entropy is:
H(X) = −Σ_{x_i∈X} P(x_i) log₂ P(x_i)
where X represents the information probability space X = {x_1: P(x_1), x_2: P(x_2), …, x_n: P(x_n)}, and H(X) represents a measure of the uncertainty of the random variable X.
Further, the information entropy includes left and right information entropy, and the formula is as follows:
E_L(W) = −Σ_{a∈A} P(aW|W) log₂ P(aW|W)
E_R(W) = −Σ_{b∈B} P(Wb|W) log₂ P(Wb|W)
where W represents a word, E_L(W) is the left entropy of the word, P(aW|W) is the probability of word a appearing immediately to the left of W (a ranging over the set A of words observed combined with W on the left), and E_R(W) is the right entropy, defined symmetrically over the right neighbours b.
Further, the entropy weighting calculation method includes:
H_k(w) = (E_L(W) + E_R(W)) / 2
that is, the left and right information entropies of the feature word are averaged, with H_k(w) representing the entropy information amount of the word; the entropy factor H_k is added into the weight calculation formula, and the square mean of the two is taken as the word weight:

w_k = sqrt( (tfidf(t_k, d_j)² + H_k²) / 2 )

The more often feature word t_k appears in document d_j and the fewer documents of the training set it appears in, the larger its information amount and the higher its weight.
Another object of the present invention is to provide a control system implementing the described method of the improved Simhash algorithm in text deduplication.
It is another object of the present invention to provide a computer storage medium implementing the described method of the improved Simhash algorithm in text deduplication.
In summary, the advantages and positive effects of the invention are:
the method has a good effect on removing redundant data. The Simhash algorithm is a locality-sensitive hash algorithm that performs probabilistic dimensionality reduction on high-dimensional data, mapping it into fixed, low-bit-count fingerprints whose similarity is then compared to reflect the degree of similarity between the data; similarity comparison typically uses the Hamming distance or the edit distance. The Simhash algorithm has the advantages of high processing speed and high result accuracy.
Aiming at the defects of the traditional Simhash algorithm in weight calculation, and at its failure to consider the distribution information of the document's feature words, the invention optimizes the weight calculation: the square mean of TF-IDF and information entropy is used as the weight of a feature word; considering the information distortion caused by overly large individual weights, a weight threshold is introduced; and on this basis, the position information of the feature word is introduced into the hash calculation. This improves the deduplication rate and precision of the Simhash algorithm, and simulation experiments prove that the E-Simhash algorithm is superior to the traditional Simhash algorithm in all respects.
In simulation experiments, whole-network news data from Sogou Lab is used as the document set. In the deduplication-rate comparison experiment the result is shown in fig. 4. As the similarity threshold T is modified, the deduplication rates of the E-Simhash algorithm versus the traditional algorithm are 0.833:0.679, 0.751:0.529, 0.687:0.476 and 0.661:0.451 respectively; the E-Simhash algorithm is superior to the traditional Simhash algorithm, and both show a decreasing trend as article modification increases, the simulation result being shown in fig. 5. Finally, in the comparison of precision, recall and F value shown in fig. 6, the E-Simhash algorithm is superior to the traditional Simhash algorithm with precision 0.963:0.818, recall 0.867:0.621 and F1 0.912:0.706. The simulation results show that the optimized weight calculation effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm is superior to the traditional Simhash algorithm in deduplication rate, recall rate, F value and other aspects, achieving good results in text deduplication.
Drawings
Fig. 1 is a flowchart of a Simhash algorithm according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating the effect of location on Simhash according to an embodiment of the present invention.
Fig. 3 is a process diagram of the E-Simhash algorithm according to the embodiment of the present invention.
Fig. 4 is a graph comparing the repetition rates at different hamming distances according to the embodiment of the present invention.
Fig. 5 is a graph comparing different threshold deduplication ratios provided by embodiments of the present invention.
FIG. 6 is a graph comparing overall performance provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the prior art, data processing methods have high processing speed but low result accuracy.
To solve the above problems, the present invention will be described in detail with reference to specific analysis.
The method for the improved Simhash algorithm in text deduplication provided by the embodiment of the invention comprises the following steps: weights are obtained by weighting based on the TF-IDF algorithm and information entropy, the feature words are ordered according to their distribution in the document, and the hash generated for each feature word is XORed with the word's position;
after the improved weight calculation, a weight threshold W_t is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information, and the relevance between the fingerprint information and the weights is analyzed.
The invention is further described below in conjunction with the definition analysis.
1. Analysis of Simhash algorithm
The principle of defining a SimHash algorithm is that for two given variables x, y, the hash function h always satisfies the following equation:
Prh∈F(h(x)=h(y))=sim(x,y) (1)
where sim (x, y) is an affinity function, and generally, the affinity of variables x, y is also expressed by a jacobian function, and sim (x, y) is expressed as follows:
Figure BDA0002005025960000061
h belongs to a hash function cluster F, and the following conditions need to be satisfied:
1) if d (x, y) ≦ d1Then Prh∈F(h(x)=h(y))≥p1
2) If d (x, y) ≧ d2Then Prh∈F(h(x)=h(y))≤p2
Is called F as (d)1,d2,p1,p1) The sensitive hash cluster function of (1). Where d (x, y) represents the distance between the variables x, y, colloquially, if x, y are similar enough, then the probability that they map to the same hash function is sufficiently large, otherwise the probability that the hash values are equal is sufficiently small.
The biggest difference between a traditional hash function and the Simhash function is local sensitivity: if slight local modifications are made to the input data, a traditional hash function produces a completely different result, whereas the Simhash results remain very similar; the similarity of the fingerprints generated by the Simhash function can therefore express the degree of similarity between the source data.
2. Simhash algorithm flow:
the Simhash algorithm process first defines an f-dimensional space, then defines a vector corresponding to each feature in that space, and then combines all vectors with their weights in a weighted summation to obtain a sum vector as the result. Finally, the result is compressed and converted by the following rule: an f-bit signature is obtained in which, if the value of a vector dimension is greater than 0, the corresponding signature position is set to 1, and otherwise to 0. Through this conversion, the obtained signature retains the sign information of the vector in each dimension.
The Simhash algorithm flow chart is shown in fig. 1. The Simhash algorithm comprises the following specific steps:
step 1: initialization
Determine the Simhash bit count and the f-dimensional vector space according to the size of the data set and the storage cost, and initialize the f-bit binary number s to 0.
Step 2: document pre-processing
This step mainly comprises two parts. The first is word segmentation: finding the feature words of the document, removing stop words, and so on. The second is weight assignment; the traditional algorithm usually simplifies this by setting every weight to 1.
And step 3: generating a hash value
Calculate the f-bit hash value of each feature word from step 2 using a traditional hash algorithm, then accumulate the weighted bits into the vector V:

V_i = Σ_k w_k · (2a_{ki} − 1), where a_{ki} is the i-th bit of the hash of the k-th feature word (a set bit adds the weight w_k, a clear bit subtracts it).
Step 4: compression transformation:
For the finally generated vector V, convert each bit:

S_i = 1 if V_i > 0, otherwise S_i = 0.
Step 5: fingerprint generation: output the final signature S as the fingerprint of the document; similarity is then calculated with the Hamming distance or the edit distance.
Step 6: distance calculation: similarity calculation in the Simhash algorithm uses the Hamming distance, which measures the similarity between two documents by counting the fingerprint bit positions in which they differ. The greater the Hamming distance, the lower the similarity of the two strings; conversely, the smaller the distance, the higher the similarity. For binary strings, the Hamming distance of two values can be calculated with an XOR operation.
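The six steps above can be sketched in Python as follows. The per-token hash (md5) and the 64-bit fingerprint length are illustrative assumptions, since the flow only requires "a traditional hash algorithm" and an f-bit space:

```python
import hashlib

def simhash(tokens, weights=None, f=64):
    # Steps 3-5 above: accumulate +w / -w per hash bit, then compress by sign.
    v = [0] * f
    for tok in tokens:
        w = (weights or {}).get(tok, 1)  # traditional variant: all weights 1
        # f-bit per-token hash; md5 is an arbitrary stand-in for a stable hash
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += w if (h >> i) & 1 else -w
    # Compression transform: bit i of the fingerprint is 1 iff v[i] > 0.
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Note that this classic form is insensitive to token order: `simhash(["a", "b"])` equals `simhash(["b", "a"])`, which is precisely the weakness the position factor of the improved algorithm addresses.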
The present invention will be further described with reference to the following examples.
Example 1:
let a and b be two binary numbers, with a = 00110 and b = 01110. The two numbers differ only in the second bit, so Hamming(a, b) = 1. Equivalently, the number of 1s in the XOR result can be counted:
a ⊕ b = 00110 ⊕ 01110 = 01000
There is exactly one 1, so the Hamming distance is 1.
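Example 1 can be reproduced directly; a minimal sketch of the XOR-based distance:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR sets a bit exactly where the two operands differ; count the 1s
    return bin(a ^ b).count("1")

# a = 00110, b = 01110 differ only in one bit position
assert hamming_distance(0b00110, 0b01110) == 1
```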
In weight calculation the traditional Simhash algorithm usually sets the weight to 1 or to the occurrence count of the feature word, which easily loses information and reduces the accuracy of the final Simhash fingerprint. Moreover, the Simhash algorithm does not reflect word distribution information: rearranging key feature words does not affect the finally generated Simhash fingerprint.
As shown in fig. 2, when the positions of two keywords are swapped the meanings of the sentences may differ, but the fingerprints generated by the traditional Simhash algorithm are identical.
Example 2
In order to improve the text deduplication effect and accuracy of the Simhash algorithm and solve the defect that the Simhash algorithm cannot embody distribution information, the Simhash algorithm (E-Simhash for short) based on information entropy weighting is provided. The algorithm introduces TF-IDF and information entropy, text distribution information is increased by optimizing weight and threshold calculation in the Simhash algorithm, so that finally generated fingerprints can reflect the proportion of key information, and the relevance of fingerprint information and weight is analyzed.
Simulation experiments show that: the performance of the Simhash algorithm can be effectively improved by optimizing the weight calculation, the E-Simhash algorithm is superior to the traditional Simhash algorithm in the aspects of the duplication removal rate, the recall rate, the F value and the like, and a good effect is achieved in the aspect of text duplication removal.
In the present invention, (1) the word frequency-inverse file frequency includes:
the term frequency-inverse document frequency (TF-IDF) algorithm is a commonly used text feature weight calculation method. The TF-IDF value of feature word t_k in document d_j is denoted tfidf(t_k, d_j) and defined as follows.
Definition 2: the term frequency tf(t_k, d_j) of feature word t_k in document d_j is
tf(t_k, d_j) = n_{j,k} / Σ_i n_{j,i} (3)
where n_{j,k} is the number of occurrences of feature word t_k in document d_j, and Σ_i n_{j,i} is the total number of feature words in document d_j.
Definition 3: the inverse document frequency idf(t_k) is a coefficient weighing the importance of the feature word, defined as
idf(t_k) = log( |D| / |{j : t_k ∈ d_j}| ) (4)
where |{j : t_k ∈ d_j}| is the number of documents containing feature word t_k, and |D| is the total number of documents in the corpus.
Definition 4: the TF-IDF function, i.e. the term-frequency weight of the feature word, is defined as
w_k = tfidf(t_k, d_j) = tf(t_k, d_j) * idf(t_k) (5)
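The three definitions above translate directly into code; a minimal sketch over a corpus of token lists (the unsmoothed idf of definition 3 assumes the term occurs in at least one document; common variants add 1 to the denominator):

```python
import math

def tf(term, doc):
    # Definition 2: tf = n_{j,k} / sum_i n_{j,i}
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Definition 3: idf = log(|D| / |{j : t_k in d_j}|)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    # Definition 4: w_k = tf * idf
    return tf(term, doc) * idf(term, corpus)
```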
in the present invention, (2) the information entropy includes:
the entropy of information represents a measure of the uncertainty of the result before a random event occurs, and the amount of information one gets from the event after the random event occurs.
According to the definition of information entropy:
H(X) = −Σ_{x_i∈X} P(x_i) log₂ P(x_i) (6)
where X represents the information probability space X = {x_1: P(x_1), x_2: P(x_2), …, x_n: P(x_n)}, and H(X) represents a measure of the uncertainty of the random variable X.
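The entropy of equation (6) can be estimated from observed frequencies; a minimal sketch:

```python
import math
from collections import Counter

def entropy(samples):
    # H(X) = -sum P(x_i) log2 P(x_i), probabilities estimated by frequency
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())
```

A degenerate distribution has zero entropy, while a uniform two-outcome distribution carries exactly one bit.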
In the invention, (3) left and right information entropy
Left-right entropy refers to the entropy of the left boundary and the entropy of the right boundary of a multi-word expression. The formula for the left and right entropy is as follows:
E_L(W) = −Σ_{a∈A} P(aW|W) log₂ P(aW|W) (7)
E_R(W) = −Σ_{b∈B} P(Wb|W) log₂ P(Wb|W) (8)
where W represents a word, E_L(W) is the left entropy of the word, P(aW|W) is the probability of word a appearing immediately to the left of W (a ranging over the set A of words observed combined with W), and E_R(W) is the right entropy, defined symmetrically over the right neighbours b.
In the present invention, (4) the entropy weighting calculation method includes:
the invention adopts an entropy weighting calculation method
H_k(w) = (E_L(W) + E_R(W)) / 2 (9)
that is, the left and right information entropies of the feature word are averaged, with H_k(w) representing the entropy information amount of the word. The entropy factor H_k is added into the weight calculation formula, and the square mean of the two is taken as the word weight:

w_k = sqrt( (tfidf(t_k, d_j)² + H_k²) / 2 ) (10)

The physical meaning of the above formula is: the more often feature word t_k appears in document d_j and the fewer documents of the training set it appears in, the larger its information amount and the higher its weight.
In the present invention, the Simhash algorithm (E-Simhash) based on entropy weighting specifically includes:
weights are obtained by weighting with information entropy on the basis of the TF-IDF algorithm, the feature words are ordered according to their distribution in the document, and the hash generated for each feature word is XORed with its position.
However, after the improved weight calculation, factors such as an incomplete training set can make some feature-word weights too large, which ultimately reduces precision; a weight threshold W_t is therefore introduced. The following demonstrates the problem caused by uneven weights.
Let the n keywords extracted from a document be {p_1, p_2, p_3, …, p_n}, with weights W = {w_1, w_2, w_3, …, w_n}. Hash values are generated for the n keywords, giving H = {h_1, h_2, h_3, …, h_n}; after superposition a secondary fingerprint F = {f_1, f_2, f_3, …, f_m} is generated, where m is the number of fingerprint bits, and the Simhash fingerprint S is finally obtained from whether each f_i in F is greater than 0.
If there exists some feature word p_k whose weight satisfies
w_k ≫ w_j, j ∈ [1, n] ∩ j ≠ k (11)
then S is mainly determined by p_k. Proof:
Let h_i = {a_{i1}, a_{i2}, a_{i3}, …, a_{im}}, where a_{ij} is a binary variable. Then
f_j = Σ_{i=1}^{n} w_i (2a_{ij} − 1) (12)
Extracting w_k gives
f_j = w_k [ (2a_{kj} − 1) + Σ_{i≠k} (w_i / w_k)(2a_{ij} − 1) ] (13)
Since w_k ≫ w_j for all j ≠ k, each ratio w_i / w_k ≈ 0, so
Σ_{i≠k} (w_i / w_k)(2a_{ij} − 1) ≈ 0 (14)
and therefore
f_j ≈ w_k (2a_{kj} − 1) (15)
so F is mainly correlated with p_k, which completes the proof.
The above proof also reflects the effect of the weights on the Simhash fingerprint.
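The dominance effect shown by the proof can be checked numerically; in this small sketch (fabricated 8-bit hashes and weights, for illustration only) one weight dwarfs the rest and the fingerprint collapses to the dominant keyword's hash bits:

```python
# f_j = sum_k w_k * (2*a_kj - 1); with w_1 >> the others, sign(f_j)
# follows the first keyword's hash bits alone.
weights = [1000, 1, 1, 1]
hashes = [
    [1, 0, 1, 1, 0, 0, 1, 0],  # dominant keyword p_k
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 1],
]
f = [sum(w * (2 * row[j] - 1) for w, row in zip(weights, hashes))
     for j in range(8)]
fingerprint = [1 if x > 0 else 0 for x in f]
# fingerprint reproduces the dominant keyword's hash bits exactly
```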
After the weight threshold is introduced, the weight is calculated as shown in equation (16), clipping each weight at W_t:
w_k = w_k if w_k ≤ W_t, otherwise w_k = W_t (16)
in summary, the flow of the E-Simhash algorithm is shown in FIG. 3.
The E-Simhash algorithm differs from the traditional Simhash algorithm in three points: information entropy is introduced on the basis of TF-IDF to calculate the weights of the feature words, with the square mean of the two taken as the final weight; to avoid fingerprint distortion caused by excessive weights, a weight threshold is introduced, computed as in equation (16); and the hash of each feature word is XORed with the word's position when it is generated, so that the hash contains the positional distribution information of the document.
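The three differences described above can be combined into one sketch. The md5 hash, the 64-bit length, and the exact encoding of the position factor are assumptions, since the text states only that the hash is XORed with the feature word's position:

```python
import hashlib

def e_simhash(tokens, weights, w_t, f=64):
    """Sketch of the E-Simhash flow: clip each weight at the threshold W_t
    and XOR the token hash with a position factor before accumulation."""
    v = [0] * f
    for pos, tok in enumerate(tokens, start=1):
        w = min(weights.get(tok, 1), w_t)  # weight threshold clipping
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        h ^= pos  # fold position information into the hash
        for i in range(f):
            v[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Because the position factor enters the hash, swapping token order generally changes the fingerprint, unlike the classic algorithm; and any weight above `w_t` behaves exactly as if it were `w_t`.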
The present invention is further described below with reference to specific experiments and simulations.
Simulation experiment and analysis: the invention mainly simulates a real application scene and verifies whether the performance of the E-Simhash algorithm is superior to that of the traditional Simhash algorithm.
Experimental environment and data set:
the experimental environment was deployed on a desktop computer with the following machine parameters:
table 1 experimental environmental parameters
The data set comes from the 2012 whole-network news data of Sogou Lab, comprising classified news from nearly 20 columns across multiple news sites; items shorter than 800 characters were removed, and 1565 news articles were randomly selected for the subsequent experiments.
First, a number of articles are randomly selected from the 1565 news articles according to a modification proportion and subjected to random operations such as modification, deletion, displacement and replacement, controlling the similarity between the modified and original articles to a threshold T, thereby generating the sample set to be tested; the set is then processed with both the traditional Simhash algorithm and the algorithm of this patent, and the relevant experimental indexes are counted.
Analysis of Experimental results
In the experimental results, four indexes are commonly used for evaluation: deduplication rate, precision, recall and F value. The deduplication rate is the ratio of correctly classified samples to the total number of samples; in this experiment, it is the ratio of articles predicted as homologous to the total number of articles.
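The precision, recall and F value used in the experiments follow the standard definitions from true/false positive and false negative counts; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    # precision = TP/(TP+FP); recall = TP/(TP+FN); F1 = their harmonic mean
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a sanity check against the reported figures, precision 0.963 and recall 0.867 give F1 = 2·0.963·0.867/(0.963+0.867) ≈ 0.912, matching the F1 value quoted in the experiments.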
The invention is further described below in connection with specific experiments.
Experiment 1: deduplication rate comparison:
Randomly select 1162 of the 1565 news articles and modify them randomly, then compare the accuracy of the two algorithms at different Hamming distances. In this test T = 15%, i.e. each article retains no more than 15% modification; the fingerprint length is 128 bits and the word weight threshold W_t = 90. The results are shown in fig. 4 below.
The experimental results show that when the Hamming distance is greater than 2, the E-Simhash algorithm has the higher deduplication rate. In practical applications the Hamming distance is generally about 10, so the deduplication effect of the E-Simhash algorithm is better.
Experiment 2: modified-threshold T comparison:
in this experiment the text modification threshold T is varied over 5%, 10%, 15% and 20% with the Hamming distance fixed at 10 (i.e. two articles are treated as similar when their distance is below 10), and the deduplication rates of the two algorithms are compared. As shown in fig. 5, the deduplication rates of the E-Simhash algorithm versus the traditional Simhash algorithm are 0.833:0.679, 0.751:0.529, 0.687:0.476 and 0.661:0.451 respectively, and both show a decreasing trend as article modification increases. The experimental results show that under different modification thresholds T the E-Simhash algorithm is superior to the traditional Simhash algorithm.
Experiment three: precision, recall and F-value comparison:
in this experiment an article is randomly selected from the news set and randomly modified while guaranteeing 90% similarity with the original article, and the precision, recall and F1 value of the Simhash-fingerprint-based algorithm and the E-Simhash algorithm are compared, with the Hamming distance set to 10. The experiment was performed 100 times and the averages taken as the final result, shown in fig. 6. According to the experimental data, the E-Simhash algorithm is superior to the traditional Simhash algorithm with precision 0.963:0.818, recall 0.867:0.621 and F1 0.912:0.706. The results show that the precision, recall and F value of the E-Simhash algorithm are greatly improved over the ordinary Simhash algorithm, proving its superiority.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for text deduplication based on a modified Simhash algorithm, characterized in that the method comprises: weighting based on the TF-IDF algorithm and information entropy to obtain weights, sorting according to the distribution in the document, and XORing the hash generated by each feature word with the position of that feature word;
after the improved weight calculation, a weight threshold Wt is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information, and the relevance between the fingerprint information and the weight is analyzed;
the method for text deduplication based on the improved Simhash algorithm specifically comprises the following steps:
step 1, initialization:
determining the size of a data set and the storage cost of a Simhash digit and an f-dimensional vector space, and initializing f-digit binary numbers S to be 0;
step 2, document preprocessing:
performing word segmentation and word stop removal operation on the document to form a plurality of characteristic word items M { };
and step 3, weight calculation:
calculating the TF-IDF value and the left and right information entropy of each feature item after word segmentation, using the square mean of the TF-IDF value and the entropy value as the final weight of the feature item, and introducing a threshold Wt to prevent document feature distortion;
step 4, hash calculation:
performing hash calculation on the feature items in the step 2, and introducing an exclusive or operation between a position factor and a hash as a final hash value of the feature items, wherein the hash value comprises position information of the feature items and is recorded as H { };
step 5, performing accumulation operation on the feature item weight generated in the step 3 and the f-bit hash value generated in the step 4 to obtain a secondary fingerprint vector V;
step 6, compression transformation: and (4) converting each digit of the finally generated secondary fingerprint vector V to finally generate the f-digit hash fingerprint S of the document.
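The six steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace tokens stand in for Chinese word segmentation, MD5 is an arbitrary stand-in base hash, and the threshold Wt is applied as a simple weight floor (the claim's exact threshold formula is published only as an image):

```python
import hashlib

F = 64  # fingerprint width chosen in step 1

def feature_hash(word: str, pos: int) -> int:
    # Step 4: base hash XORed with a position factor, so the final
    # fingerprint also carries distribution (position) information.
    h = int(hashlib.md5(word.encode()).hexdigest(), 16) & ((1 << F) - 1)
    return h ^ pos

def e_simhash(words, weights, w_t=0.1):
    # Step 5: signed accumulation of weighted hash bits into a
    # secondary fingerprint vector V.
    v = [0.0] * F
    for pos, word in enumerate(words):
        w = max(weights.get(word, 0.0), w_t)  # hypothetical flooring at Wt
        h = feature_hash(word, pos)
        for i in range(F):
            v[i] += w if (h >> i) & 1 else -w
    # Step 6: compress each dimension of V to one bit of the fingerprint S.
    return sum(1 << i for i in range(F) if v[i] > 0)

fp = e_simhash(["alpha", "beta", "gamma"], {"alpha": 2.0, "beta": 1.0})
```

The same word list with the same weights always yields the same 64-bit fingerprint, and the position XOR makes the fingerprint sensitive to where a word occurs, not only to which words occur.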
2. The method for text deduplication based on the modified Simhash algorithm of claim 1, wherein the n keywords extracted from the document are P = {p1, p2, p3, ..., pn} and the weight of each keyword is W = {w1, w2, w3, ..., wn}; hash values are generated for the n keywords, giving H = {h1, h2, h3, ..., hn}, and after superposition a secondary fingerprint F = {f1, f2, f3, ..., fm} is generated, where m is the number of fingerprint bits; finally, whether each fi in F is greater than 0 yields the generated Simhash fingerprint S;
if there exists some feature word pk whose weight satisfies
wk >> wj, j ∈ [1, n] and j ≠ k,
then S is determined by pk.
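The dominant-weight effect described here can be demonstrated with a plain (unmodified) Simhash; a toy sketch with a 16-bit width and MD5 as a stand-in hash, both illustrative choices:

```python
import hashlib

F = 16  # small fingerprint width for the demo

def h(word: str) -> int:
    """Stand-in base hash truncated to F bits."""
    return int(hashlib.md5(word.encode()).hexdigest(), 16) & ((1 << F) - 1)

def simhash(weights: dict) -> int:
    # Plain Simhash: signed accumulation of weighted hash bits,
    # with no position factor and no weight threshold.
    v = [0.0] * F
    for word, w in weights.items():
        hw = h(word)
        for i in range(F):
            v[i] += w if (hw >> i) & 1 else -w
    return sum(1 << i for i in range(F) if v[i] > 0)

# One word with weight 100 against two words with weight 1: the sign of
# every accumulator entry follows the dominant word, so the fingerprint
# collapses to that single word's hash.
fp = simhash({"dominant": 100.0, "a": 1.0, "b": 1.0})
```

Since 100 outweighs the combined 2 in every bit position, `fp` equals `h("dominant")` exactly, which is the distortion the weight threshold Wt is meant to prevent.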
3. The method for text deduplication based on the modified Simhash algorithm of claim 1, wherein after introducing the weight threshold, the weight is calculated as follows:
Figure FDA0002717766100000021
4. the method for text deduplication based on the modified Simhash algorithm of claim 1, wherein the information entropy is:
H(X) = -∑(xi∈X) P(xi) log2 P(xi)
where X represents the information probability space X = {x1: P(x1), x2: P(x2), ..., xn: P(xn)}, and H(X) represents a measure of the uncertainty of the random variable X.
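The entropy definition above can be evaluated over an empirical distribution; a minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(xs) -> float:
    """H(X) = -sum over x of P(x) * log2 P(x), using empirical counts."""
    counts = Counter(xs)
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("aabb"))  # two equiprobable symbols → 1.0
# A sequence with only one symbol has zero entropy (no uncertainty).
```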
5. The method for text deduplication based on the modified Simhash algorithm of claim 1, wherein the information entropy comprises left and right information entropy, and the formula is as follows:
EL(W) = -∑(a∈A) P(aW|W) log2 P(aW|W)
ER(W) = -∑(b∈B) P(Wb|W) log2 P(Wb|W)
wherein W represents a word, EL(W) represents the left information entropy of the word, and P(aW|W) represents the probability of different words appearing to the left of the word, the variable a denoting a word combined with W; ER(W) is the right information entropy, defined symmetrically over the words appearing to the right of W.
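A sketch of estimating left and right information entropy from a token stream, assuming whitespace tokens in place of Chinese word segmentation:

```python
import math
from collections import Counter

def _entropy(neighbors) -> float:
    """Shannon entropy of the empirical neighbor distribution."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def left_right_entropy(tokens, word):
    """EL: entropy of words just left of `word`; ER: just right of it."""
    lefts = [tokens[i - 1] for i, t in enumerate(tokens)
             if t == word and i > 0]
    rights = [tokens[i + 1] for i, t in enumerate(tokens)
              if t == word and i < len(tokens) - 1]
    return _entropy(lefts), _entropy(rights)

tokens = "red apple green apple blue apple".split()
el, er = left_right_entropy(tokens, "apple")
# left neighbors {red, green, blue} are all distinct, so EL = log2(3)
```

A word with many distinct neighbors (high EL and ER) is contextually independent and thus more informative for the fingerprint.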
6. The method for text deduplication based on the modified Simhash algorithm of claim 5, wherein the entropy weighting calculation method comprises:
Hk(w) = (EL(W) + ER(W)) / 2
that is, the left and right information entropies of the feature word are averaged, and Hk(w) represents the entropy information amount of the word; the entropy factor Hk is added to the weight calculation formula, and the square mean of the two is taken as the word weight:
Figure FDA0002717766100000025
the more frequently a feature word tk appears in document dj and the fewer training-set documents the feature word appears in, the larger its information amount and the higher its weight.
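The combined weight formula of this claim is published only as an image, so the sketch below is one illustrative reading of "square mean of the two": the root-mean-square of the TF-IDF value and the entropy factor.

```python
import math

def combined_weight(tfidf: float, entropy: float) -> float:
    # Root-mean-square reading of "square mean of the two"; this
    # combination rule is an assumption, since the claim's exact
    # formula is available only as an image in the patent.
    return math.sqrt((tfidf ** 2 + entropy ** 2) / 2)

print(combined_weight(3.0, 4.0))  # sqrt((9 + 16) / 2) ≈ 3.536
```

Under this reading, a word scores highly only when both its TF-IDF value and its entropy information amount are large.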
7. A control system implementing the method for text deduplication based on the modified Simhash algorithm of claim 1.
8. A redundant-data-removal storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method for text deduplication based on the modified Simhash algorithm of claim 1.
CN201910225442.9A 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication Active CN109948125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910225442.9A CN109948125B (en) 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication

Publications (2)

Publication Number Publication Date
CN109948125A CN109948125A (en) 2019-06-28
CN109948125B true CN109948125B (en) 2020-12-08

Family

ID=67011555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910225442.9A Active CN109948125B (en) 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication

Country Status (1)

Country Link
CN (1) CN109948125B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442679A (en) * 2019-08-01 2019-11-12 信雅达系统工程股份有限公司 A kind of text De-weight method based on Fusion Model algorithm
CN110750731B (en) * 2019-09-27 2023-10-27 成都数联铭品科技有限公司 Method and system for removing duplicate of news public opinion
US11068208B2 (en) 2019-10-29 2021-07-20 EMC IP Holding Company LLC Capacity reduction in a storage system
CN110837555A (en) * 2019-11-11 2020-02-25 苏州朗动网络科技有限公司 Method, equipment and storage medium for removing duplicate and screening of massive texts
CN113011194B (en) * 2021-04-15 2022-05-03 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113300830B (en) * 2021-05-25 2022-05-27 湖南遥昇通信技术有限公司 Data transmission method, device and storage medium based on weighted probability model
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116932526B (en) * 2023-09-19 2023-11-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241277A (en) * 2018-07-18 2019-01-18 北京航天云路有限公司 The method and system of text vector weighting based on news keyword

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463605B (en) * 2017-06-21 2021-06-11 北京百度网讯科技有限公司 Method and device for identifying low-quality news resource, computer equipment and readable medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on an Improved Algorithm for Large-Scale Document Deduplication Based on Simhash; Wang Cheng, Wang Yucheng; Computer Technology and Development; 2018-11-15 (No. 2); Sections 2.2-2.3 of Part 2 *
Research on Improving the TFIDF-Based Text Feature Weighting Method; Zhang Baofu, Shi Huaji, Ma Suqin; Computer Applications and Software; 2011-04-29 (No. 2); Part 2 *
A New Word Discovery Method Based on Contextual Word-Frequency and Vocabulary Metrics; Xing Enjun, Zhao Fuqiang; Computer Applications and Software; 2016-07-15 (No. 6); Section 1.3 of Part 1 *

Also Published As

Publication number Publication date
CN109948125A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109948125B (en) Method and system for improved Simhash algorithm in text deduplication
Koniusz et al. Higher-order occurrence pooling for bags-of-words: Visual concept detection
US10346257B2 (en) Method and device for deduplicating web page
CN111027069B (en) Malicious software family detection method, storage medium and computing device
Zheng et al. $\mathcal {L} _p $-Norm IDF for Scalable Image Retrieval
CN107368542B (en) Method for evaluating security-related grade of security-related data
CN112368697A (en) System and method for evaluating a loss function or a gradient of a loss function via dual decomposition
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN113807073B (en) Text content anomaly detection method, device and storage medium
Nie et al. Implicit weight learning for multi-view clustering
CN109376235B (en) Feature selection method based on document layer word frequency reordering
WO2023134074A1 (en) Text topic generation method and apparatus, and device and storage medium
US11281714B2 (en) Image retrieval
CN112579783B (en) Short text clustering method based on Laplace atlas
CN110968693A (en) Multi-label text classification calculation method based on ensemble learning
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN111782805A (en) Text label classification method and system
CN114202038B (en) Crowdsourcing defect classification method based on DBM deep learning
JP2008282111A (en) Similar document retrieval method, program and device
CN113609247A (en) Big data text duplicate removal technology based on improved Simhash algorithm
CN111914718A (en) Feature weighting PCA face recognition method based on average influence value data conversion
CN111651643A (en) Processing method of candidate content and related equipment
CN111382273A (en) Text classification method based on feature selection of attraction factors
Iwata et al. Unsupervised Object Matching for Relational Data
CN116304749B (en) Long text matching method based on graph convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230707

Address after: 518000 room 321, building 2, Nanke Chuangyuan Valley, Taoyuan community, Dalang street, Longhua District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen lizhuan Technology Transfer Center Co.,Ltd.

Address before: 610225, No. 24, Section 1, Xuefu Road, Southwest Economic Development Zone, Chengdu, Sichuan

Patentee before: CHENGDU University OF INFORMATION TECHNOLOGY

Effective date of registration: 20230707

Address after: 201400 Floor 2, No. 2900, Nanxinggang Road, Fengxian District, Shanghai

Patentee after: Shanghai Yiju Technology Co.,Ltd.

Address before: 518000 room 321, building 2, Nanke Chuangyuan Valley, Taoyuan community, Dalang street, Longhua District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen lizhuan Technology Transfer Center Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20231226

Address after: No. 135, Hongtian Junyue Zhongchuang Space, 1st Floor, Unit 9A, Bund Center, No. 88 Duhai Road, Binhai Street, Longhua District, Haikou City, Hainan Province, 570100

Patentee after: Haikou Lingjie Information Technology Co.,Ltd.

Address before: 201400 Floor 2, No. 2900, Nanxinggang Road, Fengxian District, Shanghai

Patentee before: Shanghai Yiju Technology Co.,Ltd.