Method and system of an improved Simhash algorithm for text duplicate removal
Technical field
The invention belongs to the technical field of information processing, and more particularly relates to a method and system of an improved Simhash algorithm for text duplicate removal.
Background art
Currently, the state of the art commonly used in the trade is as follows:
In the area of removing redundant data, the Simhash algorithm is currently the generally acknowledged best duplicate removal algorithm. The algorithm is a locality-sensitive hashing algorithm: it performs probabilistic dimensionality reduction on high-dimensional data, mapping it to a fingerprint with a small, fixed number of bits, and then compares the similarity of the fingerprints to reflect the degree of similarity between the data. The similarity comparison usually uses the Hamming distance or the edit distance. The advantages of the Simhash algorithm are fast processing speed and high result precision.
Nowadays, the Simhash algorithm is widely used in fields such as approximate text detection, redundant data removal, and anomaly detection.
Dong Bo, Zheng Qinghua et al. proposed a multi-fingerprint Simhash algorithm that performs similarity calculation on a k-dimensional multi-surface using a variety of fingerprint values, effectively solving the problem that a single fingerprint loses a serious amount of information. Chen Bo, Pan Yongtao et al. added a depreciation operation to the Simhash algorithm, subtracting a threshold T from the finally merged result sequence string, thereby improving the accuracy of the Simhash algorithm. Ni S, Qian Q et al. combined the Simhash algorithm with a CNN for malware detection, converting samples into grayscale images to improve malware discrimination and performance.
In conclusion, the problems of the prior art are:
(1) In the prior art, data processing methods with fast processing speed have low result precision.
(2) The existing Simhash algorithm has shortcomings in weight calculation: the generated hash fingerprint cannot reflect the proportion of key feature items.
(3) The prior art fails to reflect the distribution information of the feature vocabulary of a document.
The difficulty of solving the above technical problems:
In order to improve the text duplicate removal effect and the accuracy of the Simhash algorithm, and to overcome the shortcoming that the Simhash algorithm cannot reflect distribution information, the concept of information entropy is introduced. The keywords in a document are weighted by entropy weighting, the weight calculation formula is optimized, and keyword distribution information is added into the hash calculation, thereby optimizing the traditional Simhash algorithm. Finally, the feasibility and reasonableness of the algorithm are verified through simulation experiments.
The significance of solving the above technical problems:
The algorithm introduces TF-IDF and information entropy. By optimizing the weight and threshold calculations in the Simhash algorithm, it adds text distribution information, so that the finally generated fingerprint better reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed. Simulation experiments show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method and system of an improved Simhash algorithm for text duplicate removal.
The invention is realized in this way: a method of an improved Simhash algorithm for text duplicate removal includes: weighting based on the TF-IDF algorithm and information entropy to obtain weights, sorting the feature vocabulary according to its distribution in the document, and XORing the hash generated for each feature word with the position of that feature word;
After the improved weight calculation, a weight threshold Wt is introduced to add text distribution information, so that the finally generated fingerprint reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed.
Further, the method of the improved Simhash algorithm for text duplicate removal specifically includes:
Step 1, initialization:
Determine the Simhash bit number and the f-dimensional vector space according to the data set size and the storage cost, and initialize an f-bit binary number s to 0;
Step 2, document pre-processing:
Segment the document into words and remove stop words, constituting the feature terms of the document, M = {p1, p2, ..., pn};
Step 3, weight calculation:
Separately calculate the TF-IDF value and the left and right information entropy of each feature item after word segmentation, use the root mean square of the TF-IDF value and the entropy value as the final weight of the feature item, and introduce a threshold Wt to prevent the document features from being distorted;
Step 4, hash calculation:
Perform a hash calculation on each feature item from Step 2, introduce a location factor, and XOR it with the hash to obtain the final hash value of the feature item, so that the hash value contains the location information of the feature item; the result is denoted H = {h1, h2, ..., hn}, where hi = hash(pi) XOR posi;
Step 5, accumulation:
Accumulate, bit by bit, the f-bit hash values generated in Step 4, each weighted by the feature item weight generated in Step 3;
Step 6, compressed transform:
For the finally produced second-level fingerprint vector V, perform a conversion on each bit, ultimately generating the f-bit hash fingerprint S of the document.
Further, n keywords are extracted from the document, denoted {p1, p2, p3, ..., pn}, and the weight of each keyword is W = {w1, w2, w3, ..., wn}. Hash values are generated for the n keywords, the result being H = {h1, h2, h3, ..., hn}. After superposition, the second-level fingerprint F = {f1, f2, f3, ..., fm} is generated, where m is the fingerprint bit number. Finally, the Simhash fingerprint S is generated according to whether each fi in F is greater than 0.
If there exists a certain feature word pk whose weight satisfies
wk >> wj, for all j ∈ [1, n], j ≠ k,
then S is determined by pk.
Further, after the weight threshold Wt is introduced, the weight is calculated as:
w'k = min(wk, Wt)
i.e., any feature weight exceeding the threshold Wt is capped at Wt, so that no single feature dominates the fingerprint.
Further, the information entropy is:
H(X) = -∑_{xi∈X} P(xi) log2 P(xi)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), ..., xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
Further, the information entropy includes the left and right information entropy, with formulas as follows:
EL(W) = -∑_a P(aW|W) log2 P(aW|W)
ER(W) = -∑_b P(Wb|W) log2 P(Wb|W)
where W denotes a word, EL(W) denotes the left entropy of the word, and P(aW|W) denotes the probability that a word a appears on the left side of the word W; the variable a is a changing value denoting the vocabulary combined with W. ER(W) is the right entropy, defined symmetrically on the right side.
Further, the entropy weighting calculation method includes:
The left and right information entropies of a feature word are averaged; Hk(w) denotes the entropy information amount of the word. The entropy factor Hk is added into the weight calculation formula, and the root mean square of the two is taken as the word weight, as follows:
wk = sqrt((tfidf(tk, dj)^2 + Hk^2) / 2)
The more times a feature word tk appears in document dj, and the fewer the documents in the training set in which the word appears, the larger its information content, and hence the higher its weight.
Another object of the present invention is to provide a control system of the improved Simhash algorithm for text duplicate removal, implementing the method of the improved Simhash algorithm in text duplicate removal.
Another object of the present invention is to provide a redundant-data-removal storage medium of the improved Simhash algorithm for text duplicate removal, implementing the method of the improved Simhash algorithm in text duplicate removal.
In conclusion, the advantages and positive effects of the present invention are as follows:
The present invention has a good effect in removing redundant data. The Simhash algorithm is a locality-sensitive hashing algorithm: it performs probabilistic dimensionality reduction on high-dimensional data, mapping it to a fingerprint with a small, fixed number of bits, and then compares the similarity of the fingerprints to reflect the degree of similarity between the data. The similarity comparison usually uses the Hamming distance or the edit distance. The advantages of the Simhash algorithm are fast processing speed and high result precision.
Aiming at the shortcomings of the traditional Simhash algorithm in weight calculation and its failure to take into account the distribution information of the document feature vocabulary, the present invention optimizes the weight calculation by using the root mean square of TF-IDF and information entropy as the weight value of a feature word. Considering that an excessively large partial weight leads to information distortion, a weight threshold is introduced, and on this basis the location information of the feature word is introduced into the hash calculation, thereby improving the duplicate removal rate and precision ratio of the Simhash algorithm. Simulation experiments verify that the E-Simhash algorithm is superior to the traditional Simhash algorithm in all respects.
In the simulation experiments, the whole-network news data of Sogou Labs is used as the document set. In the duplicate removal rate comparison experiment, the experimental results are as shown in Figure 4. Varying the similarity threshold T, the duplicate removal rate of the E-Simhash algorithm is better than that of the traditional Simhash algorithm, by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and as the amount of article modification increases, the duplicate removal rates of both algorithms show a downward trend; these simulation results are as shown in Figure 5. Finally, in the comparison of precision ratio, recall rate, and F value, as shown in Figure 6, the E-Simhash algorithm is better than the traditional Simhash algorithm with precision ratio 0.963:0.818, recall rate 0.867:0.621, and F1 value 0.912:0.706. The simulation results show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
Brief description of the drawings
Fig. 1 is a flow chart of the Simhash algorithm provided in an embodiment of the present invention.
Fig. 2 is a diagram of the influence of word position on Simhash provided in an embodiment of the present invention.
Fig. 3 is a process diagram of the E-Simhash algorithm provided in an embodiment of the present invention.
Fig. 4 is a comparison diagram of duplicate removal rates under different Hamming distances provided in an embodiment of the present invention.
Fig. 5 is a comparison diagram of duplicate removal rates under different threshold values provided in an embodiment of the present invention.
Fig. 6 is a comprehensive comparison diagram provided in an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
In the prior art, data processing methods with fast processing speed have low result precision.
To solve the above problems, the present invention is described in detail below with reference to a concrete analysis.
The method of the improved Simhash algorithm for text duplicate removal provided in an embodiment of the present invention includes: weighting based on the TF-IDF algorithm and information entropy to obtain weights, sorting the feature vocabulary according to its distribution in the document, and XORing the hash generated for each feature word with the position of that feature word;
After the improved weight calculation, a weight threshold Wt is introduced to add text distribution information, so that the finally generated fingerprint reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed.
The invention will be further described below with reference to a definitional analysis.
1. Analysis of the Simhash algorithm
Definition 1. The principle of the Simhash algorithm: for two given variables x and y, a hash function h always satisfies the following formula:
Pr_{h∈F}(h(x) = h(y)) = sim(x, y) (1)
where sim(x, y) ∈ [0, 1] is a similarity function. The similarity of the variables x and y is also commonly expressed with the Jaccard function, with sim(x, y) expressed as follows:
sim(x, y) = |x ∩ y| / |x ∪ y| (2)
h belongs to the hash function family F, which must meet the following conditions:
1) If d(x, y) ≤ d1, then Pr_{h∈F}(h(x) = h(y)) ≥ p1;
2) If d(x, y) ≥ d2, then Pr_{h∈F}(h(x) = h(y)) ≤ p2.
F is then called a (d1, d2, p1, p2)-sensitive hash function family, where d(x, y) denotes the distance between the variables x and y. Put plainly, if x and y are similar enough, the probability that they are mapped to the same hash value is sufficiently large; conversely, the probability that their hash values are equal is sufficiently small.
The biggest difference between a traditional hash function and the Simhash function is local sensitivity: if the input data is modified slightly in places, a traditional hash function may produce an entirely different result, whereas the results calculated by Simhash remain very similar. Therefore, the degree of similarity of the fingerprints generated by the Simhash function can be used to indicate the degree of similarity between the source data.
2. Simhash algorithm flow:
The flow of the Simhash algorithm is as follows: first define a space of f dimensions, then define the vector corresponding to each feature in this space, and then weight all the vectors by their own weights and sum them to obtain one vector as the result. Finally, a compression conversion is further performed on the result, the rule being: each vector yields a corresponding f-bit signature; if the value of a vector dimension is greater than 0, the corresponding bit of the signature is set to 1, and otherwise it is set to 0. Through such a transformation, the obtained signature carries the information of the value of the vector in each dimension.
The flow chart of the Simhash algorithm is as shown in Figure 1. The specific steps of the Simhash algorithm are as follows:
Step 1: Initialization
Determine the Simhash bit number and the f-dimensional vector space according to the data set size and the storage cost, and initialize an f-bit binary number s to 0.
Step 2: Document pre-processing
This mainly includes two parts. The first part is word segmentation: finding the feature vocabulary of the document, removing stop words, and so on. The second is weight assignment; in general, the calculation of the weights is ignored here and each weight is set to 1.
Step 3: Generating hash values
Calculate an f-bit hash value for each feature word in Step 2 using a traditional hashing algorithm, and perform the following operation: for each bit, if the hash bit is 1, add the word's weight to the corresponding vector dimension, and otherwise subtract it.
Step 4: Compressed transform
For the finally produced vector V, perform a conversion on each bit: set the signature bit to 1 if the dimension is greater than 0, and to 0 otherwise.
Step 5: Fingerprint generation
Output the final signature S as the fingerprint of the document; the Hamming distance or the edit distance is then used to calculate similarity.
Step 6: Distance calculation
The Simhash algorithm uses the Hamming distance for similarity calculation. The Hamming distance measures the similarity between two documents by comparing the number of differing bits in the two document fingerprints. The larger the Hamming distance, the lower the similarity of the two strings; conversely, the smaller the distance, the higher the similarity. For binary strings, the Hamming distance can be calculated with an XOR operation.
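The six steps above can be sketched as follows. This is a minimal illustration, assuming Python's built-in md5 (truncated to f bits) as the per-word hash and a 64-bit fingerprint; the patent does not fix either choice:

```python
import hashlib

def simhash(weighted_words, f=64):
    """Compute an f-bit Simhash fingerprint from (word, weight) pairs."""
    v = [0] * f  # Step 1: f-dimensional vector initialized to 0
    for word, weight in weighted_words:
        # Step 3: f-bit hash of the word (md5 truncated to f bits, an assumption)
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            # add the weight where the hash bit is 1, subtract it where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # Step 4: compressed transform - signature bit i is 1 iff v[i] > 0
    s = 0
    for i in range(f):
        if v[i] > 0:
            s |= 1 << i
    return s  # Step 5: the fingerprint S
```

In the traditional algorithm of Step 2 every weight is 1; in the E-Simhash variant described later, the weight is replaced by the thresholded entropy weight.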
The present invention will be further described below with reference to examples.
Embodiment 1:
Let a and b be two binary numbers, where a = 00110 and b = 01110. The two binary numbers a and b differ only in the second bit, therefore Hamming(a, b) = 1. An XOR operation can also be used: count the number of 1s in the XOR result. Here a XOR b = 01000 contains a single 1, therefore the Hamming distance is 1.
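The XOR-and-count procedure of Embodiment 1 can be written directly; a short sketch:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary numbers:
    the number of 1 bits in a XOR b."""
    return bin(a ^ b).count("1")

# Embodiment 1: a = 00110 and b = 01110 differ in exactly one bit
assert hamming(0b00110, 0b01110) == 1
```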
In weight calculation, the traditional Simhash algorithm usually sets the weight to 1 or to the number of occurrences of the feature word. This easily causes information loss and reduces the accuracy of the final Simhash fingerprint. Moreover, the Simhash algorithm does not reflect the vocabulary distribution information: adjusting the positions of key feature words does not affect the finally generated Simhash fingerprint. As shown in Fig. 2, adjusting the positions of two keywords may result in an entirely different final meaning, yet the fingerprints generated by the traditional Simhash algorithm are the same.
Embodiment 2
In order to improve the text duplicate removal effect and the accuracy of the Simhash algorithm, and to overcome the shortcoming that the Simhash algorithm cannot reflect distribution information, a Simhash algorithm based on information entropy weighting (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy, and adds text distribution information by optimizing the weight and threshold calculations in the Simhash algorithm, so that the finally generated fingerprint better reflects the proportion of key information; the correlation between the fingerprint information and the weights is analyzed.
Simulation experiments show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
In the present invention, (1) term frequency-inverse document frequency includes:
The term frequency-inverse document frequency (TF-IDF) algorithm is a common text feature weight calculation method. The TF-IDF value of a feature word tk in document dj is denoted tfidf(tk, dj) and is defined as follows:
Definition 2. The frequency tf(tk, dj) with which the feature word tk appears in document dj is
tf(tk, dj) = n_{j,k} / ∑_i n_{j,i} (3)
where n_{j,k} denotes the number of times the feature word tk appears in document dj, and ∑_i n_{j,i} denotes the number of all feature words in document dj.
Definition 3. The inverse document frequency idf(tk) is a coefficient that measures the importance of a feature word, defined as:
idf(tk) = log(|D| / |{j: tk ∈ dj}|) (4)
where {j: tk ∈ dj} is the set of documents containing the feature word tk, and |D| is the total number of documents in the corpus.
Definition 4. The TF-IDF function, the term frequency weight of a feature word, is defined as:
wk = tfidf(tk, dj) = tf(tk, dj) * idf(tk) (5)
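Definitions 2 through 4 can be sketched over a toy corpus as follows; a minimal illustration, not a production implementation (it assumes every queried term occurs in at least one document, so the idf denominator is nonzero):

```python
import math

def tf(term, doc):
    """Definition 2: occurrences of the term divided by total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Definition 3: log of total documents over documents containing the term.
    Assumes the term appears in at least one document of the corpus."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    """Definition 4, equation (5): w_k = tf(t_k, d_j) * idf(t_k)."""
    return tf(term, doc) * idf(term, corpus)
```

For example, with corpus = [["a", "b", "a"], ["b", "c"]], the word "b" occurs in every document, so its idf (and hence its TF-IDF weight) is 0, while "a" is weighted up as a discriminating term.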
In the present invention, (2) the information entropy includes:
The information entropy represents the measure of the uncertainty of the result before a random event occurs and, after the random event has occurred, the amount of information obtained from the event.
According to the definition of information entropy:
H(X) = -∑_{xi∈X} P(xi) log2 P(xi) (6)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), ..., xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
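Equation (6) can be sketched directly; zero-probability outcomes are skipped, following the usual convention 0·log 0 = 0:

```python
import math

def entropy(probs):
    """Equation (6): H(X) = -sum P(x_i) * log2 P(x_i) over the probability space."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform two-outcome space gives 1 bit of entropy, a certain outcome gives 0 bits, and a uniform four-outcome space gives 2 bits.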
In the present invention, (3) the left and right information entropy:
The left and right entropy refer to the entropy of the left boundary and the entropy of the right boundary of a multi-character expression. The formulas of the left and right entropy are as follows:
EL(W) = -∑_a P(aW|W) log2 P(aW|W) (7)
ER(W) = -∑_b P(Wb|W) log2 P(Wb|W) (8)
where W denotes a word, EL(W) denotes the left entropy of the word, and P(aW|W) denotes the probability that a word a appears on the left side of the word W; the variable a is a changing value denoting the vocabulary combined with W. ER(W) is the right entropy, defined in the same way on the right side.
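Equations (7) and (8) are the same computation applied to the neighbor words observed on either side of W; a small sketch, where the neighbor counts are assumed to have been collected from a corpus beforehand (the counts below are hypothetical):

```python
import math
from collections import Counter

def side_entropy(neighbor_counts):
    """Entropy of the words observed on one side of W, equations (7)/(8):
    E(W) = -sum P(a|W) * log2 P(a|W), with P estimated from counts."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

# hypothetical left-neighbor counts of some word W observed in a corpus
left_neighbors = Counter({"big": 2, "small": 1, "red": 1})
left_entropy = side_entropy(left_neighbors)
```

The more varied the neighbors of W, the higher its boundary entropy, i.e., the more independently the word is used.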
In the present invention, (4) the entropy weighting calculation method includes:
The present invention uses an entropy weighting calculation method. Here the left and right information entropies of a feature word are averaged:
Hk(w) = (EL(w) + ER(w)) / 2 (9)
Hk(w) denotes the entropy information amount of the word. The entropy factor Hk is added into the weight calculation formula, and the root mean square of the two is taken as the word weight, as follows:
wk = sqrt((tfidf(tk, dj)^2 + Hk^2) / 2) (10)
The physical significance of the above formula is: the more times the feature word tk appears in document dj, and the fewer the documents in the training set in which this feature word appears, the larger its information content, and the higher its weight.
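Under the reading that "root mean square of the two" means sqrt((tfidf^2 + Hk^2)/2), equations (9) and (10) can be sketched as follows, with the weight threshold applied as a cap (the min-capping form of formula (16) is an assumption consistent with the stated purpose of preventing excessive weights):

```python
import math

def entropy_weight(tfidf_value, left_entropy, right_entropy, w_t=None):
    """Entropy-weighted word weight, equations (9)-(10), optionally
    capped at the threshold W_t (capping form is an assumption)."""
    h_k = (left_entropy + right_entropy) / 2          # equation (9)
    w = math.sqrt((tfidf_value ** 2 + h_k ** 2) / 2)  # equation (10)
    if w_t is not None:
        w = min(w, w_t)                               # formula (16), assumed form
    return w
```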
In the present invention, the Simhash algorithm based on entropy weighting (E-Simhash) specifically includes:
First, weights are obtained by weighting based on the TF-IDF algorithm and information entropy, the feature vocabulary is sorted according to its distribution in the document, and the hash generated for each feature word is XORed with its position.
However, after the improved weight calculation, factors such as an incomplete training set may lead to excessively large weights for some feature words, which ultimately causes the precision ratio to decline. In order to solve this problem, a weight threshold Wt is introduced. The problem caused by weight imbalance is proved below.
Suppose n keywords extracted from a document are {p1, p2, p3, ..., pn}, and the weight of each keyword is W = {w1, w2, w3, ..., wn}. Hash values are generated for the n keywords, with result H = {h1, h2, h3, ..., hn}. After superposition, the second-level fingerprint F = {f1, f2, f3, ..., fm} is generated, where m is the fingerprint bit number. Finally, the Simhash fingerprint S is generated according to whether each fi in F is greater than 0.
If there exists a certain feature word pk whose weight satisfies
wk >> wj, j ∈ [1, n], j ≠ k (11)
then S is mainly determined by pk. The proof is as follows:
Let hi = {a_{i1}, a_{i2}, a_{i3}, ..., a_{im}}, where a_{ij} ∈ {-1, +1} encodes the j-th bit of hi. Then
fj = ∑_{i=1}^{n} wi * a_{ij} (12)
Extracting wk gives
fj = wk * a_{kj} + ∑_{i≠k} wi * a_{ij} (13)
Because wk >> wj for all j ≠ k, we have
|wk * a_{kj}| >> |∑_{i≠k} wi * a_{ij}| (14)
So at this time:
sign(fj) = sign(wk * a_{kj}) (15)
Finally, F is mainly related to pk. The proof is complete.
The above proof also reflects the influence of the weights on the Simhash fingerprint.
After the weight threshold is introduced, the weight calculation at this time is as shown in formula (16):
w'k = min(wk, Wt) (16)
i.e., any weight exceeding Wt is capped at the threshold Wt.
In conclusion E-Simhash algorithm flow is as shown in Figure 3.
E-Simhash algorithm has that following three points are different from traditional Simhash algorithm, mainly draws on the basis of TF-IDF
Enter comentropy and carry out term weight function calculating, and use the square mean number of the two as last term weight function, is simultaneously
The situation for avoiding weight excessively high leads to distortion of fingerprint, weight threshold is introduced, shown in calculation such as formula (16).Finally generating
Exclusive or is carried out with Feature Words position when Feature Words hash, making its hash includes the location distribution information of document.
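Combining the three changes, E-Simhash fingerprint generation can be sketched as follows. This is a minimal illustration, assuming md5 as the base hash and the word's index in the document as its position factor; the patent does not fix either choice:

```python
import hashlib

def e_simhash(weighted_words, f=64):
    """E-Simhash sketch: weighted_words is a list of (word, position, weight)
    triples, where weight is the thresholded entropy weight. Each word hash
    is XORed with its position so the fingerprint carries location information."""
    mask = (1 << f) - 1
    v = [0] * f
    for word, position, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & mask
        h ^= position & mask  # inject the location factor via XOR
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Unlike traditional Simhash, swapping the positions of two keywords changes their per-word hashes and hence, in general, the resulting fingerprint, which is the behavior illustrated in Fig. 2.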
The invention will be further described below with reference to specific simulation experiments.
Simulation experiments and analysis: the present invention mainly simulates real application scenarios to verify whether the performance of the E-Simhash algorithm is superior to that of the traditional Simhash algorithm.
Experimental environment and data set:
The experimental environment is deployed on a desktop computer, with machine parameters as follows:
Table 1. Experimental environment parameters
The data set is the 2012 edition of the whole-network news data of Sogou Labs, consisting of classified news from nearly 20 columns of multiple news sites. Articles shorter than 800 characters are rejected, and 1565 articles are randomly selected from the remainder for the subsequent experiments.
First, from the 1565 news articles, several are randomly selected according to a modification ratio and subjected to random operations such as modification, deletion, shifting, and replacement, while controlling the modified articles to have a similarity with the original articles at a certain threshold T, generating the sample set to be tested. The traditional Simhash algorithm is then compared with the algorithm of this patent, and the relevant experimental indexes are recorded.
Analysis of experimental results
Four indexes are commonly used to assess the experimental results: the duplicate removal rate, the precision ratio, the recall rate, and the F value. The duplicate removal rate refers to the ratio of correctly classified samples to the total samples; for this experiment, it is the ratio of the number of articles predicted to come from the same source to the total number of articles.
The invention will be further described below with reference to specific experiments.
Experiment 1: Comparison of duplicate removal rates
1162 of the 1565 news articles are randomly selected and arbitrarily modified, and different Hamming distances are chosen to compare the accuracy rates of the two algorithms. In the test T = 15%, i.e., each news article keeps no more than 15% modification; the fingerprint length is 128, and the word weight threshold is Wt = 90. The experimental results are shown in Figure 4.
The experimental results show that the E-Simhash algorithm has a very high duplicate removal rate whenever the Hamming distance is greater than 2. In practice the Hamming distance is generally taken to be around 10, so the duplicate removal effect of the E-Simhash algorithm is better.
Experiment 2: Comparison under modified threshold T
The present invention tests the similarity threshold T of the modified text under modifications of 5%, 10%, 15%, and 20% respectively; the Hamming distance is selected as 10, i.e., fingerprints at distance below 10 are considered similar, and the duplicate removal rates of the two algorithms are compared. From the experimental results shown in Figure 5, the duplicate removal rate of the E-Simhash algorithm is better than that of the traditional Simhash algorithm, by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and as the amount of article modification increases, the duplicate removal rates of both algorithms show a downward trend. The experimental results show that, at different modification thresholds T, the E-Simhash algorithm is superior to the traditional Simhash algorithm.
Experiment 3: Comparison of precision ratio, recall rate, and F value
In this experiment, an article is randomly selected from the news set and modified at random while guaranteeing a 90% similarity with the original text, and the precision ratio, recall ratio, and F1 value of the traditional Simhash fingerprint and the E-Simhash algorithm are compared. The Hamming distance is chosen as 10; the experiment is carried out 100 times and the average value is taken as the final result, as shown in Figure 6. The experimental data show that the E-Simhash algorithm is better than the traditional Simhash algorithm, with precision ratio 0.963:0.818, recall rate 0.867:0.621, and F1 value 0.912:0.706. The results show that the E-Simhash algorithm is greatly improved over the common Simhash algorithm in terms of precision ratio, recall rate, and F value, which fully demonstrates the superiority of the E-Simhash algorithm.
The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent replacements, and improvements made within the spirit and principle of the present invention shall all be included in the protection scope of the present invention.