Method and system of an improved Simhash algorithm for text duplicate removal
Technical field
The invention belongs to the technical field of information processing, and more particularly relates to a method and system of an improved Simhash algorithm for text duplicate removal.
Background art
Currently, the state of the art commonly used in the trade is as follows:
In the area of removing redundant data, the Simhash algorithm is currently the generally acknowledged best duplicate removal algorithm. The algorithm is a locality-sensitive hashing algorithm: it performs probabilistic dimensionality reduction on high-dimensional data, mapping it to a fingerprint with a small, fixed number of bits, and then compares the similarity of the fingerprints to reflect the degree of similarity between the data. The similarity comparison usually uses the Hamming distance or the edit distance. The advantages of the Simhash algorithm are fast processing speed and high result precision.
Nowadays, the Simhash algorithm is widely used in fields such as approximate text detection, redundant data removal, and anomaly detection.
Dong Bo, Zheng Qinghua et al. proposed a multi-fingerprint Simhash algorithm that performs similarity calculation on a k-dimensional multi-surface using a variety of fingerprint values, effectively solving the problem that a single fingerprint loses a serious amount of information. Chen Bo, Pan Yongtao et al. added a depreciation operation to the Simhash algorithm, subtracting a threshold T from the finally merged result sequence string, thereby improving the accuracy of the Simhash algorithm. Ni S, Qian Q et al. combined the Simhash algorithm with a CNN for malware detection, converting samples into grayscale images to improve malware discrimination and performance.
In conclusion, the problems of the prior art are:
(1) In the prior art, data processing methods with fast processing speed have low result precision.
(2) The existing Simhash algorithm has shortcomings in weight calculation: the generated hash fingerprint cannot reflect the proportion of key feature items.
(3) The prior art fails to reflect the distribution information of the feature vocabulary of a document.
The difficulty of solving the above technical problems:
In order to improve the text duplicate removal effect and the accuracy of the Simhash algorithm, and to overcome the shortcoming that the Simhash algorithm cannot reflect distribution information, the concept of information entropy is introduced. The keywords in a document are weighted by entropy weighting, the weight calculation formula is optimized, and keyword distribution information is added into the hash calculation, thereby optimizing the traditional Simhash algorithm. Finally, the feasibility and reasonableness of the algorithm are verified through simulation experiments.
The significance of solving the above technical problems:
The algorithm introduces TF-IDF and information entropy. By optimizing the weight and threshold calculations in the Simhash algorithm, it adds text distribution information, so that the finally generated fingerprint better reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed. Simulation experiments show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method and system of an improved Simhash algorithm for text duplicate removal.
The invention is realized in this way: a method of an improved Simhash algorithm for text duplicate removal includes: weighting based on the TF-IDF algorithm and information entropy to obtain weights, sorting the feature vocabulary according to its distribution in the document, and XORing the hash generated for each feature word with the position of that feature word;
After the improved weight calculation, a weight threshold Wt is introduced to add text distribution information, so that the finally generated fingerprint reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed.
Further, the method of the improved Simhash algorithm for text duplicate removal specifically includes:
Step 1, initialization:
Determine the Simhash bit number and the f-dimensional vector space according to the data set size and the storage cost, and initialize an f-bit binary number s to 0;
Step 2, document pre-processing:
Segment the document into words and remove stop words, constituting the feature terms of the document, M = {p1, p2, ..., pn};
Step 3, weight calculation:
Separately calculate the TF-IDF value and the left and right information entropy of each feature item after word segmentation, use the root mean square of the TF-IDF value and the entropy value as the final weight of the feature item, and introduce a threshold Wt to prevent the document features from being distorted;
Step 4, hash calculation:
Perform a hash calculation on each feature item from Step 2, introduce a location factor, and XOR it with the hash to obtain the final hash value of the feature item, so that the hash value contains the location information of the feature item; the result is denoted H = {h1, h2, ..., hn}, where hi = hash(pi) XOR posi;
Step 5, accumulation:
Accumulate, bit by bit, the f-bit hash values generated in Step 4, each weighted by the feature item weight generated in Step 3;
Step 6, compressed transform:
For the finally produced second-level fingerprint vector V, perform a conversion on each bit, ultimately generating the f-bit hash fingerprint S of the document.
Further, n keywords are extracted from the document, denoted {p1, p2, p3, ..., pn}, and the weight of each keyword is W = {w1, w2, w3, ..., wn}. Hash values are generated for the n keywords, the result being H = {h1, h2, h3, ..., hn}. After superposition, the second-level fingerprint F = {f1, f2, f3, ..., fm} is generated, where m is the fingerprint bit number. Finally, the Simhash fingerprint S is generated according to whether each fi in F is greater than 0.
If there exists a certain feature word pk whose weight satisfies
wk >> wj, for all j ∈ [1, n], j ≠ k,
then S is determined by pk.
Further, after the weight threshold Wt is introduced, the weight is calculated as:
w'k = min(wk, Wt)
i.e., any feature weight exceeding the threshold Wt is capped at Wt, so that no single feature dominates the fingerprint.
Further, the information entropy is:
H(X) = -∑_{xi∈X} P(xi) log2 P(xi)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), ..., xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
Further, the information entropy includes the left and right information entropy, with formulas as follows:
EL(W) = -∑_a P(aW|W) log2 P(aW|W)
ER(W) = -∑_b P(Wb|W) log2 P(Wb|W)
where W denotes a word, EL(W) denotes the left entropy of the word, and P(aW|W) denotes the probability that a word a appears on the left side of the word W; the variable a is a changing value denoting the vocabulary combined with W. ER(W) is the right entropy, defined symmetrically on the right side.
Further, the entropy weighting calculation method includes:
The left and right information entropies of a feature word are averaged; Hk(w) denotes the entropy information amount of the word. The entropy factor Hk is added into the weight calculation formula, and the root mean square of the two is taken as the word weight, as follows:
wk = sqrt((tfidf(tk, dj)^2 + Hk^2) / 2)
The more times a feature word tk appears in document dj, and the fewer the documents in the training set in which the word appears, the larger its information content, and hence the higher its weight.
Another object of the present invention is to provide a control system of the improved Simhash algorithm for text duplicate removal, implementing the method of the improved Simhash algorithm in text duplicate removal.
Another object of the present invention is to provide a redundant-data-removal storage medium of the improved Simhash algorithm for text duplicate removal, implementing the method of the improved Simhash algorithm in text duplicate removal.
In conclusion, the advantages and positive effects of the present invention are as follows:
The present invention has a good effect in removing redundant data. The Simhash algorithm is a locality-sensitive hashing algorithm: it performs probabilistic dimensionality reduction on high-dimensional data, mapping it to a fingerprint with a small, fixed number of bits, and then compares the similarity of the fingerprints to reflect the degree of similarity between the data. The similarity comparison usually uses the Hamming distance or the edit distance. The advantages of the Simhash algorithm are fast processing speed and high result precision.
Aiming at the shortcomings of the traditional Simhash algorithm in weight calculation and its failure to take into account the distribution information of the document feature vocabulary, the present invention optimizes the weight calculation by using the root mean square of TF-IDF and information entropy as the weight value of a feature word. Considering that an excessively large partial weight leads to information distortion, a weight threshold is introduced, and on this basis the location information of the feature word is introduced into the hash calculation, thereby improving the duplicate removal rate and precision ratio of the Simhash algorithm. Simulation experiments verify that the E-Simhash algorithm is superior to the traditional Simhash algorithm in all respects.
In the simulation experiments, the whole-network news data of Sogou Labs is used as the document set. In the duplicate removal rate comparison experiment, the experimental results are as shown in Figure 4. Varying the similarity threshold T, the duplicate removal rate of the E-Simhash algorithm is better than that of the traditional Simhash algorithm, by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and as the amount of article modification increases, the duplicate removal rates of both algorithms show a downward trend; these simulation results are as shown in Figure 5. Finally, in the comparison of precision ratio, recall rate, and F value, as shown in Figure 6, the E-Simhash algorithm is better than the traditional Simhash algorithm with precision ratio 0.963:0.818, recall rate 0.867:0.621, and F1 value 0.912:0.706. The simulation results show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
Brief description of the drawings
Fig. 1 is a flow chart of the Simhash algorithm provided in an embodiment of the present invention.
Fig. 2 is a diagram of the influence of word position on Simhash provided in an embodiment of the present invention.
Fig. 3 is a process diagram of the E-Simhash algorithm provided in an embodiment of the present invention.
Fig. 4 is a comparison diagram of duplicate removal rates under different Hamming distances provided in an embodiment of the present invention.
Fig. 5 is a comparison diagram of duplicate removal rates under different threshold values provided in an embodiment of the present invention.
Fig. 6 is a comprehensive comparison diagram provided in an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
In the prior art, data processing methods with fast processing speed have low result precision.
To solve the above problems, the present invention is described in detail below with reference to a concrete analysis.
The method of the improved Simhash algorithm for text duplicate removal provided in an embodiment of the present invention includes: weighting based on the TF-IDF algorithm and information entropy to obtain weights, sorting the feature vocabulary according to its distribution in the document, and XORing the hash generated for each feature word with the position of that feature word;
After the improved weight calculation, a weight threshold Wt is introduced to add text distribution information, so that the finally generated fingerprint reflects the proportion of key information, and the correlation between the fingerprint information and the weights is analyzed.
The invention will be further described below with reference to a definitional analysis.
1. Analysis of the Simhash algorithm
Definition 1. The principle of the Simhash algorithm: for two given variables x and y, a hash function h always satisfies the following formula:
Pr_{h∈F}(h(x) = h(y)) = sim(x, y) (1)
where sim(x, y) ∈ [0, 1] is a similarity function. The similarity of the variables x and y is also commonly expressed with the Jaccard function, with sim(x, y) expressed as follows:
sim(x, y) = |x ∩ y| / |x ∪ y| (2)
h belongs to the hash function family F, which must meet the following conditions:
1) If d(x, y) ≤ d1, then Pr_{h∈F}(h(x) = h(y)) ≥ p1;
2) If d(x, y) ≥ d2, then Pr_{h∈F}(h(x) = h(y)) ≤ p2.
F is then called a (d1, d2, p1, p2)-sensitive hash function family, where d(x, y) denotes the distance between the variables x and y. Put plainly, if x and y are similar enough, the probability that they are mapped to the same hash value is sufficiently large; conversely, the probability that their hash values are equal is sufficiently small.
The biggest difference between a traditional hash function and the Simhash function is local sensitivity: if the input data is modified slightly in places, a traditional hash function may produce an entirely different result, whereas the results calculated by Simhash remain very similar. Therefore, the degree of similarity of the fingerprints generated by the Simhash function can be used to indicate the degree of similarity between the source data.
2. Simhash algorithm flow:
The flow of the Simhash algorithm is as follows: first define a space of f dimensions, then define the vector corresponding to each feature in this space, and then weight all the vectors by their own weights and sum them to obtain one vector as the result. Finally, a compression conversion is further performed on the result, the rule being: each vector yields a corresponding f-bit signature; if the value of a vector dimension is greater than 0, the corresponding bit of the signature is set to 1, and otherwise it is set to 0. Through such a transformation, the obtained signature carries the information of the value of the vector in each dimension.
The flow chart of the Simhash algorithm is as shown in Figure 1. The specific steps of the Simhash algorithm are as follows:
Step 1: Initialization
Determine the Simhash bit number and the f-dimensional vector space according to the data set size and the storage cost, and initialize an f-bit binary number s to 0.
Step 2: Document pre-processing
This mainly includes two parts. The first part is word segmentation: finding the feature vocabulary of the document, removing stop words, and so on. The second is weight assignment; in general, the calculation of the weights is ignored here and each weight is set to 1.
Step 3: Generating hash values
Calculate an f-bit hash value for each feature word in Step 2 using a traditional hashing algorithm, and perform the following operation: for each bit, if the hash bit is 1, add the word's weight to the corresponding vector dimension, and otherwise subtract it.
Step 4: Compressed transform
For the finally produced vector V, perform a conversion on each bit: set the signature bit to 1 if the dimension is greater than 0, and to 0 otherwise.
Step 5: Fingerprint generation
Output the final signature S as the fingerprint of the document; the Hamming distance or the edit distance is then used to calculate similarity.
Step 6: Distance calculation
The Simhash algorithm uses the Hamming distance for similarity calculation. The Hamming distance measures the similarity between two documents by comparing the number of differing bits in the two document fingerprints. The larger the Hamming distance, the lower the similarity of the two strings; conversely, the smaller the distance, the higher the similarity. For binary strings, the Hamming distance can be calculated with an XOR operation.
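The six steps above can be sketched as follows. This is a minimal illustration, assuming Python's built-in md5 (truncated to f bits) as the per-word hash and a 64-bit fingerprint; the patent does not fix either choice:

```python
import hashlib

def simhash(weighted_words, f=64):
    """Compute an f-bit Simhash fingerprint from (word, weight) pairs."""
    v = [0] * f  # Step 1: f-dimensional vector initialized to 0
    for word, weight in weighted_words:
        # Step 3: f-bit hash of the word (md5 truncated to f bits, an assumption)
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            # add the weight where the hash bit is 1, subtract it where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # Step 4: compressed transform - signature bit i is 1 iff v[i] > 0
    s = 0
    for i in range(f):
        if v[i] > 0:
            s |= 1 << i
    return s  # Step 5: the fingerprint S
```

In the traditional algorithm of Step 2 every weight is 1; in the E-Simhash variant described later, the weight is replaced by the thresholded entropy weight.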
The present invention will be further described below with reference to examples.
Embodiment 1:
Let a and b be two binary numbers, where a = 00110 and b = 01110. The two binary numbers a and b differ only in the second bit, therefore Hamming(a, b) = 1. An XOR operation can also be used: count the number of 1s in the XOR result. Here a XOR b = 01000 contains a single 1, therefore the Hamming distance is 1.
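The XOR-and-count procedure of Embodiment 1 can be written directly; a short sketch:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary numbers:
    the number of 1 bits in a XOR b."""
    return bin(a ^ b).count("1")

# Embodiment 1: a = 00110 and b = 01110 differ in exactly one bit
assert hamming(0b00110, 0b01110) == 1
```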
In weight calculation, the traditional Simhash algorithm usually sets the weight to 1 or to the number of occurrences of the feature word. This easily causes information loss and reduces the accuracy of the final Simhash fingerprint. Moreover, the Simhash algorithm does not reflect the vocabulary distribution information: adjusting the positions of key feature words does not affect the finally generated Simhash fingerprint. As shown in Fig. 2, adjusting the positions of two keywords may result in an entirely different final meaning, yet the fingerprints generated by the traditional Simhash algorithm are the same.
Embodiment 2
In order to improve the text duplicate removal effect and the accuracy of the Simhash algorithm, and to overcome the shortcoming that the Simhash algorithm cannot reflect distribution information, a Simhash algorithm based on information entropy weighting (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy, and adds text distribution information by optimizing the weight and threshold calculations in the Simhash algorithm, so that the finally generated fingerprint better reflects the proportion of key information; the correlation between the fingerprint information and the weights is analyzed.
Simulation experiments show that optimizing the weight calculation can effectively improve the performance of the Simhash algorithm: the E-Simhash algorithm is superior to the traditional Simhash algorithm in terms of duplicate removal rate, recall rate, and F value, and achieves a good effect in text duplicate removal.
In the present invention, (1) term frequency-inverse document frequency includes:
The term frequency-inverse document frequency (TF-IDF) algorithm is a common text feature weight calculation method. The TF-IDF value of a feature word tk in document dj is denoted tfidf(tk, dj) and is defined as follows:
Definition 2. The frequency tf(tk, dj) with which the feature word tk appears in document dj is
tf(tk, dj) = n_{j,k} / ∑_i n_{j,i} (3)
where n_{j,k} denotes the number of times the feature word tk appears in document dj, and ∑_i n_{j,i} denotes the number of all feature words in document dj.
Definition 3. The inverse document frequency idf(tk) is a coefficient that measures the importance of a feature word, defined as:
idf(tk) = log(|D| / |{j: tk ∈ dj}|) (4)
where {j: tk ∈ dj} is the set of documents containing the feature word tk, and |D| is the total number of documents in the corpus.
Definition 4. The TF-IDF function, the term frequency weight of a feature word, is defined as:
wk = tfidf(tk, dj) = tf(tk, dj) * idf(tk) (5)
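Definitions 2 through 4 can be sketched over a toy corpus as follows; a minimal illustration, not a production implementation (it assumes every queried term occurs in at least one document, so the idf denominator is nonzero):

```python
import math

def tf(term, doc):
    """Definition 2: occurrences of the term divided by total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Definition 3: log of total documents over documents containing the term.
    Assumes the term appears in at least one document of the corpus."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    """Definition 4, equation (5): w_k = tf(t_k, d_j) * idf(t_k)."""
    return tf(term, doc) * idf(term, corpus)
```

For example, with corpus = [["a", "b", "a"], ["b", "c"]], the word "b" occurs in every document, so its idf (and hence its TF-IDF weight) is 0, while "a" is weighted up as a discriminating term.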
In the present invention, (2) the information entropy includes:
The information entropy represents the measure of the uncertainty of the result before a random event occurs and, after the random event has occurred, the amount of information obtained from the event.
According to the definition of information entropy:
H(X) = -∑_{xi∈X} P(xi) log2 P(xi) (6)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), ..., xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
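Equation (6) can be sketched directly; zero-probability outcomes are skipped, following the usual convention 0·log 0 = 0:

```python
import math

def entropy(probs):
    """Equation (6): H(X) = -sum P(x_i) * log2 P(x_i) over the probability space."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform two-outcome space gives 1 bit of entropy, a certain outcome gives 0 bits, and a uniform four-outcome space gives 2 bits.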
In the present invention, (3) the left and right information entropy:
The left and right entropy refer to the entropy of the left boundary and the entropy of the right boundary of a multi-character expression. The formulas of the left and right entropy are as follows:
EL(W) = -∑_a P(aW|W) log2 P(aW|W) (7)
ER(W) = -∑_b P(Wb|W) log2 P(Wb|W) (8)
where W denotes a word, EL(W) denotes the left entropy of the word, and P(aW|W) denotes the probability that a word a appears on the left side of the word W; the variable a is a changing value denoting the vocabulary combined with W. ER(W) is the right entropy, defined in the same way on the right side.
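Equations (7) and (8) are the same computation applied to the neighbor words observed on either side of W; a small sketch, where the neighbor counts are assumed to have been collected from a corpus beforehand (the counts below are hypothetical):

```python
import math
from collections import Counter

def side_entropy(neighbor_counts):
    """Entropy of the words observed on one side of W, equations (7)/(8):
    E(W) = -sum P(a|W) * log2 P(a|W), with P estimated from counts."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

# hypothetical left-neighbor counts of some word W observed in a corpus
left_neighbors = Counter({"big": 2, "small": 1, "red": 1})
left_entropy = side_entropy(left_neighbors)
```

The more varied the neighbors of W, the higher its boundary entropy, i.e., the more independently the word is used.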
In the present invention, (4) the entropy weighting calculation method includes:
The present invention uses an entropy weighting calculation method. Here the left and right information entropies of a feature word are averaged:
Hk(w) = (EL(w) + ER(w)) / 2 (9)
Hk(w) denotes the entropy information amount of the word. The entropy factor Hk is added into the weight calculation formula, and the root mean square of the two is taken as the word weight, as follows:
wk = sqrt((tfidf(tk, dj)^2 + Hk^2) / 2) (10)
The physical significance of the above formula is: the more times the feature word tk appears in document dj, and the fewer the documents in the training set in which this feature word appears, the larger its information content, and the higher its weight.
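Under the reading that "root mean square of the two" means sqrt((tfidf^2 + Hk^2)/2), equations (9) and (10) can be sketched as follows, with the weight threshold applied as a cap (the min-capping form of formula (16) is an assumption consistent with the stated purpose of preventing excessive weights):

```python
import math

def entropy_weight(tfidf_value, left_entropy, right_entropy, w_t=None):
    """Entropy-weighted word weight, equations (9)-(10), optionally
    capped at the threshold W_t (capping form is an assumption)."""
    h_k = (left_entropy + right_entropy) / 2          # equation (9)
    w = math.sqrt((tfidf_value ** 2 + h_k ** 2) / 2)  # equation (10)
    if w_t is not None:
        w = min(w, w_t)                               # formula (16), assumed form
    return w
```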
In the present invention, the Simhash algorithm based on entropy weighting (E-Simhash) specifically includes:
First, weights are obtained by weighting based on the TF-IDF algorithm and information entropy, the feature vocabulary is sorted according to its distribution in the document, and the hash generated for each feature word is XORed with its position.
However, after the improved weight calculation, factors such as an incomplete training set may lead to excessively large weights for some feature words, which ultimately causes the precision ratio to decline. In order to solve this problem, a weight threshold Wt is introduced. The problem caused by weight imbalance is proved below.
Suppose n keywords extracted from a document are {p1, p2, p3, ..., pn}, and the weight of each keyword is W = {w1, w2, w3, ..., wn}. Hash values are generated for the n keywords, with result H = {h1, h2, h3, ..., hn}. After superposition, the second-level fingerprint F = {f1, f2, f3, ..., fm} is generated, where m is the fingerprint bit number. Finally, the Simhash fingerprint S is generated according to whether each fi in F is greater than 0.
If there exists a certain feature word pk whose weight satisfies
wk >> wj, j ∈ [1, n], j ≠ k (11)
then S is mainly determined by pk. The proof is as follows:
Let hi = {a_{i1}, a_{i2}, a_{i3}, ..., a_{im}}, where a_{ij} ∈ {-1, +1} encodes the j-th bit of hi. Then
fj = ∑_{i=1}^{n} wi * a_{ij} (12)
Extracting wk gives
fj = wk * a_{kj} + ∑_{i≠k} wi * a_{ij} (13)
Because wk >> wj for all j ≠ k, we have
|wk * a_{kj}| >> |∑_{i≠k} wi * a_{ij}| (14)
So at this time:
sign(fj) = sign(wk * a_{kj}) (15)
Finally, F is mainly related to pk. The proof is complete.
The above proof also reflects the influence of the weights on the Simhash fingerprint.
After the weight threshold is introduced, the weight calculation at this time is as shown in formula (16):
w'k = min(wk, Wt) (16)
i.e., any weight exceeding Wt is capped at the threshold Wt.
In conclusion E-Simhash algorithm flow is as shown in Figure 3.
E-Simhash algorithm has that following three points are different from traditional Simhash algorithm, mainly draws on the basis of TF-IDF
Enter comentropy and carry out term weight function calculating, and use the square mean number of the two as last term weight function, is simultaneously
The situation for avoiding weight excessively high leads to distortion of fingerprint, weight threshold is introduced, shown in calculation such as formula (16).Finally generating
Exclusive or is carried out with Feature Words position when Feature Words hash, making its hash includes the location distribution information of document.
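Combining the three changes, E-Simhash fingerprint generation can be sketched as follows. This is a minimal illustration, assuming md5 as the base hash and the word's index in the document as its position factor; the patent does not fix either choice:

```python
import hashlib

def e_simhash(weighted_words, f=64):
    """E-Simhash sketch: weighted_words is a list of (word, position, weight)
    triples, where weight is the thresholded entropy weight. Each word hash
    is XORed with its position so the fingerprint carries location information."""
    mask = (1 << f) - 1
    v = [0] * f
    for word, position, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & mask
        h ^= position & mask  # inject the location factor via XOR
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Unlike traditional Simhash, swapping the positions of two keywords changes their per-word hashes and hence, in general, the resulting fingerprint, which is the behavior illustrated in Fig. 2.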
The invention will be further described below with reference to specific simulation experiments.
Simulation experiments and analysis: the present invention mainly simulates real application scenarios to verify whether the performance of the E-Simhash algorithm is superior to that of the traditional Simhash algorithm.
Experimental environment and data set:
The experimental environment is deployed on a desktop computer, with machine parameters as follows:
Table 1. Experimental environment parameters
The data set is the 2012 edition of the whole-network news data of Sogou Labs, consisting of classified news from nearly 20 columns of multiple news sites. Articles shorter than 800 characters are rejected, and 1565 articles are randomly selected from the remainder for the subsequent experiments.
First, from the 1565 news articles, several are randomly selected according to a modification ratio and subjected to random operations such as modification, deletion, shifting, and replacement, while controlling the modified articles to have a similarity with the original articles at a certain threshold T, generating the sample set to be tested. The traditional Simhash algorithm is then compared with the algorithm of this patent, and the relevant experimental indexes are recorded.
Analysis of experimental results
Four indexes are commonly used to assess the experimental results: the duplicate removal rate, the precision ratio, the recall rate, and the F value. The duplicate removal rate refers to the ratio of correctly classified samples to the total samples; for this experiment, it is the ratio of the number of articles predicted to come from the same source to the total number of articles.
The invention will be further described below with reference to specific experiments.
Experiment 1: Comparison of duplicate removal rates
1162 of the 1565 news articles are randomly selected and arbitrarily modified, and different Hamming distances are chosen to compare the accuracy rates of the two algorithms. In the test T = 15%, i.e., each news article keeps no more than 15% modification; the fingerprint length is 128, and the word weight threshold is Wt = 90. The experimental results are shown in Figure 4.
The experimental results show that the E-Simhash algorithm has a very high duplicate removal rate whenever the Hamming distance is greater than 2. In practice the Hamming distance is generally taken to be around 10, so the duplicate removal effect of the E-Simhash algorithm is better.
Experiment 2: Comparison under modified threshold T
The present invention tests the similarity threshold T of the modified text under modifications of 5%, 10%, 15%, and 20% respectively; the Hamming distance is selected as 10, i.e., fingerprints at distance below 10 are considered similar, and the duplicate removal rates of the two algorithms are compared. From the experimental results shown in Figure 5, the duplicate removal rate of the E-Simhash algorithm is better than that of the traditional Simhash algorithm, by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and as the amount of article modification increases, the duplicate removal rates of both algorithms show a downward trend. The experimental results show that, at different modification thresholds T, the E-Simhash algorithm is superior to the traditional Simhash algorithm.
Experiment 3: Comparison of precision ratio, recall rate, and F value
In this experiment, an article is randomly selected from the news set and modified at random while guaranteeing a 90% similarity with the original text, and the precision ratio, recall ratio, and F1 value of the traditional Simhash fingerprint and the E-Simhash algorithm are compared. The Hamming distance is chosen as 10; the experiment is carried out 100 times and the average value is taken as the final result, as shown in Figure 6. The experimental data show that the E-Simhash algorithm is better than the traditional Simhash algorithm, with precision ratio 0.963:0.818, recall rate 0.867:0.621, and F1 value 0.912:0.706. The results show that the E-Simhash algorithm is greatly improved over the common Simhash algorithm in terms of precision ratio, recall rate, and F value, which fully demonstrates the superiority of the E-Simhash algorithm.
The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent replacements, and improvements made within the spirit and principle of the present invention shall all be included in the protection scope of the present invention.