CN109948125A - Method and system of the improved Simhash algorithm in text duplicate removal - Google Patents


Info

Publication number
CN109948125A
CN109948125A
Authority
CN
China
Prior art keywords
weight
duplicate removal
algorithm
simhash
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910225442.9A
Other languages
Chinese (zh)
Other versions
CN109948125B (en
Inventor
张仕斌
张航
盛志伟
万国根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haikou Lingjie Information Technology Co.,Ltd.
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN201910225442.9A
Publication of CN109948125A
Application granted
Publication of CN109948125B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention belongs to the technical field of information processing and discloses a method and system for applying an improved Simhash algorithm to text deduplication. Term weights are computed using the TF-IDF algorithm combined with information entropy, terms are ordered by their distribution within the document, and the hash generated for each feature term is XORed with the term's position. After the improved weight calculation, a weight threshold Wt is introduced and text distribution information is incorporated, so that the final fingerprint reflects the proportion of key information, and the relationship between fingerprint information and weights is analyzed. Simulation experiments show that the optimized weight calculation effectively improves the performance of the Simhash algorithm: the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall, and F-measure, and achieves good results in text deduplication.

Description

Method and system of the improved Simhash algorithm in text duplicate removal
Technical field
The invention belongs to the technical field of information processing and more particularly relates to a method and system for applying an improved Simhash algorithm to text deduplication.
Background technique
The prior art commonly used in the trade is currently as follows:
For removing redundant data, the Simhash algorithm is currently the generally acknowledged best deduplication algorithm. It is a locality-sensitive hashing algorithm: high-dimensional data is probabilistically reduced in dimension and mapped to a short, fixed-length fingerprint, and the degree of similarity between data items is then reflected by comparing fingerprints, usually via Hamming distance or edit distance. The advantages of the Simhash algorithm are fast processing speed and high result precision.
Nowadays, Simhash is widely used in fields such as near-duplicate text detection, redundant data deduplication, and anomaly detection. Dong Bo, Zheng Qinghua et al. proposed a multi-fingerprint Simhash algorithm that computes similarity over k-dimensional surfaces using multiple fingerprint values, effectively solving the problems of a single fingerprint and severe information loss. Chen Bo, Pan Yongtao et al. added a subtraction step to the Simhash algorithm, subtracting a threshold T from the finally merged result sequence, thereby improving the accuracy of the algorithm. Ni S, Qian Q et al. combined Simhash with a CNN for malware detection, converting samples into grayscale images to improve malware recognition rate and performance.
In summary, the problems with the existing technology are:
(1) In the prior art, data processing methods are fast but their result precision is low.
(2) The Simhash algorithm's shortcomings in weight calculation mean that the generated hash fingerprint cannot reflect the proportion of key feature terms. (3) The prior art fails to capture the distribution information of the document's feature terms.
The difficulty of solving the above technical problems:
To improve the text deduplication effect and the precision of the Simhash algorithm, and to overcome the Simhash algorithm's inability to capture distribution information, the concept of information entropy is introduced: keywords in a document are weighted by entropy, the weight calculation formula is optimized, and keyword distribution information is added to the hash calculation, thereby optimizing the traditional Simhash algorithm. The feasibility and soundness of the algorithm are finally verified by simulation experiments.
The significance of solving the above technical problems:
The algorithm introduces TF-IDF and information entropy and, by optimizing the weight and threshold calculations in the Simhash algorithm, incorporates text distribution information so that the final fingerprint better reflects the proportion of key information; the relationship between fingerprint information and weights is also analyzed. Simulation experiments show that the optimized weight calculation effectively improves the performance of the Simhash algorithm: the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall, and F-measure, and achieves good results in text deduplication.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method and system for applying an improved Simhash algorithm to text deduplication.
The invention is realized as follows: the method of applying an improved Simhash algorithm to text deduplication comprises computing term weights using the TF-IDF algorithm combined with information entropy, ordering terms by their distribution within the document, and XORing the hash generated for each feature term with the term's position;
after the improved weight calculation, a weight threshold Wt is introduced and text distribution information is incorporated, so that the final fingerprint reflects the proportion of key information, and the relationship between fingerprint information and weights is analyzed.
Further, the method of applying the improved Simhash algorithm to text deduplication specifically comprises:
Step 1, initialization:
determine the Simhash bit width and the f-dimensional vector space according to the data set size and storage cost, and initialize an f-bit binary number s to 0;
Step 2, document preprocessing:
segment the document into words and remove stop words, forming the document's set of feature terms M = { };
Step 3, weight calculation:
compute the TF-IDF value and the left and right information entropy of each feature term after segmentation, take the root mean square of the TF-IDF value and the entropy as the term's final weight, and introduce the threshold Wt to prevent the document's features from being distorted;
Step 4, hash calculation:
compute a hash for each feature term from step 2 and XOR it with a position factor to obtain the term's final hash value, which thus carries the term's position information; it is denoted H = { }, where = hash();
Step 5, accumulation: combine the term weights generated in step 3 with the f-bit hash values generated in step 4 using the following operation:
Step 6, compression:
for each component of the resulting second-level fingerprint vector V, perform a conversion, ultimately generating the document's f-bit hash fingerprint S.
Further, suppose n keywords {p1, p2, p3, …, pn} are extracted from a document, with weights W = {w1, w2, w3, …, wn}; hash values are generated for the n keywords, giving H = {h1, h2, h3, …, hn}; after accumulation, the second-level fingerprint F = {f1, f2, f3, …, fm} is generated, where m is the fingerprint bit width, and the Simhash fingerprint S is finally generated according to whether each fi in F is greater than 0.
If there exists a feature word pk whose weight satisfies
wk >> wj, j ∈ [1, n] ∩ j ≠ k,
then S is determined by pk.
Further, after introducing weight threshold, weight calculation is shown below:
Further, comentropy are as follows:
H(X) = −∑_{xi∈X} P(xi) log2 P(xi)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), …, xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
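As a concrete check of the definition above, the entropy of a small probability space can be computed directly; the sketch below is illustrative only and not part of the patent:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum over x_i of P(x_i) * log2 P(x_i); zero-probability
    outcomes contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain (1 bit); a biased coin carries less information.
print(shannon_entropy([0.5, 0.5]))   # 1.0
print(shannon_entropy([0.9, 0.1]))   # ~0.469
```

The more skewed the distribution, the lower the entropy, which is exactly why entropy can serve as a measure of a feature word's information content.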
Further, the information entropy includes the left and right information entropy, with formulas as follows:
EL(W) = −∑_a P(aW|W) log2 P(aW|W)
ER(W) = −∑_b P(Wb|W) log2 P(Wb|W)
where W denotes a word, EL(W) denotes the left entropy of the word, P(aW|W) denotes the probability of a word a appearing to the left of W, the variable a ranging over the words that combine with W; ER(W) is the right entropy, defined analogously.
Further, the entropy weighting calculation method comprises:
the left and right information entropy of a feature word are averaged, with Hk(w) denoting the entropy information content of the word; the entropy factor Hk is added to the weight calculation formula, and the root mean square of the two is taken as the word's weight, as follows:
The more often the feature word tk appears in document dj, and the fewer documents in the training set contain the word, the greater its information content, and hence the higher its weight.
Another object of the present invention is to provide a text deduplication control system implementing the above method of applying the improved Simhash algorithm to text deduplication.
Another object of the present invention is to provide a redundant-data-removal storage medium implementing the above method of applying the improved Simhash algorithm to text deduplication.
In conclusion advantages of the present invention and good effect are as follows:
The present invention has good effect in terms of removing redundant data, and Simhash algorithm is a kind of local sensitivity Hash Algorithm, high dimensional data can be carried out probability dimensionality reduction and be mapped as the less and fixed fingerprint of digit by it, later again to fingerprint into Row similarity-rough set reflects the similarity degree between data.Wherein similarity-rough set usually using Hamming distances or editor away from From.Simhash algorithm advantage is that processing speed is fast, and result precision is high.
The present invention for traditional Si mhash algorithm in terms of weight calculation shortcoming and algorithm in cannot in view of text The distributed intelligence of shelves feature vocabulary, the present invention are made by optimization weight calculation using TF-IDF and the square mean number of comentropy It is characterized the weighted value of word, it is contemplated that fractional weight is excessive to lead to information distortion, introduces weight threshold, and on this basis will be special The location information of sign word is introduced into hash calculating, to promote the duplicate removal rate of Simhash algorithm, precision ratio, and by imitative True experiment demonstrates E-Simhash algorithm and is superior to traditional Simhash algorithm in all respects.
In the simulation experiments, whole-network news data from Sogou Labs was used as the document set. In the deduplication rate comparison experiment, the results are as shown in Fig. 4. As the similarity threshold T was varied, the E-Simhash algorithm beat the traditional Simhash algorithm in deduplication rate by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and the deduplication rate of both declined as the amount of article modification increased; these simulation results are shown in Fig. 5. Finally, in the comparison of precision, recall, and F-value, as shown in Fig. 6, the E-Simhash algorithm beat the traditional Simhash algorithm with precision 0.963:0.818, recall 0.867:0.621, and F1 value 0.912:0.706. The simulation results show that the optimized weight calculation effectively improves the performance of the Simhash algorithm: the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall, and F-measure, and achieves good results in text deduplication.
Detailed description of the invention
Fig. 1 is the flow chart of the Simhash algorithm provided in an embodiment of the present invention.
Fig. 2 illustrates the influence of term position on Simhash, provided in an embodiment of the present invention.
Fig. 3 is the E-Simhash algorithm flow diagram provided in an embodiment of the present invention.
Fig. 4 compares deduplication rates under different Hamming distances, provided in an embodiment of the present invention.
Fig. 5 compares deduplication rates under different thresholds, provided in an embodiment of the present invention.
Fig. 6 is the comprehensive comparison chart provided in an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
In the prior art, data processing methods are fast but their result precision is low.
To solve the above problems, the present invention is described in detail below with specific analysis.
The method of applying an improved Simhash algorithm to text deduplication provided in an embodiment of the present invention comprises: computing term weights using the TF-IDF algorithm combined with information entropy, ordering terms by their distribution within the document, and XORing the hash generated for each feature term with the term's position;
after the improved weight calculation, a weight threshold Wt is introduced and text distribution information is incorporated, so that the final fingerprint reflects the proportion of key information, and the relationship between fingerprint information and weights is analyzed.
The invention is further described below with reference to definitions and analysis.
1. Analysis of the Simhash algorithm
Definition 1: the principle of the Simhash algorithm is that, for two given variables x and y, the hash function h always satisfies the following formula:
Pr_{h∈F}(h(x) = h(y)) = sim(x, y) (1)
where sim(x, y) ∈ [0, 1] is a similarity function; the similarity of the variables x and y is commonly expressed with the Jaccard function, and sim(x, y) is given as follows:
h belongs to the hash function family F, which must satisfy the following conditions:
1) if d(x, y) ≤ d1, then Pr_{h∈F}(h(x) = h(y)) ≥ p1;
2) if d(x, y) ≥ d2, then Pr_{h∈F}(h(x) = h(y)) ≤ p2.
F is then called a (d1, d2, p1, p2)-sensitive hash function family, where d(x, y) denotes the distance between the variables x and y. In plain terms: if x and y are sufficiently similar, the probability that they map to the same hash value is sufficiently large; conversely, the probability that their hash values are equal is sufficiently small.
The biggest difference between a traditional hash function and the Simhash function is local sensitivity: if small local modifications are made to the input data, a traditional hash function may produce a completely different result, whereas the results computed by Simhash remain very similar. The degree of similarity between the fingerprints generated by the Simhash function can therefore be used to indicate the degree of similarity between the source data.
2. Simhash algorithm flow:
The Simhash process first defines an f-dimensional space and the vector corresponding to each feature in that space, then weights all the vectors by their own weights and sums them, obtaining a single vector as the result. This result is finally compressed, by the rule: each vector yields a corresponding f-bit signature, where a signature bit is set to 1 if the value in that dimension is greater than 0 and to 0 otherwise. Through this transformation, the resulting signature carries the information of the vector's value in each dimension.
The flow chart of the Simhash algorithm is shown in Fig. 1. The specific steps of the Simhash algorithm are as follows:
Step 1: initialization
Determine the Simhash bit width and the f-dimensional vector space according to the data set size and storage cost, and initialize an f-bit binary number s to 0.
Step 2: document preprocessing
This mainly consists of two parts. The first is word segmentation: finding the document's feature terms, removing stop words, and so on. The second is weighting; in general the weight calculation is neglected here and the weight is simply set to 1.
Step 3: hash generation
Compute an f-bit hash value for each feature word from step 2 using a traditional hashing algorithm, and perform the following operation:
Step 4: compression
For the resulting vector V, perform a conversion on each component.
Step 5: fingerprint generation. Output the final signature S as the document's fingerprint; Hamming distance or edit distance is then computed to measure similarity.
Step 6: distance calculation. The Simhash algorithm uses Hamming distance for similarity calculation. The Hamming distance measures the similarity of two documents by counting the differing bits in their fingerprints: the larger the Hamming distance, the lower the similarity of the two strings, and vice versa. For binary strings, the Hamming distance of two numbers can be computed with an XOR operation.
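The six steps above can be sketched compactly as follows. This is an illustrative simplification: the use of md5 as the per-term hash and a 64-bit fingerprint width are assumptions of the sketch, not choices specified in the patent.

```python
import hashlib

def simhash(weighted_terms, bits=64):
    """Textbook Simhash: each term's hash bit contributes +weight (bit = 1)
    or -weight (bit = 0) to a per-bit accumulator; the fingerprint keeps
    the sign of each accumulator."""
    acc = [0.0] * bits
    for term, weight in weighted_terms:
        h = int.from_bytes(hashlib.md5(term.encode()).digest()[:8], "big")
        for j in range(bits):
            acc[j] += weight if (h >> j) & 1 else -weight
    return sum(1 << j for j in range(bits) if acc[j] > 0)

doc_a = [("simhash", 3.0), ("text", 2.0), ("duplicate", 1.0)]
doc_b = [("simhash", 3.0), ("text", 2.0), ("removal", 1.0)]
# Overlapping term sets tend to give fingerprints at a small Hamming distance.
print(bin(simhash(doc_a) ^ simhash(doc_b)).count("1"))
```

Note that a single term with positive weight reproduces its own hash bits exactly, since every accumulator takes the sign of that term's bit contribution.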
The present invention will be further described below with reference to examples.
Embodiment 1:
Suppose a and b are two binary numbers, where a = 00110 and b = 01110. The two binary numbers a and b differ only in the second bit, so Hamming(a, b) = 1. Equivalently, an XOR operation can be used: XOR the two numbers and count the 1s in the result. There is a single 1, so the Hamming distance is 1.
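The XOR-and-popcount computation described in this example can be written directly as:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance of two equal-length binary fingerprints:
    XOR them and count the 1 bits in the result."""
    return bin(a ^ b).count("1")

print(hamming(0b00110, 0b01110))  # 1: the two numbers differ in only one bit
```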
In weight calculation, the traditional Simhash algorithm usually sets the weight to 1 or to the number of occurrences of the feature word, which easily loses information and reduces the accuracy of the final Simhash fingerprint. Moreover, the Simhash algorithm does not reflect term distribution information: after the positions of key feature terms are adjusted, the final Simhash fingerprint is unchanged.
As shown in Fig. 2, swapping the positions of two keywords may completely change the final meaning, yet the fingerprints generated by the traditional Simhash algorithm are the same.
Embodiment 2
To improve the text deduplication effect and the accuracy of the Simhash algorithm, and to overcome the Simhash algorithm's inability to capture distribution information, an entropy-weighted Simhash algorithm (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy and, by optimizing the weight and threshold calculations in the Simhash algorithm, incorporates text distribution information so that the final fingerprint better reflects the proportion of key information; the relationship between fingerprint information and weights is also analyzed.
Simulation experiments show that the optimized weight calculation effectively improves the performance of the Simhash algorithm: the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall, and F-measure, and achieves good results in text deduplication.
In the present invention, (1) term frequency-inverse document frequency comprises:
The term frequency-inverse document frequency (TF-IDF) algorithm is a common method of computing text feature weights. The TF-IDF value of a feature word tk in document dj is denoted tfidf(tk, dj) and defined as follows:
Definition 2: the frequency tf(tk, dj) with which the feature word tk appears in document dj is
tf(tk, dj) = n_{j,k} / ∑_i n_{j,i} (3)
where n_{j,k} denotes the number of times the feature word tk appears in document dj, and ∑_i n_{j,i} denotes the total number of feature words in document dj.
Definition 3: the inverse document frequency idf(tk) is a coefficient measuring the importance of a feature word, defined as:
idf(tk) = log(|D| / |{j: tk ∈ dj}|) (4)
where {j: tk ∈ dj} is the set of documents containing the feature word tk, and |D| is the total number of documents in the corpus.
Definition 4: the TF-IDF function defines the term frequency weight of a feature word as:
wk = tfidf(tk, dj) = tf(tk, dj) · idf(tk) (5)
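Definitions 2 through 4 can be checked with a minimal sketch over a hypothetical three-document corpus; the natural-logarithm base used here is an assumption, since the text does not fix the base of the logarithm in idf:

```python
import math

def tf(term, doc):
    """tf(t_k, d_j): occurrences of the term divided by the document's word count."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """idf(t_k) = log(|D| / |{j : t_k in d_j}|), per definition 3."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    """Definition 4: w_k = tf(t_k, d_j) * idf(t_k)."""
    return tf(term, doc) * idf(term, corpus)

corpus = [["simhash", "text", "duplicate"],
          ["text", "mining"],
          ["simhash", "fingerprint", "text"]]
print(tfidf("simhash", corpus[0], corpus))  # (1/3) * ln(3/2)
print(tfidf("text", corpus[0], corpus))     # 0.0: "text" occurs in every document
```

A word that appears in every document gets idf = 0 and hence zero weight, which is exactly the down-weighting of uninformative common words that TF-IDF is meant to achieve.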
In the present invention, (2) information entropy comprises:
Information entropy measures the uncertainty of the outcome before a random event occurs and, once the event has occurred, the amount of information obtained from it.
According to the definition of information entropy:
H(X) = −∑_{xi∈X} P(xi) log2 P(xi) (6)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), …, xn: P(xn)), and H(X) denotes the measure of the uncertainty of the random variable X.
In the present invention, (3) left and right information entropy:
The left and right entropy are the entropy of the left boundary and the entropy of the right boundary of a multi-character expression. The formulas of the left and right entropy are as follows:
EL(W) = −∑_a P(aW|W) log2 P(aW|W)
ER(W) = −∑_b P(Wb|W) log2 P(Wb|W)
where W denotes a word, EL(W) denotes the left entropy of the word, P(aW|W) denotes the probability of a word a appearing to the left of W, the variable a ranging over the words that combine with W; ER(W) is the right entropy, defined in the same way.
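A minimal sketch of estimating the left entropy from a hypothetical tokenized corpus, following the description of P(aW|W) above; the right entropy is the mirror image over right-hand neighbours:

```python
import math
from collections import Counter

def left_entropy(word, sentences):
    """E_L(W): entropy of the distribution of words appearing immediately
    to the left of W, estimated by counting left neighbours in the corpus."""
    neighbours = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word and i > 0:
                neighbours[tokens[i - 1]] += 1
    total = sum(neighbours.values())
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())

sents = [["improved", "simhash"], ["traditional", "simhash"],
         ["improved", "simhash"], ["entropy", "simhash"]]
# Left contexts of "simhash": "improved" x2, "traditional" x1, "entropy" x1.
print(left_entropy("simhash", sents))  # 1.5
```

A word that appears in many different contexts has high left/right entropy and therefore carries more independent information, which is why entropy is folded into the weight.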
In the present invention, (4) the entropy weighting calculation method comprises:
The present invention uses an entropy weighting method:
the left and right information entropy of a feature word are averaged, with Hk(w) denoting the entropy information content of the word. The entropy factor Hk is added to the weight calculation formula, and the root mean square of the two is taken as the word's weight, as follows:
The physical meaning of the above formula: the more often the feature word tk appears in document dj, and the fewer documents in the training set contain this feature word, the greater its information content, and hence the higher its weight.
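Reading "the root mean square of the two" literally, the combined weight can be sketched as below; the formula image is not reproduced in this text, so the exact form (in particular the 1/2 factor inside the root) is an assumption of this sketch:

```python
import math

def entropy_weight(tfidf_value, entropy_value):
    """Combined weight as the root mean square of the TF-IDF value and the
    entropy factor H_k (assumed form: sqrt((tfidf^2 + H_k^2) / 2))."""
    return math.sqrt((tfidf_value ** 2 + entropy_value ** 2) / 2)

print(entropy_weight(3.0, 4.0))  # sqrt(12.5), about 3.536
```

The RMS form rewards words that score high on both signals while never letting either factor alone vanish from the weight.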
In the present invention, the entropy-weighted Simhash algorithm (E-Simhash) specifically comprises:
First, weights are obtained using the TF-IDF algorithm combined with information entropy, terms are ordered by their distribution within the document, and the hash generated for each feature term is XORed with its position.
However, after the improved weight calculation, factors such as an incomplete training set can make the weights of some feature words too large, ultimately lowering precision. To solve this problem, the weight threshold Wt is introduced. The problem caused by unbalanced weights is proved below.
Suppose n keywords {p1, p2, p3, …, pn} are extracted from a document, with weights W = {w1, w2, w3, …, wn}. Hash values are generated for the n keywords, giving H = {h1, h2, h3, …, hn}; after accumulation, the second-level fingerprint F = {f1, f2, f3, …, fm} is generated, where m is the fingerprint bit width, and the Simhash fingerprint S is finally generated according to whether each fi in F is greater than 0.
If there exists a feature word pk whose weight satisfies
wk >> wj, j ∈ [1, n] ∩ j ≠ k (11)
then S is mainly determined by pk. The proof is as follows:
Let hi = {ai1, ai2, ai3, …, aim}, where aij is a binary variable; then
Extracting wk, we have
Because wk >> wj, we therefore have:
So at this point:
Finally, F is mainly related to pk, which completes the proof.
The above proof also reflects the influence of weights on the Simhash fingerprint.
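The proof above can be illustrated numerically: when one keyword's weight dwarfs the others, the signed bit accumulation, and hence the fingerprint, collapses onto that keyword's hash bits. The 8-bit hashes and weights below are hypothetical:

```python
def simhash_sign(hashes, weights, bits=8):
    """Signed accumulation at the heart of Simhash: bit j of each hash
    contributes +w (bit 1) or -w (bit 0); the fingerprint is the sign pattern."""
    acc = [0.0] * bits
    for h, w in zip(hashes, weights):
        for j in range(bits):
            acc[j] += w if (h >> j) & 1 else -w
    return [1 if v > 0 else 0 for v in acc]

hashes = [0b10110100, 0b01001011, 0b11100001]
dominant = simhash_sign(hashes, [100.0, 1.0, 1.0])
expected = [(hashes[0] >> j) & 1 for j in range(8)]
print(dominant == expected)  # True: the fingerprint is decided by keyword 1 alone
```

In every bit position the dominant term contributes ±100 while the rest contribute at most ±2 combined, so the sign, and thus the fingerprint bit, is always the dominant term's; this is precisely the distortion the weight threshold Wt is introduced to prevent.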
After the weight threshold is introduced, the weight is calculated as shown in formula (16):
In conclusion E-Simhash algorithm flow is as shown in Figure 3.
The E-Simhash algorithm differs from the traditional Simhash algorithm in three main respects: information entropy is introduced on top of TF-IDF for term weight calculation, with the root mean square of the two used as the final term weight; to avoid fingerprint distortion caused by excessively high weights, a weight threshold is introduced, calculated as in formula (16); and when generating each feature word's hash, the hash is XORed with the word's position, so that it carries the document's position distribution information.
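The three modifications can be combined into one illustrative sketch. The md5-based term hash, the 64-bit width, the min-clamp reading of the weight threshold, and the use of the raw position index as the "position factor" are all assumptions of this sketch, since the patent's formula (16) is not reproduced in this text:

```python
import hashlib

def e_simhash(terms, weights, w_t=90.0, bits=64):
    """Illustrative E-Simhash flow: clamp each weight at the threshold W_t,
    XOR each term's hash with its position index, then perform the usual
    signed accumulation and sign compression."""
    acc = [0.0] * bits
    for pos, (term, w) in enumerate(zip(terms, weights), start=1):
        w = min(w, w_t)  # weight threshold: keep one term from distorting the fingerprint
        h = int.from_bytes(hashlib.md5(term.encode()).digest()[:8], "big")
        h ^= pos  # fold the term's position into its hash (step 4)
        for j in range(bits):
            acc[j] += w if (h >> j) & 1 else -w
    return sum(1 << j for j in range(bits) if acc[j] > 0)

fp1 = e_simhash(["improved", "simhash", "text"], [2.0, 3.0, 1.0])
fp2 = e_simhash(["text", "simhash", "improved"], [1.0, 3.0, 2.0])
print(hex(fp1))
print(hex(fp2))  # the position factor makes each term's hash depend on where it appears
```

With entropy-derived weights (see the entropy weighting section) fed in as `weights`, this reproduces the full pipeline of Fig. 3 under the stated assumptions.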
The simulation of the invention is further described below with reference to specific experiments.
Simulation experiments and analysis: the present invention mainly simulates real application scenarios to verify whether the performance of the E-Simhash algorithm is superior to that of the traditional Simhash algorithm.
Experimental environment and data set:
The experimental environment was deployed on a desktop computer, with machine parameters as follows:
Table 1. Experimental environment parameters
The data set is the 2012 edition of the whole-network news data from Sogou Labs, comprising classified news from nearly 20 columns of multiple news sites. Articles of fewer than 800 characters were discarded, and 1,565 articles were randomly selected for the subsequent experiments.
First, from the 1,565 news articles, several were randomly selected according to the modification ratio and subjected to random operations such as modification, deletion, shifting, and replacement, with the modified articles kept within a certain similarity threshold T of the originals, generating the sample set under test. The traditional Simhash algorithm was then compared with the algorithm in this patent, and the relevant experimental metrics were collected.
Analysis of experimental results
Four metrics are commonly used to assess the experimental results: deduplication rate, precision, recall, and F-value. The deduplication rate is the ratio of correctly classified samples to total samples; for this experiment, it is the ratio of the number of articles predicted to come from the same source to the total number of articles.
Below with reference to specific experiment, the invention will be further described.
Experiment 1: comparison of deduplication rates:
1,162 of the 1,565 news articles were randomly selected and arbitrarily modified; different Hamming distances were chosen and the accuracy of the two algorithms compared. In the test, T = 15%, i.e. each article received at most 15% modification; the fingerprint length is 128 and the word weight threshold Wt = 90. The experimental results are shown in Fig. 4.
The results show that the E-Simhash algorithm has a very high deduplication rate whenever the Hamming distance is greater than 2. In practice the Hamming distance is generally around 10, so the deduplication effect of the E-Simhash algorithm is better.
Experiment 2: comparison under modified threshold T:
The similarity threshold T of the modified text was tested at 5%, 10%, 15%, and 20% modification; the Hamming distance was set to 10, i.e. fingerprints within distance 10 are considered similar, and the deduplication rates of the two algorithms were compared. As shown in the experimental results of Fig. 5, the E-Simhash algorithm beat the traditional Simhash algorithm in deduplication rate by 0.833:0.679, 0.751:0.529, 0.687:0.476, and 0.661:0.451 respectively, and the deduplication rate of both declined as the amount of article modification increased. The results show that at every modification threshold T, the E-Simhash algorithm outperforms the traditional Simhash algorithm.
Experiment 3: comparison of precision, recall, and F-value:
In this experiment, an article was randomly selected from the news set and randomly modified while keeping 90% similarity with the original, and the precision, recall, and F1 value of the Simhash fingerprint baseline and the E-Simhash algorithm were compared, with the Hamming distance set to 10. The experiment was run 100 times and the averages taken as the final result, shown in Fig. 6. The experimental data show that the E-Simhash algorithm beat the traditional Simhash algorithm with precision 0.963:0.818, recall 0.867:0.621, and F1 value 0.912:0.706. The results show that the E-Simhash algorithm greatly improves on the ordinary Simhash algorithm in precision, recall, and F-value, sufficiently demonstrating the superiority of the E-Simhash algorithm.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (9)

1. A method of applying an improved Simhash algorithm to text deduplication, characterized in that the method comprises: computing term weights using the TF-IDF algorithm combined with information entropy, ordering terms by their distribution within the document, and XORing the hash generated for each feature term with the term's position;
after the improved weight calculation, a weight threshold Wt is introduced and text distribution information is incorporated, so that the final fingerprint reflects the proportion of key information, and the relationship between fingerprint information and weights is analyzed.
2. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that the method specifically comprises:
Step 1, initialization:
determining the Simhash bit length f and the f-dimensional vector space according to the data set size and storage cost, and initializing an f-bit binary number S to 0;
Step 2, document preprocessing:
performing word segmentation and stop-word removal on the document to form the document's set of feature terms M = { };
Step 3, weight calculation:
calculating the TF-IDF value and the left/right information entropy of each feature term obtained by segmentation, taking the root mean square of the TF-IDF value and the entropy value as the final weight of the feature term, and introducing a threshold Wt to prevent the document features from being tampered with;
Step 4, hash calculation:
performing a hash calculation on each feature term from Step 2, and introducing a position factor that is XORed with the hash as the feature term's final hash value, so that the hash value carries the position information of the feature term, denoted H = { }, where each element is the hash() of the corresponding feature term;
Step 5, accumulation:
over the f bit positions of the hash values generated in Step 4, accumulating the feature term weights generated in Step 3: at each bit position, adding the weight where the bit is 1 and subtracting it where the bit is 0;
Step 6, compression transform:
for the finally produced second-level fingerprint vector V, performing a conversion on each component to ultimately generate the f-bit hash fingerprint S of the document;
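The pipeline of claim 2 can be sketched as follows. This is a minimal illustrative implementation, not the patented one: the hash function (MD5 truncated to f bits), the way the position is folded in by XOR, and the uniform data layout are all assumptions made for the sake of a runnable example.

```python
import hashlib

def simhash(weighted_terms, f=64):
    """Sketch of steps 4-6: weighted_terms is a sequence of (term, weight)
    pairs already weighted and ordered as in steps 2-3."""
    v = [0.0] * f  # second-level fingerprint vector V
    for pos, (term, weight) in enumerate(weighted_terms):
        # step 4: hash the term, then XOR in its position factor
        h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % (1 << f)
        h ^= pos
        # step 5: add the weight where the bit is 1, subtract where it is 0
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    # step 6: compress V into the f-bit fingerprint S (bit set where v > 0)
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Two documents are then compared by the Hamming distance of their fingerprints, e.g. `bin(simhash(a) ^ simhash(b)).count("1")`, against the threshold used in the experiments.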
3. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that n keywords {p1, p2, p3, …, pn} are extracted from the document, the weight of each keyword being W = {w1, w2, w3, …, wn}; a hash value is generated for each of the n keywords, with result H = {h1, h2, h3, …, hn}; after superposition the second-level fingerprint F = {f1, f2, f3, …, fm} is generated, where m is the number of fingerprint bits, and the Simhash fingerprint S is finally generated according to whether each fi in F is greater than 0;
if there exists a feature word pk whose weight satisfies
wk ≫ wj, j ∈ [1, n], j ≠ k,
then S is determined by pk.
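The observation in claim 3 — that one feature word with overwhelmingly large weight determines the entire fingerprint — can be demonstrated directly. The hash function and bit width below are illustrative assumptions, but the conclusion holds for any hash: when one weight exceeds the sum of all the others, each accumulated component takes the sign of the dominant term's bit.

```python
import hashlib

def bits(term, f=16):
    """f-bit hash of a term as a list of bits (illustrative hash choice)."""
    h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
    return [(h >> i) & 1 for i in range(f)]

def fingerprint(terms_weights, f=16):
    """Weighted Simhash superposition followed by the sign step."""
    v = [0.0] * f
    for term, w in terms_weights:
        for i, b in enumerate(bits(term, f)):
            v[i] += w if b else -w
    return [1 if x > 0 else 0 for x in v]

# One term whose weight dwarfs the rest: its bit pattern IS the fingerprint
dominant = [("dominant", 1000.0), ("minor1", 1.0), ("minor2", 1.0)]
assert fingerprint(dominant) == bits("dominant")
```

This is exactly why claim 4 introduces the weight threshold Wt: capping the weights prevents a single (possibly tampered-with) feature from dictating the fingerprint.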
4. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that, after the weight threshold is introduced, the weight calculation is as shown in the following formula:
5. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that the information entropy is:
H(X) = -∑(xi∈X) P(xi) log2 P(xi)
where X denotes the information probability space X = (x1: P(x1), x2: P(x2), …, xn: P(xn)), and H(X) is a measure of the uncertainty of the random variable X.
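The entropy of claim 5 is standard Shannon entropy over a probability distribution; a minimal sketch:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum P(xi) * log2 P(xi); probs should sum to 1.
    Zero-probability outcomes contribute nothing to the sum."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform 4-outcome source is maximally uncertain: 2 bits
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```

A certain outcome (probability 1) gives entropy 0, the minimum; the uniform distribution gives the maximum, matching the "uncertainty measure" reading above.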
6. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that the information entropy includes the left and right information entropy, with the formula:
EL(W) = -∑a P(aW|W) log2 P(aW|W)
where W denotes a word, EL(W) denotes the left entropy of the word, P(aW|W) denotes the probability of different words appearing on the left side of the word, the variable a ranging over the vocabulary that combines with W; ER(W), the right entropy, is defined symmetrically.
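A sketch of left-entropy computation under the definitions above. The corpus representation (a plain string) and neighbour granularity (single characters to the left of each occurrence of W) are illustrative assumptions; right entropy is the mirror image over right-hand neighbours.

```python
import math
from collections import Counter

def left_entropy(corpus: str, word: str) -> float:
    """E_L(W) = -sum_a P(aW|W) log2 P(aW|W), estimated by counting the
    distinct characters immediately left of each occurrence of word."""
    lefts = Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            lefts[corpus[start - 1]] += 1
        start = corpus.find(word, start + 1)
    total = sum(lefts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in lefts.values())

# "x" occurs four times with four distinct left neighbours -> 2 bits
print(left_entropy("axbxcxdx", "x"))  # -> 2.0
```

High left/right entropy means the word combines freely with many neighbours, i.e. it is a well-formed independent term: this is the signal the entropy factor feeds into the weight.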
7. The method of the improved Simhash algorithm in text deduplication according to claim 1, characterized in that the entropy-weighted calculation method comprises:
averaging the left and right information entropy of a feature word, with Hk(w) denoting the entropy information content of the word; adding the entropy factor Hk to the weight calculation formula and taking the root mean square of the two as the word weight, as follows:
the more often a feature word tk appears in document dj, and the fewer documents in the training set contain that word, the greater its information content and hence the higher its weight.
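The exact weight formula is not reproduced in this text, but a plausible reading of "root mean square of the two" in claim 7 is the RMS of the TF-IDF value and the entropy factor Hk(w); the sketch below encodes that assumption and should not be taken as the patent's definitive formula.

```python
import math

def combined_weight(tfidf: float, entropy_factor: float) -> float:
    """Root mean square of the TF-IDF value and the entropy factor Hk(w)
    (assumed interpretation of claim 7's 'square mean of the two')."""
    return math.sqrt((tfidf ** 2 + entropy_factor ** 2) / 2)

print(combined_weight(3.0, 4.0))  # sqrt((9 + 16) / 2) = sqrt(12.5) ≈ 3.536
```

The RMS keeps the weight high whenever either signal is strong, so a term that is both statistically salient (TF-IDF) and contextually independent (entropy) ranks highest.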
8. A control system of the improved Simhash algorithm in text deduplication, implementing the method of the improved Simhash algorithm in text deduplication according to claim 1.
9. A redundant-data-removal storage medium of the improved Simhash algorithm in text deduplication, implementing the method of the improved Simhash algorithm in text deduplication according to claim 1.
CN201910225442.9A 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication Active CN109948125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910225442.9A CN109948125B (en) 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication

Publications (2)

Publication Number Publication Date
CN109948125A true CN109948125A (en) 2019-06-28
CN109948125B CN109948125B (en) 2020-12-08

Family

ID=67011555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910225442.9A Active CN109948125B (en) 2019-03-25 2019-03-25 Method and system for improved Simhash algorithm in text deduplication

Country Status (1)

Country Link
CN (1) CN109948125B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373751A1 (en) * 2017-06-21 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing a low-quality news resource, computer device and readable medium
CN109241277A (en) * 2018-07-18 2019-01-18 北京航天云路有限公司 The method and system of text vector weighting based on news keyword


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Baofu, Shi Huaji, Ma Suqin: "Research on the Improvement of a TF-IDF-Based Text Feature Weighting Method", Computer Applications and Software *
Wang Cheng, Wang Yucheng: "Research on an Improved Simhash-Based Algorithm for Large-Scale Document Deduplication", Computer Technology and Development *
Xing Enjun, Zhao Fuqiang: "A New Word Discovery Method Based on Contextual Word-Frequency and Vocabulary Metrics", Computer Applications and Software *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442679A (en) * 2019-08-01 2019-11-12 信雅达系统工程股份有限公司 A kind of text De-weight method based on Fusion Model algorithm
CN110750731A (en) * 2019-09-27 2020-02-04 成都数联铭品科技有限公司 Duplicate removal method and system for news public sentiment
CN110750731B (en) * 2019-09-27 2023-10-27 成都数联铭品科技有限公司 Method and system for removing duplicate of news public opinion
WO2021086710A1 (en) * 2019-10-29 2021-05-06 EMC IP Holding Company LLC Capacity reduction in a storage system
US11068208B2 (en) 2019-10-29 2021-07-20 EMC IP Holding Company LLC Capacity reduction in a storage system
CN110837555A (en) * 2019-11-11 2020-02-25 苏州朗动网络科技有限公司 Method, equipment and storage medium for removing duplicate and screening of massive texts
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113300830A (en) * 2021-05-25 2021-08-24 湖南遥昇通信技术有限公司 Data transmission method, device and storage medium based on weighted probability model
CN113300830B (en) * 2021-05-25 2022-05-27 湖南遥昇通信技术有限公司 Data transmission method, device and storage medium based on weighted probability model
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116932526A (en) * 2023-09-19 2023-10-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information
CN116932526B (en) * 2023-09-19 2023-11-24 天泽智慧科技(成都)有限公司 Text deduplication method for open source information

Also Published As

Publication number Publication date
CN109948125B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN109948125A (en) Method and system of the improved Simhash algorithm in text duplicate removal
US11727243B2 (en) Knowledge-graph-embedding-based question answering
US10346257B2 (en) Method and device for deduplicating web page
Tolias et al. Visual query expansion with or without geometry: refining local descriptors by feature aggregation
Chen et al. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
US20100088295A1 (en) Co-location visual pattern mining for near-duplicate image retrieval
Zheng et al. Fast image retrieval: Query pruning and early termination
Mariello et al. Feature selection based on the neighborhood entropy
CN103617157A (en) Text similarity calculation method based on semantics
Zhou et al. Online video recommendation in sharing community
Sikdar et al. MODE: multiobjective differential evolution for feature selection and classifier ensemble
Mi et al. Face recognition using sparse representation-based classification on k-nearest subspace
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
Luckner et al. Stable web spam detection using features based on lexical items
US11860953B2 (en) Apparatus and methods for updating a user profile based on a user file
CN105183792B (en) Distributed fast text classification method based on locality sensitive hashing
Sarwar et al. An effective and scalable framework for authorship attribution query processing
Huang et al. Two efficient hashing schemes for high-dimensional furthest neighbor search
Chiang et al. The Chinese text categorization system with association rule and category priority
Zahedi et al. Improving text classification performance using PCA and recall-precision criteria
Abbasi Intelligent feature selection for opinion classification
Caragea et al. Combining hashing and abstraction in sparse high dimensional feature spaces
JP6017277B2 (en) Program, apparatus and method for calculating similarity between contents represented by set of feature vectors
Tschuggnall et al. Reduce & attribute: Two-step authorship attribution for large-scale problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230707

Address after: 518000 room 321, building 2, Nanke Chuangyuan Valley, Taoyuan community, Dalang street, Longhua District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen lizhuan Technology Transfer Center Co.,Ltd.

Address before: 610225, No. 24, Section 1, Xuefu Road, Southwest Economic Development Zone, Chengdu, Sichuan

Patentee before: CHENGDU University OF INFORMATION TECHNOLOGY

Effective date of registration: 20230707

Address after: 201400 Floor 2, No. 2900, Nanxinggang Road, Fengxian District, Shanghai

Patentee after: Shanghai Yiju Technology Co.,Ltd.

Address before: 518000 room 321, building 2, Nanke Chuangyuan Valley, Taoyuan community, Dalang street, Longhua District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen lizhuan Technology Transfer Center Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20231226

Address after: No. 135, Hongtian Junyue Zhongchuang Space, 1st Floor, Unit 9A, Bund Center, No. 88 Duhai Road, Binhai Street, Longhua District, Haikou City, Hainan Province, 570100

Patentee after: Haikou Lingjie Information Technology Co.,Ltd.

Address before: 201400 Floor 2, No. 2900, Nanxinggang Road, Fengxian District, Shanghai

Patentee before: Shanghai Yiju Technology Co.,Ltd.