CN104392002B - Near-duplicate search method for large-scale web page collections

Info

Publication number: CN104392002B
Application number: CN201410779353.6A
Authority: CN
Inventors: 张鹏; 熊翠文; 刘庆云; 杨嵘; 郑超; 刘俊朋; 李舒
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2014-12-15
Filing date: 2014-12-15
Publication date: 2017-09-26
Anticipated expiration: 2034-12-15
Also published as: CN104392002A

Abstract

The present invention relates to a near-duplicate search method for large-scale web page collections. The point signatures of documents are used to filter noise out of web page content, and the near-duplicate search is completed by combining partitioning into subregions with inverted-index pruning, which makes the search efficient. Because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.

Description

Near-duplicate search method for large-scale web page collections

Technical field

The present invention relates to information retrieval technology, and in particular to a method for finding near-duplicates of core content in large-scale web page collections.

Background technology

With the rapid development of the Internet, the volume of data stored online keeps growing, and search engines have become the main means of quickly finding the data one wants. Although search engines eliminate duplicate pages when returning search results, near-duplicate pages keep appearing in the corpus. To obtain more accurate and valuable results, these near-duplicate pages must be found and removed. However, traditional near-duplicate search methods have many shortcomings: shingling methods compute Jaccard overlap over sets of shingles or n-grams, which is too expensive; fingerprinting methods select bit strings based on document offsets, which incurs a large overhead; and locality-sensitive hashing methods map multiple signatures to a single hash value, which adds the overhead of signature extraction and hashing.

Summary of the invention

Technical problem solved by the present invention: to overcome the deficiencies of the prior art by providing a near-duplicate search method for large-scale web page collections. The method uses the point signatures of documents to filter noise out of web page content and completes the near-duplicate search by combining partitioning into subregions with inverted-index pruning, so that the search is efficient; because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.

Technical solution of the present invention: a near-duplicate search method for large-scale web page collections, whose process is shown in Fig. 1:

In step 0, start executing the method and go to step 1;

In step 1, input the document vectors carrying weighted point signatures, the subregion k with bounds [p_k, p_{k+1}), and the inverted lists (which are empty at this point), then go to step 2; [p_k, p_{k+1}) is a range of document lengths, a half-open interval closed on the left and open on the right, where p_k is the lower bound of the interval and p_{k+1} is the upper bound;

In step 2, initialize the pair set that stores the results to empty, then go to step 3;

In step 3, determine whether the sequence of documents, randomly ordered with respect to length, contains a document vector d_i that has not yet been processed; if so, go to step 4, otherwise go to step 22;

In step 4, set the lower bound p_k of subregion k to the length of document vector d_i, then go to step 5;

In step 5, sort all point signatures of d_i in subregion k in ascending order of frequency, so that the point signature with the lowest frequency comes first, then go to step 6;

In step 6, set the first auxiliary bound delta₁ to 0 and set the set of already-checked document vectors to empty, then go to step 7;

In step 7, determine whether d_i has a point signature s_ij that has not yet been processed, where j is the position of the point signature in d_i; if so, go to step 8, otherwise go to step 20;

In step 8, insert the document vectors in subregion k that contain s_ij into the inverted list list_kj, sorted in descending order of document length, then go to step 9;

In step 9, set the second auxiliary bound delta₂ to 0, then go to step 10;

In step 10, determine whether list_kj, taken in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;

In step 11, set delta₂ to the difference between the lengths of document vectors d_i and d_i' (the length of d_i minus the length of d_i'), then go to step 12;

In step 12, determine whether document vectors d_i and d_i' are equal or d_i' has already been checked; if either holds, go to step 10, otherwise go to step 13;

In step 13, determine whether delta₂ is less than 0 and delta₁ - delta₂ is greater than (1-t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;

In step 14, determine whether delta₂ is greater than or equal to 0 and delta₁ + delta₂ is greater than (1-t) times the length of d_i; if so, go to step 18, otherwise go to step 15;

In step 15, determine whether the Jaccard similarity of document vectors d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;

In step 16, add <d_i, d_i'> to the result pair set, then go to step 17; <d_i, d_i'> is the document pair formed by document vectors d_i and d_i': the two match when their Jaccard similarity is greater than or equal to the threshold, in which case they are added to the pair set in the form <d_i, d_i'>; otherwise they are not added;

In step 17, add d_i' to the set of already-checked document vectors, then go to step 10;

In step 18, increase delta₁ by the frequency of point signature s_ij in document vector d_i, then go to step 19;

In step 19, determine whether delta₁ is greater than (1-t) times the maximum length of the document vectors in subregion k that have not yet been checked; if so, go to step 20, otherwise go to step 7;

In step 20, determine whether the difference between the upper bound of the subregion and the length of document vector d_i is less than or equal to (1-t) times the upper bound of the subregion; if so, go to step 21, otherwise go to step 3; t is the threshold of the Jaccard similarity, chosen according to the actual situation and typically between 0.5 and 1;

In step 21, set the interval upper bound p_{k+1} as the interval lower bound p_k, iterating to the next subregion, then go to step 6;

In step 22, return the result pair set, then go to step 23;

In step 23, end the program.
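The length checks in steps 11, 13 and 14 can be sketched as a small helper. This is an illustrative reading, not the patent's implementation: the function name `length_check` and its return labels are assumptions made here. One way to understand the checks is that a frequency-weighted Jaccard similarity can never exceed the ratio of the shorter document length to the longer one, so pairs whose lengths differ by more than (1-t) times the longer length cannot reach the threshold t.

```python
def length_check(len_i, len_j, delta1, t):
    """Decide what to do with candidate d_i' (length len_j) for document d_i (length len_i).

    Hypothetical helper mirroring steps 11, 13 and 14:
      "skip"    -> go back to step 10 and take the next candidate
      "stop"    -> leave the current inverted list (step 18)
      "compare" -> compute the Jaccard similarity (step 15)
    """
    delta2 = len_i - len_j  # step 11: difference of the two document lengths
    if delta2 < 0 and delta1 - delta2 > (1 - t) * len_j:
        return "skip"       # step 13: the candidate is too long to match
    if delta2 >= 0 and delta1 + delta2 > (1 - t) * len_i:
        return "stop"       # step 14: this candidate is too short to match
    return "compare"


# Lengths 13 and 14 with t = 0.8 pass both checks, so the pair is compared.
print(length_check(13, 14, 0, 0.8))
```

Because each inverted list is sorted by descending document length, a "stop" result also rules out every later candidate in the same list, which is presumably why step 14 leaves the list instead of moving to the next candidate.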

Compared with the prior art, the present invention has the following advantages:

(1) Existing near-duplicate web page detection methods fall mainly into three kinds: shingling algorithms, fingerprinting, and locality-sensitive hashing. Shingling algorithms compute Jaccard overlap over shingles or n-grams, which is too expensive, and the "super shingles" improvement introduced on this basis still reduces the precision of the results by a small margin. Fingerprinting extracts the hash values of representative words or whole sentences in a document and concatenates their bits into a fingerprint used to judge document similarity, but selecting the bit strings incurs a large overhead. Locality-sensitive hashing concatenates multiple signatures from each data object into a single hash value through independent hash functions, which adds the overhead of signature extraction and of multiple hash computations.

(2) The present invention uses short chains formed by linking stop-word antecedents with adjacent content terms as document signatures, so signature extraction is fast and cheap. When computing Jaccard similarity, set partitioning narrows the scope of computation and avoids computing the similarity between document vectors that are obviously dissimilar, and inverted-index pruning eliminates redundant computation and speeds up the procedure. The method of the present invention is therefore accurate and fast, with low computational complexity.

Brief description of the drawings

Fig. 1 is a flow chart of the implementation of the method of the present invention;

Fig. 2 is an execution diagram of the point-signature near-duplicate search method of the present invention.

Embodiment

Before the present invention is described, the related concepts are first explained.

Point (p): a word that occurs frequently in documents, known as a stop word in natural-language text, such as is, the, do, have.

Ordinary word: a word in a document that is not a point.

Point distance (d): the number of words between an ordinary (non-point) word and the selected point preceding it, or between an ordinary word and an ordinary word preceding it; points are not counted. For example, in "a rally to kick", the words "a" and "to" both occur frequently in documents and can serve as points. The point distance between "rally" and "a" is 1, between "kick" and "a" is 2, between "kick" and "rally" is 1 (because "to", being a point, is not counted), and between "kick" and "to" is 1.

Chain length (c): the number of ordinary (non-point) words in a document that satisfy the point distance relative to a given point. For example, in "a rally to kick", the chain length relative to the point "a" with point distance 1 is 2.

Point signature (s): a chain of c consecutive ordinary (non-point) words found relative to a point, where adjacent words in the chain are separated by point distance d and the first word in the chain is at distance d from the point. It is written p(d, c), i.e. point signature = point(point distance, chain length). For example, in "a rally to kick", with "a" selected as the point, point distance d = 1 and chain length c = 2, the point signature s = "a"(1,2) has the value {"a":"rally":"kick"}.

Point signature frequency: the number of times a point signature occurs in a document. For example, in "a rally to kick off a weeklong campaign", the point signature s = "a"(1,2) occurs 2 times, so its frequency is 2.

Point signature weight: the point signature frequency.

Document length (|d|): the sum of the frequencies of all point signatures in a document. For example, in "a rally to kick off a weeklong campaign", the point signature s₁ = "a"(1,2) occurs 2 times and the point signature s₂ = "to"(1,2) occurs 1 time, so the document length is 2 + 1 = 3.

Document vector (d): the representation of a document as a sequence of point signatures with their frequencies, i.e. document vector = {point signature 1: frequency of point signature 1, point signature 2: frequency of point signature 2, ...}. For example, d = "a rally to kick off a weeklong campaign" is represented as the document vector d = {s₁:2, s₂:1}, where s₁ = "a"(1,2) and s₂ = "to"(1,2), and the document length is |d| = 3.
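The definitions of point signature, frequency, document length and document vector can be illustrated with a short sketch. It rests on assumptions that the text does not state: documents are lowercased and split on whitespace, a signature is given as a hypothetical `(point, d, c)` tuple, and the set of points is taken to be exactly the points named in the signatures. The function names are illustrative only.

```python
from collections import Counter


def extract_chains(tokens, points, spec):
    """Return every chain matching the signature spec (point, d, c) in a token list."""
    point, dist, chain_len = spec
    chains = []
    for pos, tok in enumerate(tokens):
        if tok != point:
            continue
        # Points are not counted, so only ordinary words after the point are kept.
        ordinary = [w for w in tokens[pos + 1:] if w not in points]
        # Take every dist-th ordinary word, chain_len words in total.
        picked = ordinary[dist - 1: dist * chain_len: dist]
        if len(picked) == chain_len:
            chains.append((point, *picked))
    return chains


def document_vector(text, specs):
    """Map each signature spec to the number of times it occurs in the text."""
    tokens = text.lower().split()
    points = {p for p, _, _ in specs}
    vec = Counter()
    for spec in specs:
        count = len(extract_chains(tokens, points, spec))
        if count:
            vec[spec] = count
    return vec


s1, s2 = ("a", 1, 2), ("to", 1, 2)
d = document_vector("a rally to kick off a weeklong campaign", [s1, s2])
print(dict(d), sum(d.values()))  # document vector and document length
```

Under these assumptions the sketch yields frequency 2 for "a"(1,2), frequency 1 for "to"(1,2) and document length 3, matching the example in the definitions.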

Subregion k: a range of document lengths, given as a half-open interval whose lower bound is fixed and whose upper bound is not fixed. [p_k, p_{k+1}) is the range of document lengths, closed on the left and open on the right, where p_k is the lower bound of the interval and p_{k+1} is the upper bound.

Inverted list (list): the list of document vectors that share the same point signature, i.e. inverted list = {document vector 1, document vector 2, ...}; each document vector in the list carries the frequency of that signature. For example, let s₁ = "a"(1,2), s₂ = "to"(1,2), s₃ = "the"(2,1); d₁ = "a rally to kick off a weeklong campaign", d₁ = {s₁:2, s₂:1}; d₂ = "a biological argument for the cougar phenomenon", d₂ = {s₁:1, s₃:1}; d₃ = "the past can be corrected", d₃ = {s₃:1}. Then list₁ = {d₁:2, d₂:1} is the inverted list of point signature s₁ = "a"(1,2), list₂ = {d₁:1} is the inverted list of point signature s₂ = "to"(1,2), and list₃ = {d₂:1, d₃:1} is the inverted list of point signature s₃ = "the"(2,1).
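As a rough illustration of the inverted-list definition, the following sketch builds one list per point signature from the three example document vectors and orders each list by descending document length, as step 8 requires. The dictionary layout and the name `build_inverted_lists` are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_lists(docs):
    """Group (document, frequency) entries by signature, longest documents first."""
    lengths = {name: sum(vec.values()) for name, vec in docs.items()}
    lists = defaultdict(list)
    for name, vec in docs.items():
        for sig, freq in vec.items():
            lists[sig].append((name, freq))
    for entries in lists.values():
        # Descending document length, as in step 8 of the method.
        entries.sort(key=lambda entry: lengths[entry[0]], reverse=True)
    return dict(lists)


docs = {
    "d1": {"s1": 2, "s2": 1},
    "d2": {"s1": 1, "s3": 1},
    "d3": {"s3": 1},
}
lists = build_inverted_lists(docs)
for sig in sorted(lists):
    print(sig, lists[sig])
```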

Jaccard similarity (sim): the sum of the smaller of the corresponding signature frequencies in two document vectors divided by the sum of the larger of the corresponding signature frequencies in the two document vectors. For d₁ = {s₁:2, s₂:1} and d₂ = {s₁:1, s₃:1}, sim(d₁,d₂) = (1+0+0)/(2+1+1) = 0.25.
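The frequency-weighted Jaccard similarity defined above can be written in a few lines. This is a minimal sketch that assumes document vectors are plain dictionaries mapping a signature to its frequency, with missing signatures counted as frequency 0; the function name is illustrative.

```python
def weighted_jaccard(a, b):
    """Sum of per-signature minima divided by sum of per-signature maxima."""
    keys = set(a) | set(b)
    smaller = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    larger = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return smaller / larger if larger else 0.0


d1 = {"s1": 2, "s2": 1}
d2 = {"s1": 1, "s3": 1}
print(weighted_jaccard(d1, d2))  # (1+0+0)/(2+1+1)
```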

Threshold (t): the critical value of the Jaccard similarity, chosen according to the actual situation and typically between 0.5 and 1.

Auxiliary bound (delta): a variable used to temporarily record a value.

Pair set {<d_i, d_i'>}: the set of document pairs. Two document vectors d_i and d_i' match when their Jaccard similarity is greater than or equal to the threshold, and are then added to the pair set in the form <document vector 1, document vector 2>, i.e. <d_i, d_i'>; otherwise they do not match and are not added to the pair set.

The present invention is described in detail below with reference to the concepts above.

The execution diagram for near-duplicate removal on a web page collection is shown in Fig. 2. The web document vectors processed in Fig. 2 are d₁ = {s₁:5, s₂:4, s₃:4}, d₂ = {s₁:8, s₂:4} and d₃ = {s₁:4, s₂:5, s₃:5}, whose lengths are 13, 12 and 14 respectively, and the threshold t is set to 0.8. The execution proceeds as follows:

(1) Input the document vectors d₁, d₂ and d₃ carrying weighted point signatures, together with the subregion with bounds [p_k, p_{k+1}) and the inverted lists;

(2) The pair set is set to empty; the document vectors are arranged in random order with respect to length, and d₁ is selected at random to be compared with the other document vectors;

(3) The lower bound of the subregion is set to the length of document vector d₁, which is 13. The frequencies of point signatures s₁, s₂ and s₃ across the three documents are 17, 13 and 9 respectively. The point signatures are sorted in ascending order of frequency, and the list of documents under s₃, the point signature with the lowest frequency, is selected for processing first. The auxiliary bound delta₁ is set to 0 and the set of already-checked document vectors is set to empty;

(4) The document vectors under each point signature are sorted in descending order of document length. First, the document vectors of point signature s₃ are placed in the inverted list and traversed. Document vector d₃ is checked: the auxiliary bound delta₂ = (length of d₁) - (length of d₃) = -1 < 0; d₁ and d₃ are not equal and d₃ has not been checked; delta₁ - delta₂ (= 1) is less than (1-0.8) × 14 (= 2.8), and delta₁ + delta₂ (= -1) is less than (1-0.8) × 13 (= 2.6). The Jaccard similarity of d₁ and the document vector d₃ under point signature s₃ is therefore computed: sim(d₁,d₃) = (4+4+4)/(5+5+5) = 0.8 ≥ t, so <d₁,d₃> is added to the pair set;

(5) Document vector d₁ in the inverted list of point signature s₃ is checked next; d₁ is equal to d₁, so nothing is computed. delta₁ is increased by the frequency of point signature s₃ in d₁, giving delta₁ = 4, and this traversal ends. The document vectors of the second point signature, s₁, are then traversed: delta₁ is set to 0 and the document vectors of s₁ are placed in the list. Document vector d₂ is checked: the auxiliary bound delta₂ = (length of d₁) - (length of d₂) = 1 > 0; d₁ and d₂ are not equal and d₂ has not been checked; delta₁ - delta₂ (= -1) is less than (1-0.8) × 12 (= 2.4), and delta₁ + delta₂ (= 1) is less than (1-0.8) × 13 (= 2.6). The Jaccard similarity of d₁ and the document vector d₂ under point signature s₁ is therefore computed: sim(d₁,d₂) = (5+4)/(8+4+4) = 0.5625 < t, so <d₁,d₂> is not added to the pair set;

(6) Traversal of the inverted list continues: in the list, d₁ is equal to d₁ and d₃ has already been checked. delta₁ is increased by the frequency of point signature s₁ in d₁, giving delta₁ = 5, and this traversal ends. The document vectors of the third point signature, s₂, are then traversed: delta₁ is set to 0; in the list, d₁ is equal to d₁, and d₃ and d₂ have already been checked, so this traversal ends. The pair set {<d₁,d₃>} is returned and the program ends.
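The numbers in this worked example can be cross-checked with a brute-force sketch. It is deliberately not the procedure of Fig. 1: it compares every pair of documents, keeps only a simple length filter, and omits the inverted lists, the delta₁ bookkeeping and the subregion iteration. All names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def weighted_jaccard(a, b):
    """Sum of per-signature minima divided by sum of per-signature maxima."""
    keys = set(a) | set(b)
    smaller = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    larger = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return smaller / larger if larger else 0.0


def near_duplicates(docs, t):
    """Brute-force pair search with a length filter (a simplification)."""
    lengths = {name: sum(vec.values()) for name, vec in docs.items()}
    pairs = []
    for (ni, vi), (nj, vj) in combinations(docs.items(), 2):
        shorter, longer = sorted((lengths[ni], lengths[nj]))
        if longer - shorter > (1 - t) * longer:
            continue  # lengths too different to reach similarity t
        if weighted_jaccard(vi, vj) >= t:
            pairs.append((ni, nj))
    return pairs


docs = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},
    "d2": {"s1": 8, "s2": 4},
    "d3": {"s1": 4, "s2": 5, "s3": 5},
}
lengths = {name: sum(vec.values()) for name, vec in docs.items()}
totals = Counter()
for vec in docs.values():
    totals.update(vec)  # adds up the frequency of each signature over all documents
rarest = min(totals, key=totals.get)

print(lengths, dict(totals), rarest)
print(near_duplicates(docs, 0.8))
```

With these inputs the sketch gives document lengths 13, 12 and 14, signature totals 17, 13 and 9, s₃ as the least frequent signature, and the single matching pair (d₁, d₃), in agreement with the walkthrough.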

The above embodiment is provided only to describe the present invention and is not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the present invention shall fall within the scope of the present invention.

Claims

1. A near-duplicate search method for large-scale web page collections, characterized in that it is implemented by the following steps:

Step 1, input the document vectors carrying weighted point signatures, the subregion k with bounds [p_k, p_{k+1}), and the inverted lists, then go to step 2; [p_k, p_{k+1}) is a range of document lengths, a half-open interval, where p_k is the lower bound of the interval and p_{k+1} is the upper bound;

Step 2, initialize the pair set that stores the results to empty, then go to step 3; in step 3, determine whether the sequence of documents, randomly ordered with respect to length, contains a document vector d_i that has not yet been processed; if so, go to step 4, otherwise go to step 22;

In step 5, sort all point signatures of document vector d_i in subregion k in ascending order of frequency, so that the point signature with the lowest frequency comes first, then go to step 6;

In step 6, set the first auxiliary bound delta₁ to 0 and set the set of already-checked document vectors to empty, then go to step 7;

In step 7, determine whether document vector d_i has a point signature s_ij that has not yet been processed, where j is the position of the point signature in document vector d_i; if so, go to step 8, otherwise go to step 20;

In step 8, insert the document vectors in subregion k that contain point signature s_ij into the inverted list list_kj, sorted in descending order of document length, then go to step 9;

In step 10, determine whether the inverted list list_kj, taken in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;

In step 11, set the second auxiliary bound delta₂ to the difference between the lengths of document vectors d_i and d_i', then go to step 12;

In step 12, determine whether document vectors d_i and d_i' are equal or document vector d_i' has already been checked; if either holds, go to step 10, otherwise go to step 13;

In step 13, determine whether delta₂ is less than 0 and delta₁ - delta₂ is greater than (1-t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;

In step 14, determine whether the second auxiliary bound delta₂ is greater than or equal to 0 and delta₁ + delta₂ is greater than (1-t) times the length of d_i; if so, go to step 18, otherwise go to step 15;

In step 15, determine whether the Jaccard similarity of document vectors d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;

In step 16, add <d_i, d_i'> to the result pair set, then go to step 17; <d_i, d_i'> is the document pair formed by document vectors d_i and d_i': the two match when their Jaccard similarity is greater than or equal to the threshold, in which case they are added to the pair set in the form <d_i, d_i'>; otherwise they are not added to the pair set;

In step 19, determine whether delta₁ is greater than (1-t) times the maximum length of the document vectors in subregion k that have not yet been checked; if so, go to step 20, otherwise go to step 7;

In step 20, determine whether the difference between the upper bound of the subregion and the length of document vector d_i is less than or equal to (1-t) times the upper bound of the subregion; if so, go to step 21, otherwise go to step 3; t is the threshold of the Jaccard similarity, chosen according to the actual situation, with a value range of 0.5 to 1;

In step 21, set the interval upper bound p_{k+1} as the interval lower bound p_k, iterating to the next subregion, then go to step 6;

In step 22, return the result pair set, then go to step 23;

In step 23, end the program.