CN104392002B - Near-duplicate detection method for large-scale web page collections - Google Patents
Near-duplicate detection method for large-scale web page collections
- Publication number
- CN104392002B (application CN201410779353.6A)
- Authority
- CN
- China
- Prior art keywords
- perform
- turn
- document
- document vector
- delta
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a near-duplicate detection method for large-scale web page collections. The method uses the point signatures of a document to filter out noise in web page content and completes near-duplicate detection by combining partitioning with inverted-index pruning, so that detection is efficient. Because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.
Description
Technical field
The present invention relates to information retrieval technology, and in particular to a method for near-duplicate detection of core content in large-scale web page collections.
Background
With the rapid development of the Internet, the volume of data stored online keeps growing, and search engines have become the main means of quickly obtaining the data one wants. Although search engines eliminate duplicate pages when returning search results, near-duplicate pages keep appearing in the corpus. To obtain more accurate and valuable results, these near-duplicate pages must be found and removed. Traditional near-duplicate detection methods, however, have many shortcomings: shingling methods compute Jaccard overlap over shingles or n-gram subsets, which is too complex; fingerprint methods select bit strings based on document offsets, which incurs a large overhead; and locality-sensitive hashing maps multiple signatures to a single hash value, which adds the overhead of signature extraction and hashing.
Summary of the invention
The technical problem solved by the present invention: to overcome the deficiencies of the prior art by providing a near-duplicate detection method for large-scale web page collections. The method uses the point signatures of a document to filter out noise in web page content and completes near-duplicate detection by combining partitioning with inverted-index pruning, so that detection is efficient; because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.
The technical solution of the present invention: a near-duplicate detection method for large-scale web page collections, whose process is shown in Fig. 1:
In step 0, execution of the method starts; go to step 1;
In step 1, input the document vectors carrying weighted point signatures, the partition k with boundaries [p_k, p_{k+1}), and the inverted list (which is empty at this point); go to step 2. [p_k, p_{k+1}) is a range of document lengths and is a half-open interval, where p_k is the lower boundary of the interval and p_{k+1} is the upper boundary;
In step 2, initialize the pair set that stores the result to empty; go to step 3;
In step 3, determine whether the sequence of document vectors, in random order of document length, contains a document vector d_i that has not yet been processed; if so, go to step 4, otherwise go to step 22;
In step 4, set the lower boundary p_k of partition k to the length of document vector d_i; go to step 5;
In step 5, sort all point signatures of d_i in partition k in ascending order of frequency, so that the point signature with the lowest frequency comes first; go to step 6;
In step 6, set the first auxiliary bound delta1 to 0 and set the set of checked document vectors to empty; go to step 7;
In step 7, determine whether the point signatures of d_i include a point signature s_ij that has not yet been processed, where j is the position of the point signature in d_i; if so, go to step 8, otherwise go to step 20;
In step 8, insert the document vectors in partition k that are associated with s_ij into the inverted list list_kj, sorted in descending order of document length; go to step 9;
In step 9, set the second auxiliary bound delta2 to 0; go to step 10;
In step 10, determine whether list_kj, traversed in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;
In step 11, set delta2 to the difference between the lengths of document vectors d_i and d_i' (the length of d_i minus the length of d_i'); go to step 12;
In step 12, determine whether document vectors d_i and d_i' are equal or document vector d_i' has already been checked; if so, go to step 10, otherwise go to step 13;
In step 13, determine whether delta2 is less than 0 and delta1 - delta2 is greater than (1-t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;
In step 14, determine whether delta2 is greater than or equal to 0 and delta1 + delta2 is greater than (1-t) times the length of d_i; if so, go to step 18, otherwise go to step 15;
In step 15, determine whether the Jaccard similarity of document vectors d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;
In step 16, add <d_i, d_i'> to the result pair set; go to step 17. <d_i, d_i'> is the document pair formed by document vectors d_i and d_i'; the two match when their Jaccard similarity is greater than or equal to the threshold, in which case the pair is added to the pair set in the form <d_i, d_i'>; otherwise it is not added;
In step 17, add d_i' to the set of checked document vectors; go to step 10;
In step 18, increase delta1 by the frequency of point signature s_ij in document vector d_i; go to step 19;
In step 19, determine whether delta1 is greater than (1-t) times the maximum length of the document vectors in partition k that have not yet been checked; if so, go to step 20, otherwise go to step 7;
In step 20, determine whether the difference between the partition upper bound and the length of document vector d_i is less than or equal to (1-t) times the partition upper bound; if so, go to step 21, otherwise go to step 3. t is the threshold of the Jaccard similarity; its value depends on the actual situation and is typically between 0.5 and 1;
In step 21, take the interval upper boundary p_{k+1} as the interval lower boundary p_k, moving on to the next partition; go to step 6;
In step 22, return the result pair set; go to step 23;
In step 23, the whole procedure ends.
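The length tests of steps 11, 13 and 14 can be sketched as a small helper. This is only an illustrative reading of those steps, not the patent's implementation: the function name `length_filter` and its return labels are hypothetical. The tests rest on the fact that a weighted Jaccard similarity can never exceed the ratio of the shorter document length to the longer one, so candidates whose lengths differ too much can be discarded without computing the similarity.

```python
def length_filter(len_di, len_dj, delta1, t):
    """Apply the length tests of steps 11, 13 and 14 to one candidate.

    len_di -- length of the current document vector d_i
    len_dj -- length of the candidate d_i' taken from the inverted list
    delta1 -- first auxiliary bound accumulated so far
    t      -- Jaccard similarity threshold
    The return labels are hypothetical names chosen for this sketch.
    """
    delta2 = len_di - len_dj  # step 11
    if delta2 < 0 and delta1 - delta2 > (1 - t) * len_dj:
        return "skip"     # step 13: candidate too long, back to step 10
    if delta2 >= 0 and delta1 + delta2 > (1 - t) * len_di:
        return "stop"     # step 14: list is in descending length, go to step 18
    return "compare"      # step 15: compute the Jaccard similarity


print(length_filter(13, 14, 0, 0.8))
print(length_filter(20, 10, 0, 0.8))
```

Because the inverted list is sorted in descending order of document length, a "stop" outcome lets the traversal abandon the rest of the list, which is where the pruning saves work.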
Compared with the prior art, the advantages of the present invention are:
(1) Existing near-duplicate page detection methods fall mainly into three kinds: shingling algorithms, fingerprint methods, and locality-sensitive hashing. Shingling algorithms compute Jaccard overlap over shingles or n-grams, and their complexity is too high; the "super shingles" improvement built on them still slightly reduces the precision of the results. Fingerprint methods extract the hash values of representative words or whole sentences in a document and concatenate their bits into a fingerprint feature used to judge document similarity, but bit-string selection incurs a large overhead. Locality-sensitive hashing maps each data object to a single hash value through independent hash functions, which adds the overhead of signature extraction and multiple hashing.
(2) The present invention uses short chains formed by linking a stop-word antecedent with adjacent content terms as document signatures, so signature extraction is fast and cheap. When computing Jaccard similarity, set partitioning narrows the range of computation and avoids computing the similarity between clearly dissimilar document vectors, and inverted-index pruning eliminates redundant computation and speeds up the calculation. The method of the present invention is therefore accurate, fast, and low in computational complexity.
Brief description of the drawings
Fig. 1 is a flowchart of the implementation of the method of the present invention;
Fig. 2 is an execution diagram of the point-signature near-duplicate detection method of the present invention.
Detailed description of the embodiments
Before the present invention is described, the related concepts are first explained.
Point (p): a word that occurs frequently in documents, known as a stop word in natural-language text, such as is, the, do, have.
Ordinary word: a word in the document that is not a point.
Point distance (d): the number of words between an ordinary word in the document and the selected point before it, or between an ordinary word and an ordinary word before it; points are not counted. For example, in "a rally to kick", the words "a" and "to" both occur frequently in documents and can serve as points. The point distance between "rally" and "a" is 1, between "kick" and "a" is 2, between "kick" and "rally" is 1 (because "to", being a point, is not counted), and between "kick" and "to" is 1.
Chain length (c): the number of ordinary words in the document that satisfy the point distance relative to a given point. For example, in "a rally to kick", the chain length relative to point "a" with point distance 1 is 2.
Point signature (s): a linked list of ordinary words with chain length c, used for lookup. The interval between consecutive ordinary words in the list is the point distance d, and the interval between the first ordinary word in the list and the point is also d. It is written p(d, c), i.e. point signature = point(point distance, chain length). For example, in "a rally to kick", with "a" selected as the point, point distance d = 1 and chain length c = 2, the value of point signature s = "a"(1,2) is {"a":"rally":"kick"}.
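The definitions of point, point distance, chain length and point signature can be illustrated with a short sketch. It is one possible reading of those definitions, not the patent's own extraction routine: the function name `point_signature_chains`, the whitespace tokenization, and the small stop-word list are all assumptions made for illustration.

```python
def point_signature_chains(text, point, distance, chain_len, stop_words):
    """Return the chains matching the point signature point(distance, chain_len).

    Assumed reading of the definitions: after each occurrence of the point,
    stop words are ignored, and every `distance`-th ordinary word is linked
    until `chain_len` ordinary words have been collected.
    """
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    chains = []
    for i, token in enumerate(tokens):
        if token != point:
            continue
        # Ordinary words following this occurrence; points are not counted.
        ordinary = [w for w in tokens[i + 1:] if w not in stop_words]
        chain = ordinary[distance - 1::distance][:chain_len]
        if len(chain) == chain_len:  # incomplete chains yield no signature
            chains.append((point, *chain))
    return chains


STOP_WORDS = {"a", "to", "the", "is", "do", "have"}  # assumed stop-word list
print(point_signature_chains("a rally to kick", "a", 1, 2, STOP_WORDS))
```

Under this reading, the number of chains returned for a given point(d, c) pattern is the number of times that point signature occurs in the text.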
Point signature frequency: the number of times a point signature occurs in a document. For example, in "a rally to kick off a weeklong campaign", point signature s = "a"(1,2) occurs 2 times, so its frequency is 2.
Point signature weight: the point signature frequency.
Document length (|d|): the sum of the frequencies of all point signatures in the document. For example, in "a rally to kick off a weeklong campaign", point signature s1 = "a"(1,2) occurs 2 times and point signature s2 = "to"(1,2) occurs 1 time, so the document length is 2 + 1 = 3.
Document vector (d): the representation of a document as a sequence of point signatures with frequencies, i.e. document vector = {point signature 1: frequency of point signature 1, point signature 2: frequency of point signature 2, ...}. For example, d = "a rally to kick off a weeklong campaign" is represented as the document vector d = {s1:2, s2:1}, where s1 = "a"(1,2) and s2 = "to"(1,2), and the document length is |d| = 3.
Partition k: a range of document lengths. It is a half-open interval whose lower bound is fixed and whose upper bound is not. [p_k, p_{k+1}) is the range of document lengths, where p_k is the lower boundary of the interval and p_{k+1} is the upper boundary.
Inverted list (list): the list of document vectors that share the same point signature, i.e. inverted list = {document vector 1, document vector 2, ...}, where each document vector in the list carries the frequency of that signature. For example, let s1 = "a"(1,2), s2 = "to"(1,2) and s3 = "the"(2,1); d1 = "a rally to kick off a weeklong campaign", d1 = {s1:2, s2:1}; d2 = "a biological argument for the cougar phenomenon", d2 = {s1:1, s3:1}; d3 = "the past can be corrected", d3 = {s3:1}. Then list1: {d1:2, d2:1} (the inverted list for point signature s1 = "a"(1,2)), list2: {d1:1} (the inverted list for point signature s2 = "to"(1,2)), and list3: {d2:1, d3:1} (the inverted list for point signature s3 = "the"(2,1)).
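The inverted-list example can be reproduced with a few lines of code. This is a minimal sketch in which document vectors are represented as plain dictionaries; the function name `build_inverted_lists` and the string keys are assumptions for illustration only.

```python
from collections import defaultdict


def build_inverted_lists(docs):
    """Map each point signature to the documents containing it, with frequency."""
    lists = defaultdict(dict)
    for doc_id, vector in docs.items():
        for signature, freq in vector.items():
            lists[signature][doc_id] = freq
    return dict(lists)


# Document vectors from the example in the text.
docs = {
    "d1": {"s1": 2, "s2": 1},
    "d2": {"s1": 1, "s3": 1},
    "d3": {"s3": 1},
}
inverted = build_inverted_lists(docs)
print(inverted)
```

Only documents that share at least one point signature ever appear in the same list, so documents with no signature in common are never compared.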
Jaccard similarity (sim): the sum, over corresponding point signatures, of the smaller of the two frequencies in the two document vectors, divided by the sum of the larger of the two frequencies. For the document vectors d1 and d2 of the inverted-list example, sim(d1, d2) = (1+0+0)/(2+1+1) = 0.25.
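The weighted Jaccard similarity defined here is straightforward to compute. The sketch below assumes document vectors are dictionaries mapping a point signature to its frequency, with a missing signature treated as frequency 0; the function name `jaccard` is illustrative.

```python
def jaccard(a, b):
    """Weighted Jaccard similarity: sum of minima divided by sum of maxima."""
    signatures = set(a) | set(b)
    smaller = sum(min(a.get(s, 0), b.get(s, 0)) for s in signatures)
    larger = sum(max(a.get(s, 0), b.get(s, 0)) for s in signatures)
    return smaller / larger if larger else 0.0


d1 = {"s1": 2, "s2": 1}
d2 = {"s1": 1, "s3": 1}
print(jaccard(d1, d2))  # (1+0+0)/(2+1+1)
```

Returning 0.0 for two empty vectors is a choice made for this sketch; the source does not define that case.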
Threshold value (t):The critical value ratio of Jaccard similarities, depending on value is according to actual conditions, typically 0.5 to 1 it
Between.
Assisted border (delta):Variable for blotter numerical value.
To collection<di,di’>}:I.e. document is to set, two documents vector diAnd di’Jaccard similarities be more than or equal to threshold
Then matched during value, two documents vector with "<Document vector 1, document vector 2>", i.e., "<di,di’>", form is added to concentrating, no
Then mismatch, be added without to collection.
The present invention is described in detail below with reference to the concepts above.
The execution diagram for near-duplicate removal in a web page collection is shown in Fig. 2. The web document vectors computed in Fig. 2 are d1 = {s1:5, s2:4, s3:4}, d2 = {s1:8, s2:4} and d3 = {s1:4, s2:5, s3:5}; the lengths of d1, d2 and d3 are 13, 12 and 14 respectively, and threshold t is set to 0.8. Execution proceeds as follows:
(1) Input the document vectors d1, d2 and d3 carrying weighted point signatures, the partition with boundaries [p_k, p_{k+1}), and the inverted list;
(2) Set the pair set to empty. The document vectors are in random order of length, and d1 is chosen at random to be compared with the other document vectors;
(3) Set the partition lower boundary to the length of document vector d1, which is 13. The frequencies of point signatures s1, s2 and s3 are 17, 13 and 9 respectively. The point signatures are sorted in ascending order of frequency, and the document list under the point signature with the lowest frequency, s3, is selected for computation. Auxiliary bound delta1 is set to 0 and the set of checked document vectors is set to empty;
(4) The document vectors under each point signature are arranged in descending order of document length. First, the document vector set of point signature s3 is traversed and put into the inverted list. Document vector d3 is checked: delta2 = length of d1 - length of d3 = -1 < 0; d1 and d3 are not equal and d3 has not been checked; delta1 - delta2 (= 1) is less than (1-0.8)*14 (= 2.8), and delta1 + delta2 (= -1) is less than (1-0.8)*13 (= 2.6). The Jaccard similarity of d1 and the document vector d3 under point signature s3 is therefore computed: sim(d1, d3) = (4+4+4)/(5+5+5) = 0.8 >= t, and <d1, d3> is added to the pair set;
(5) Document vector d1 in the inverted list of point signature s3 is checked; d1 equals d1, so nothing is computed. delta1 is increased by the frequency of point signature s3 in d1, giving delta1 = 4, and this traversal ends. The document vector set of the second point signature, s1, is then traversed: delta1 is set to 0 and the document vector set of s1 is put into the list. Document vector d2 is checked: delta2 = length of d1 - length of d2 = 1 > 0; d1 and d2 are not equal and d2 has not been checked; delta1 - delta2 (= -1) is less than (1-0.8)*12 (= 2.4), and delta1 + delta2 (= 1) is less than (1-0.8)*13 (= 2.6). The Jaccard similarity of d1 and the document vector d2 under point signature s1 is therefore computed: sim(d1, d2) = (5+4)/(8+4+4) = 0.5625 < t, so <d1, d2> is not added to the pair set;
(6) Traversal of the inverted list continues: in the list, d1 equals d1 and d3 has already been checked. delta1 is increased by the frequency of point signature s1 in d1, giving delta1 = 5, and this traversal ends. The document set of the third point signature, s2, is then traversed: delta1 is set to 0; in the list, d1 equals d1 and d3 and d2 have already been checked, so this traversal ends. The pair set {<d1, d3>} is returned and the program ends.
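The figures quoted in this walkthrough can be cross-checked with a brute-force computation over the three document vectors. The sketch below is not the pruned algorithm of Fig. 1; it simply compares every pair, which is feasible only because the example is tiny. All variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Document vectors and threshold from the walkthrough.
docs = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},
    "d2": {"s1": 8, "s2": 4},
    "d3": {"s1": 4, "s2": 5, "s3": 5},
}
T = 0.8

# Document length: sum of all point signature frequencies.
lengths = {name: sum(vec.values()) for name, vec in docs.items()}

# Total frequency of each point signature across the collection.
totals = Counter()
for vec in docs.values():
    totals.update(vec)

# Point signatures in ascending order of total frequency.
order = sorted(totals, key=totals.get)


def sim(a, b):
    """Weighted Jaccard similarity of two document vectors."""
    sigs = set(a) | set(b)
    smaller = sum(min(a.get(s, 0), b.get(s, 0)) for s in sigs)
    larger = sum(max(a.get(s, 0), b.get(s, 0)) for s in sigs)
    return smaller / larger


pairs = [(x, y) for x, y in combinations(docs, 2) if sim(docs[x], docs[y]) >= T]
print(lengths, dict(totals), order, pairs)
```

The brute-force pass performs three similarity computations for three documents; the partitioning and inverted-list pruning described above aim to avoid most of these comparisons on large collections.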
The above embodiment is provided only for the purpose of describing the present invention and is not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the present invention shall fall within the scope of the present invention.
Claims (1)
1. A near-duplicate detection method for large-scale web page collections, characterized in that it is implemented by the following steps:
Step 1, input the document vectors carrying weighted point signatures, the partition k with boundaries [p_k, p_{k+1}), and the inverted list; go to step 2. [p_k, p_{k+1}) is a range of document lengths and is a half-open interval, where p_k is the lower boundary of the interval and p_{k+1} is the upper boundary;
Step 2, initialize the pair set that stores the result to empty; go to step 3;
In step 3, determine whether the sequence of document vectors, in random order of document length, contains a document vector d_i that has not yet been processed; if so, go to step 4, otherwise go to step 22;
In step 4, set the lower boundary p_k of partition k to the length of document vector d_i; go to step 5;
In step 5, sort all point signatures of document vector d_i in partition k in ascending order of frequency, so that the point signature with the lowest frequency comes first; go to step 6;
In step 6, set the first auxiliary bound delta1 to 0 and set the set of checked document vectors to empty; go to step 7;
In step 7, determine whether the point signatures of document vector d_i include a point signature s_ij that has not yet been processed, where j is the position of the point signature in document vector d_i; if so, go to step 8, otherwise go to step 20;
In step 8, insert the document vectors in partition k that are associated with point signature s_ij into the inverted list list_kj, sorted in descending order of document length; go to step 9;
In step 9, set the second auxiliary bound delta2 to 0; go to step 10;
In step 10, determine whether the inverted list list_kj, traversed in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;
In step 11, set the second auxiliary bound delta2 to the difference between the lengths of document vectors d_i and d_i'; go to step 12;
In step 12, determine whether document vectors d_i and d_i' are equal or document vector d_i' has already been checked; if so, go to step 10, otherwise go to step 13;
In step 13, determine whether delta2 is less than 0 and delta1 - delta2 is greater than (1-t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;
In step 14, determine whether the second auxiliary bound delta2 is greater than or equal to 0 and delta1 + delta2 is greater than (1-t) times the length of d_i; if so, go to step 18, otherwise go to step 15;
In step 15, determine whether the Jaccard similarity of document vectors d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;
In step 16, add <d_i, d_i'> to the result pair set; go to step 17. <d_i, d_i'> is the document pair formed by document vectors d_i and d_i'; the two match when their Jaccard similarity is greater than or equal to the threshold, in which case the pair is added to the pair set in the form <d_i, d_i'>; otherwise it is not added;
In step 17, add d_i' to the set of checked document vectors; go to step 10;
In step 18, increase delta1 by the frequency of point signature s_ij in document vector d_i; go to step 19;
In step 19, determine whether delta1 is greater than (1-t) times the maximum length of the document vectors in partition k that have not yet been checked; if so, go to step 20, otherwise go to step 7;
In step 20, determine whether the difference between the partition upper bound and the length of document vector d_i is less than or equal to (1-t) times the partition upper bound; if so, go to step 21, otherwise go to step 3. t is the threshold of the Jaccard similarity; its value depends on the actual situation and ranges from 0.5 to 1;
In step 21, take the interval upper boundary p_{k+1} as the interval lower boundary p_k, moving on to the next partition; go to step 6;
In step 22, return the result pair set; go to step 23;
In step 23, the whole procedure ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410779353.6A CN104392002B (en) | 2014-12-15 | 2014-12-15 | Near-duplicate detection method for large-scale web page collections |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410779353.6A CN104392002B (en) | 2014-12-15 | 2014-12-15 | Near-duplicate detection method for large-scale web page collections |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104392002A CN104392002A (en) | 2015-03-04 |
CN104392002B true CN104392002B (en) | 2017-09-26 |
Family
ID=52609906
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410779353.6A Active CN104392002B (en) | 2014-12-15 | 2014-12-15 | Near-duplicate detection method for large-scale web page collections |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104392002B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526719B (en) * | 2016-06-19 | 2020-10-09 | 北京云量数盟科技有限公司 | Chinese document gene extraction method based on mixed features |
CN110209659A (en) * | 2019-06-10 | 2019-09-06 | 广州合摩计算机科技有限公司 | A kind of resume filter method, system and computer readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101226533A (en) * | 2007-12-28 | 2008-07-23 | 腾讯科技(北京)有限公司 | Method and system for arranging web page again |
CN101645082A (en) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page duplicate-removing system based on parallel programming mode |
CN101714147A (en) * | 2008-10-06 | 2010-05-26 | 易搜比控股公司 | Method for filtering same or similar files |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809710B2 (en) * | 2001-08-14 | 2010-10-05 | Quigo Technologies Llc | System and method for extracting content for submission to a search engine |
Also Published As
Publication number | Publication date |
---|---|
CN104392002A (en) | 2015-03-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106295796B (en) | entity link method based on deep learning | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
Ljubešić et al. | hrWaC and slWaC: Compiling web corpora for Croatian and Slovene | |
CN107861939A (en) | A kind of domain entities disambiguation method for merging term vector and topic model | |
CN107122340B (en) | A kind of similarity detection method of the science and technology item return based on synonym analysis | |
CN109241294A (en) | A kind of entity link method and device | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
CN103488648A (en) | Multilanguage mixed retrieval method and system | |
CN104636465A (en) | Webpage abstract generating methods and displaying methods and corresponding devices | |
US20190163737A1 (en) | Method and apparatus for constructing binary feature dictionary | |
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN104866572A (en) | Method for clustering network-based short texts | |
CN102622338A (en) | Computer-assisted computing method of semantic distance between short texts | |
CN104199965A (en) | Semantic information retrieval method | |
CN109670014A (en) | A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning | |
CN110705261B (en) | Chinese text word segmentation method and system thereof | |
CN105589976A (en) | Object entity determining method and device based on semantic correlations | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN104392002B (en) | Near-duplicate detection method for large-scale web page collections | |
CN106649251B (en) | A kind of method and device of Chinese word segmentation | |
Chader et al. | Sentiment Analysis for Arabizi: Application to Algerian Dialect. | |
CN105574004B (en) | A kind of removing duplicate webpages method and apparatus | |
CN111522945A (en) | Poetry style analysis method based on chi-square test | |
Rinjeni et al. | Matching Scientific Article Titles using Cosine Similarity and Jaccard Similarity Algorithm | |
Saeidi et al. | Graph representation learning in document wikification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||