CN104392002B - Near-duplicate search method for large-scale web page collections - Google Patents

Near-duplicate search method for large-scale web page collections

Info

Publication number
CN104392002B
Authority
CN
China
Prior art keywords
perform
turn
document
document vector
delta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410779353.6A
Other languages
Chinese (zh)
Other versions
CN104392002A (en)
Inventor
张鹏
熊翠文
刘庆云
杨嵘
郑超
刘俊朋
李舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201410779353.6A
Publication of CN104392002A
Application granted
Publication of CN104392002B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a near-duplicate search method for large-scale web page collections. Point signatures of documents are used to filter noise from web page content, and the near-duplicate search is completed by combining partitioning with inverted-index pruning, which makes the search efficient. Because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.

Description

Near-duplicate search method for large-scale web page collections
Technical field
The present invention relates to information retrieval technology, and in particular to a method for near-duplicate search of core content in large-scale web page collections.
Background
With the rapid development of the Internet, the amount of data stored online keeps growing, and search engines have become the main means of quickly finding the data one wants. Although search engines eliminate duplicate pages from the results they return, near-duplicate pages keep appearing in the corpus. To obtain more accurate and valuable results, these near-duplicate pages must be found and removed. However, traditional near-duplicate search methods have many shortcomings. Shingling methods compute Jaccard overlap over shingles or n-gram subsets, which is too complex. Fingerprinting methods select bit strings based on document offsets, which incurs considerable overhead. Locality-sensitive hashing maps multiple signatures to a single hash value, which adds the overhead of signature extraction and hashing.
Summary of the invention
Technical problem solved by the invention: to overcome the shortcomings of the prior art by providing a near-duplicate search method for large-scale web page collections. Point signatures of documents are used to filter noise from web page content, and the near-duplicate search is completed by combining partitioning with inverted-index pruning, which makes the search efficient. Because only the Jaccard similarity of point signatures is computed, the complexity of the method is very low.
Technical solution of the invention: a near-duplicate search method for large-scale web page collections, with the process shown in Fig. 1:
In step 0, start executing the method and go to step 1;
In step 1, input the document vectors with weighted point signatures, the partition k with boundary [p_k, p_{k+1}), and the inverted list (empty at this point), then go to step 2; [p_k, p_{k+1}) is a range of document lengths and is a half-open interval, where p_k is the lower boundary and p_{k+1} is the upper boundary;
In step 2, initialize the pair set that stores the results to empty and go to step 3;
In step 3, determine whether the sequence of document vectors, randomly ordered with respect to document length, still contains a document vector d_i that has not been processed; if so, go to step 4, otherwise go to step 22;
In step 4, set the lower boundary p_k of partition k to the length of document vector d_i and go to step 5;
In step 5, sort all point signatures of d_i in partition k in ascending order of frequency, so that the point signature with the lowest frequency comes first, and go to step 6;
In step 6, set the first auxiliary boundary delta1 to 0, set the set of checked document vectors to empty, and go to step 7;
In step 7, determine whether d_i has a point signature s_ij that has not yet been processed, where j is the position of the point signature in d_i; if so, go to step 8, otherwise go to step 20;
In step 8, insert the document vectors in partition k that are associated with s_ij into the inverted list list_kj, arrange them in descending order of document length, and go to step 9;
In step 9, set the second auxiliary boundary delta2 to 0 and go to step 10;
In step 10, determine whether list_kj, arranged in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;
In step 11, set delta2 to the difference between the lengths of d_i and d_i' (the length of d_i minus the length of d_i') and go to step 12;
In step 12, determine whether d_i and d_i' are equal or d_i' has already been checked; if either holds, go to step 10, otherwise go to step 13;
In step 13, determine whether delta2 is less than 0 and delta1 - delta2 is greater than (1 - t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;
In step 14, determine whether delta2 is greater than or equal to 0 and delta1 + delta2 is greater than (1 - t) times the length of d_i; if so, go to step 18, otherwise go to step 15;
In step 15, determine whether the Jaccard similarity of d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;
In step 16, add <d_i, d_i'> to the result pair set and go to step 17; <d_i, d_i'> is the document pair formed by document vectors d_i and d_i'. The two vectors match when their Jaccard similarity is greater than or equal to the threshold, in which case they are added to the pair set in the form <d_i, d_i'>; otherwise they are not added.
In step 17, add d_i' to the set of checked document vectors and go to step 10;
In step 18, increase delta1 by the frequency of point signature s_ij in document vector d_i and go to step 19;
In step 19, determine whether delta1 is greater than (1 - t) times the maximum length of the document vectors in partition k that have not been checked; if so, go to step 20, otherwise go to step 7;
In step 20, determine whether the difference between the partition upper bound and the length of d_i is less than or equal to (1 - t) times the partition upper bound; if so, go to step 21, otherwise go to step 3; t is the threshold for the Jaccard similarity, chosen according to the actual situation and typically between 0.5 and 1;
In step 21, take the interval upper boundary p_{k+1} as the new interval lower boundary p_k, thereby iterating to the next partition, and go to step 6;
In step 22, return the result pair set and go to step 23;
In step 23, end the whole program.
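The per-pair decision in steps 11 and 13 to 15 can be sketched in Python as follows. This is only an illustrative reading of those steps, not the patented implementation: the function name `check_pair`, its return labels, and the lazily evaluated `similarity` callback are assumptions introduced here, and the equality and already-checked test of step 12 is left to the caller.

```python
from typing import Callable


def check_pair(len_i: float, len_j: float, delta1: float, t: float,
               similarity: Callable[[], float]) -> str:
    """Decide what to do with candidate d_i' (length len_j) for document d_i (length len_i).

    Returns one of (labels are hypothetical):
      "skip"     - step 13 prunes this candidate; continue with the next one (step 10)
      "stop"     - step 14 prunes the rest of this inverted list (go to step 18)
      "match"    - step 15 succeeds; the pair belongs in the pair set (step 16)
      "no_match" - step 15 fails; continue with the next candidate (step 10)
    """
    delta2 = len_i - len_j  # step 11: length difference
    if delta2 < 0 and delta1 - delta2 > (1 - t) * len_j:   # step 13
        return "skip"
    if delta2 >= 0 and delta1 + delta2 > (1 - t) * len_i:  # step 14
        return "stop"
    # Step 15: the similarity is computed only when neither bound prunes the pair.
    return "match" if similarity() >= t else "no_match"


# A pair with very different lengths is pruned without computing the similarity.
print(check_pair(10, 20, 0, 0.8, lambda: 0.0))
```

Passing the similarity as a callback keeps the sketch faithful to the purpose of the length bounds, which is to avoid the Jaccard computation whenever possible.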
Compared with the prior art, the advantages of the invention are:
(1) Existing near-duplicate page detection methods fall into three main kinds: shingling algorithms, fingerprinting, and locality-sensitive hashing. Shingling algorithms compute Jaccard overlap over shingles or n-grams, and their complexity is too high; the "super shingles" improvement built on them also slightly reduces the precision of the results. Fingerprinting extracts the hash values of representative words or whole sentences in a document and concatenates their bits into a fingerprint feature used to judge document similarity, but selecting the bit strings produces considerable overhead. Locality-sensitive hashing joins multiple signatures from each data object into a single hash value through independent hash functions, which adds the overhead of signature extraction and multiple hashing.
(2) The invention uses short chains formed by stop-word antecedents and adjacent content terms as document signatures, so signature extraction is fast and cheap. When computing Jaccard similarity, set partitioning narrows the range of computation and avoids computing the similarity between clearly dissimilar document vectors, and inverted-index pruning eliminates redundant computation and speeds up the calculation. The method of the invention is therefore accurate, fast, and of low computational complexity.
Brief description of the drawings
Fig. 1 is a flowchart of the implementation of the method of the invention;
Fig. 2 is an execution diagram of the point-signature near-duplicate search method of the invention.
Detailed description
Before the invention is described, the related concepts are first explained.
Point (p): a word that occurs frequently in documents, called a stop word in natural-language text, such as is, the, do, have.
Ordinary word: a word in a document that is not a point.
Point distance (d): the number of words separating an ordinary word from the point selected before it, or from an ordinary word before it; points are not counted. For example, in "a rally to kick", the words "a" and "to" both occur frequently in documents and can serve as points. The point distance between "rally" and "a" is 1, between "kick" and "a" is 2, between "kick" and "rally" is 1 (because "to", as a point, is not counted), and between "kick" and "to" is 1.
Chain length (c): the number of ordinary words that satisfy the point distance relative to a given point. For example, in "a rally to kick", relative to the point "a" with point distance 1, the chain length is 2.
Point signature (s): a chain of c consecutive ordinary words in which adjacent ordinary words are separated by the point distance d and the first ordinary word is separated from the point by d. It is written p(d, c), that is, point signature = point(point distance, chain length). For example, in "a rally to kick", choosing "a" as the point with point distance d = 1 and chain length c = 2 gives the point signature s = "a"(1,2), whose value is {"a":"rally":"kick"}.
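To make the definition concrete, here is a minimal Python sketch of extracting the occurrences of one point signature p(d, c) from a tokenized text. The function name `point_signatures`, the whitespace tokenization, and the explicit `stop_words` set are assumptions made for illustration; the patent does not prescribe an implementation.

```python
from typing import List, Set, Tuple


def point_signatures(tokens: List[str], point: str, d: int, c: int,
                     stop_words: Set[str]) -> List[Tuple[str, ...]]:
    """Return every occurrence of the point signature point(d, c) in tokens."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != point:
            continue
        # Non-stop words that follow this occurrence of the point;
        # stop words (points) are skipped and do not count toward the distance.
        following = [w for w in tokens[i + 1:] if w not in stop_words]
        # Take the words at distances d, 2d, ..., c*d from the point.
        chain = following[d - 1:d * c:d]
        if len(chain) == c:  # keep only complete chains
            found.append((point, *chain))
    return found


tokens = "a rally to kick".split()
print(point_signatures(tokens, "a", 1, 2, {"a", "to"}))
```

For the sentence "a rally to kick" with "a" and "to" as stop words, the only occurrence of "a"(1,2) is the chain a, rally, kick, which agrees with the example in the definition.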
Point signature frequency: the number of times a point signature occurs in a document. For example, in "a rally to kick off a weeklong campaign", the point signature s = "a"(1,2) occurs 2 times, so its frequency is 2.
Point signature weight: the point signature frequency.
Document length (|d|): the sum of the frequencies of all point signatures in the document. For example, in "a rally to kick off a weeklong campaign", the point signature s1 = "a"(1,2) occurs 2 times and the point signature s2 = "to"(1,2) occurs 1 time, so the document length is 2 + 1 = 3.
Document vector (d): the representation of a document as a sequence of point signatures with their frequencies, that is, document vector = {point signature 1: frequency of point signature 1, point signature 2: frequency of point signature 2, ...}. For example, d = "a rally to kick off a weeklong campaign" is represented as the document vector d = {s1:2, s2:1}, where s1 = "a"(1,2) and s2 = "to"(1,2), and the document length is |d| = 3.
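A document vector of this kind maps naturally onto a frequency counter. The sketch below assumes the point-signature occurrences have already been extracted and are identified by the labels "s1" and "s2" used in the example; the helper names `document_vector` and `document_length` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def document_vector(occurrences: Iterable[str]) -> Dict[str, int]:
    """Count how often each point signature occurs in one document."""
    return dict(Counter(occurrences))


def document_length(vector: Dict[str, int]) -> int:
    """Document length |d|: the sum of all point signature frequencies."""
    return sum(vector.values())


# "a rally to kick off a weeklong campaign":
# two occurrences of s1 = "a"(1,2) and one occurrence of s2 = "to"(1,2).
d = document_vector(["s1", "s1", "s2"])
print(d, document_length(d))
```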
Partition k: a range of document lengths. It is a half-open interval whose lower bound is fixed and whose upper bound is not fixed. [p_k, p_{k+1}) is the document length range, where p_k is the lower boundary and p_{k+1} is the upper boundary.
Inverted list (list): the list of document vectors that share the same point signature, that is, inverted list = {document vector 1, document vector 2, ...}, where each document vector in the list carries the frequency of that signature. For example, let s1 = "a"(1,2), s2 = "to"(1,2), and s3 = "the"(2,1), with d1 = "a rally to kick off a weeklong campaign", d1 = {s1:2, s2:1}; d2 = "a biological argument for the cougar phenomenon", d2 = {s1:1, s3:1}; and d3 = "the past can be corrected", d3 = {s3:1}. Then list1: {d1:2, d2:1} is the inverted list for point signature s1 = "a"(1,2), list2: {d1:1} is the inverted list for point signature s2 = "to"(1,2), and list3: {d2:1, d3:1} is the inverted list for point signature s3 = "the"(2,1).
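The inverted lists in this example can be reproduced with a short Python sketch. The function name `build_inverted_lists` and the dictionary layout are assumptions for illustration; sorting each list by descending document length follows step 8 of the method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Dict[str, int]


def build_inverted_lists(docs: Dict[str, Vector]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each point signature to the (document id, frequency) pairs that contain it."""
    lists: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, vector in docs.items():
        for signature, freq in vector.items():
            lists[signature].append((doc_id, freq))
    # Order every list by document length, longest document first.
    length = {doc_id: sum(vector.values()) for doc_id, vector in docs.items()}
    for entries in lists.values():
        entries.sort(key=lambda entry: length[entry[0]], reverse=True)
    return dict(lists)


docs = {
    "d1": {"s1": 2, "s2": 1},
    "d2": {"s1": 1, "s3": 1},
    "d3": {"s3": 1},
}
lists = build_inverted_lists(docs)
print(lists)
```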
Jaccard similarity (sim): for two document vectors, the sum over all point signatures of the smaller of the two frequencies, divided by the sum over all point signatures of the larger of the two frequencies. For the vectors d1 and d2 of the inverted list example, sim(d1, d2) = (1+0+0)/(2+1+1) = 0.25.
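This weighted form of the Jaccard similarity is straightforward to compute. The following Python sketch, with the hypothetical function name `jaccard`, treats a signature that is missing from a vector as having frequency 0 and returns 0.0 for two empty vectors, which is a convention assumed here rather than stated in the patent.

```python
from typing import Dict


def jaccard(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Weighted Jaccard similarity: sum of minimum frequencies over sum of maximum frequencies."""
    signatures = set(a) | set(b)
    smaller = sum(min(a.get(s, 0), b.get(s, 0)) for s in signatures)
    larger = sum(max(a.get(s, 0), b.get(s, 0)) for s in signatures)
    return smaller / larger if larger else 0.0


d1 = {"s1": 2, "s2": 1}
d2 = {"s1": 1, "s3": 1}
print(jaccard(d1, d2))  # (1+0+0)/(2+1+1) = 0.25
```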
Threshold (t): the critical value for the Jaccard similarity, chosen according to the actual situation and typically between 0.5 and 1.
Auxiliary boundary (delta): a variable used to temporarily record a value.
Pair set {<d_i, d_i'>}: the set of document pairs. Two document vectors d_i and d_i' match when their Jaccard similarity is greater than or equal to the threshold, in which case they are added to the pair set in the form "<document vector 1, document vector 2>", that is, "<d_i, d_i'>"; otherwise they do not match and are not added to the pair set.
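As a hedged aside, the link between the threshold t and the document-length tests can be written out. Writing f_s(d) for the frequency of point signature s in d (a symbol introduced here only for illustration), the definitions of Jaccard similarity and document length give the following bound, which appears to be the reasoning behind the (1 - t) length comparisons used in the method:

```latex
\mathrm{sim}(d_i, d_{i'})
  = \frac{\sum_{s} \min\bigl(f_s(d_i), f_s(d_{i'})\bigr)}
         {\sum_{s} \max\bigl(f_s(d_i), f_s(d_{i'})\bigr)}
  \le \frac{\min\bigl(|d_i|, |d_{i'}|\bigr)}{\max\bigl(|d_i|, |d_{i'}|\bigr)},
\qquad
\mathrm{sim}(d_i, d_{i'}) \ge t
  \;\Longrightarrow\;
  \max\bigl(|d_i|, |d_{i'}|\bigr) - \min\bigl(|d_i|, |d_{i'}|\bigr)
  \le (1 - t)\,\max\bigl(|d_i|, |d_{i'}|\bigr).
```

In words, two documents whose lengths differ by more than (1 - t) times the longer length cannot reach the threshold, so they never need to be compared.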
The invention is described in detail below with reference to the above.
The execution of near-duplicate removal on a web page collection is shown in Fig. 2. The web document vectors computed in Fig. 2 are d1 = {s1:5, s2:4, s3:4}, d2 = {s1:8, s2:4}, and d3 = {s1:4, s2:5, s3:5}. The lengths of d1, d2, and d3 are 13, 12, and 14 respectively, and the threshold t is set to 0.8. The execution proceeds as follows:
(1) Input the document vectors d1, d2, and d3 with weighted point signatures, the partition with boundary [p_k, p_{k+1}), and the inverted list;
(2) Set the pair set to empty. The document vectors are arranged in random order of length, and d1 is randomly selected for comparison with the other document vectors;
(3) Set the partition lower boundary to the length of d1, which is 13. The frequencies of point signatures s1, s2, and s3 are 17, 13, and 9 respectively. Sort the point signatures in ascending order of frequency and select the document list under the least frequent point signature, s3. Set the auxiliary boundary delta1 to 0 and set the set of checked document vectors to empty;
(4) The document vectors under each point signature are arranged in descending order of document length. First traverse point signature s3: put its document vectors into the inverted list and check document vector d3. The auxiliary boundary delta2 = length of d1 - length of d3 = -1 < 0. d1 and d3 are not equal and d3 has not been checked; delta1 - delta2 (= 1) is less than (1 - 0.8) × 14 (= 2.8), and delta1 + delta2 (= -1) is less than (1 - 0.8) × 13 (= 2.6). The Jaccard similarity of d1 and the document vector d3 in the list of point signature s3 is therefore computed: sim(d1, d3) = (4+4+4)/(5+5+5) = 0.8 ≥ t, so <d1, d3> is added to the pair set;
(5) Check document vector d1 in the inverted list of point signature s3. d1 equals d1, so nothing is computed. delta1 is increased by the frequency of point signature s3 in d1, giving delta1 = 4, and this traversal ends. Traverse the document vectors of the second point signature, s1: set delta1 to 0, put the document vectors of s1 into the list, and check document vector d2. The auxiliary boundary delta2 = length of d1 - length of d2 = 1 > 0. d1 and d2 are not equal and d2 has not been checked; delta1 - delta2 (= -1) is less than (1 - 0.8) × 12 (= 2.4), and delta1 + delta2 (= 1) is less than (1 - 0.8) × 13 (= 2.6). The Jaccard similarity of d1 and the document vector d2 in the list of point signature s1 is therefore computed: sim(d1, d2) = (5+4)/(8+4+4) = 0.5625 < t, so <d1, d2> is not added to the pair set;
(6) Continue traversing the inverted list. In the list, d1 equals d1 and d3 has already been checked. delta1 is increased by the frequency of point signature s1 in d1, giving delta1 = 5, and this traversal ends. Traverse the document set of the third point signature, s2: set delta1 to 0; in the list, d1 equals d1 and d3 and d2 have already been checked, so this traversal ends. Return the pair set {<d1, d3>} and end the program.
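The arithmetic in this worked example can be checked with a few lines of Python. The helper names `length` and `jaccard` are assumptions introduced for the check; the vectors and the threshold are taken directly from the example.

```python
from typing import Dict

d1 = {"s1": 5, "s2": 4, "s3": 4}
d2 = {"s1": 8, "s2": 4}
d3 = {"s1": 4, "s2": 5, "s3": 5}
t = 0.8


def length(d: Dict[str, int]) -> int:
    """Document length: sum of point signature frequencies."""
    return sum(d.values())


def jaccard(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Weighted Jaccard similarity of two document vectors."""
    keys = set(a) | set(b)
    return (sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
            / sum(max(a.get(k, 0), b.get(k, 0)) for k in keys))


lengths = [length(d) for d in (d1, d2, d3)]
# Total frequency of each point signature across the three documents.
totals = {s: sum(d.get(s, 0) for d in (d1, d2, d3)) for s in ("s1", "s2", "s3")}
sim_13 = jaccard(d1, d3)  # (4+4+4)/(5+5+5)
sim_12 = jaccard(d1, d2)  # (5+4)/(8+4+4)

print(lengths, totals)
print(sim_13 >= t, sim_12 >= t)
```

Only the pair d1, d3 reaches the threshold, which agrees with the pair set returned at the end of the example.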
The above example is provided only to describe the invention and is not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the invention fall within the scope of the invention.

Claims (1)

1. A near-duplicate search method for large-scale web page collections, characterized in that it is implemented by the following steps:
Step 1, input the document vectors with weighted point signatures, the partition k with boundary [p_k, p_{k+1}), and the inverted list, then go to step 2; [p_k, p_{k+1}) is a range of document lengths and is a half-open interval, where p_k is the lower boundary and p_{k+1} is the upper boundary;
Step 2, initialize the pair set that stores the results to empty and go to step 3;
In step 3, determine whether the sequence of document vectors, randomly ordered with respect to document length, still contains a document vector d_i that has not been processed; if so, go to step 4, otherwise go to step 22;
In step 4, set the lower boundary p_k of partition k to the length of document vector d_i and go to step 5;
In step 5, sort all point signatures of document vector d_i in partition k in ascending order of frequency, so that the point signature with the lowest frequency comes first, and go to step 6;
In step 6, set the first auxiliary boundary delta1 to 0, set the set of checked document vectors to empty, and go to step 7;
In step 7, determine whether document vector d_i has a point signature s_ij that has not yet been processed, where j is the position of the point signature in document vector d_i; if so, go to step 8, otherwise go to step 20;
In step 8, insert the document vectors in partition k that are associated with point signature s_ij into the inverted list list_kj, arrange them in descending order of document length, and go to step 9;
In step 9, set the second auxiliary boundary delta2 to 0 and go to step 10;
In step 10, determine whether the inverted list list_kj, arranged in descending order of document length, contains a further document vector d_i'; if so, go to step 11, otherwise go to step 18;
In step 11, set the second auxiliary boundary delta2 to the difference between the lengths of document vectors d_i and d_i' and go to step 12;
In step 12, determine whether document vectors d_i and d_i' are equal or document vector d_i' has already been checked; if either holds, go to step 10, otherwise go to step 13;
In step 13, determine whether delta2 is less than 0 and delta1 - delta2 is greater than (1 - t) times the length of d_i'; if so, go to step 10, otherwise go to step 14;
In step 14, determine whether the second auxiliary boundary delta2 is greater than or equal to 0 and delta1 + delta2 is greater than (1 - t) times the length of d_i; if so, go to step 18, otherwise go to step 15;
In step 15, determine whether the Jaccard similarity of document vectors d_i and d_i' is greater than or equal to t; if so, go to step 16, otherwise go to step 10;
In step 16, add <d_i, d_i'> to the result pair set and go to step 17; <d_i, d_i'> is the document pair formed by document vectors d_i and d_i'; the two vectors match when their Jaccard similarity is greater than or equal to the threshold, in which case they are added to the pair set in the form <d_i, d_i'>, and otherwise they are not added to the pair set;
In step 17, add d_i' to the set of checked document vectors and go to step 10;
In step 18, increase delta1 by the frequency of point signature s_ij in document vector d_i and go to step 19;
In step 19, determine whether delta1 is greater than (1 - t) times the maximum length of the document vectors in partition k that have not been checked; if so, go to step 20, otherwise go to step 7;
In step 20, determine whether the difference between the partition upper bound and the length of document vector d_i is less than or equal to (1 - t) times the partition upper bound; if so, go to step 21, otherwise go to step 3; t is the threshold for the Jaccard similarity, chosen according to the actual situation, with a value range of 0.5 to 1;
In step 21, take the interval upper boundary p_{k+1} as the new interval lower boundary p_k, thereby iterating to the next partition, and go to step 6;
In step 22, return the result pair set and go to step 23;
In step 23, end the whole program.
CN201410779353.6A 2014-12-15 2014-12-15 Near-duplicate search method for large-scale web page collections Active CN104392002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410779353.6A CN104392002B (en) 2014-12-15 2014-12-15 Near-duplicate search method for large-scale web page collections

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410779353.6A CN104392002B (en) 2014-12-15 2014-12-15 Near-duplicate search method for large-scale web page collections

Publications (2)

Publication Number Publication Date
CN104392002A CN104392002A (en) 2015-03-04
CN104392002B true CN104392002B (en) 2017-09-26

Family

ID=52609906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410779353.6A Active CN104392002B (en) 2014-12-15 2014-12-15 Near-duplicate search method for large-scale web page collections

Country Status (1)

Country Link
CN (1) CN104392002B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526719B (en) * 2016-06-19 2020-10-09 北京云量数盟科技有限公司 Chinese document gene extraction method based on mixed features
CN110209659A (en) * 2019-06-10 2019-09-06 广州合摩计算机科技有限公司 A kind of resume filter method, system and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226533A (en) * 2007-12-28 2008-07-23 腾讯科技(北京)有限公司 Method and system for arranging web page again
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN101714147A (en) * 2008-10-06 2010-05-26 易搜比控股公司 Method for filtering same or similar files

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809710B2 (en) * 2001-08-14 2010-10-05 Quigo Technologies Llc System and method for extracting content for submission to a search engine

Also Published As

Publication number Publication date
CN104392002A (en) 2015-03-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant