CN104809182A - Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) - Google Patents

Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)

Info

Publication number
CN104809182A
Authority
CN
China
Legal status
Granted
Application number
CN201510185467.2A
Other languages
Chinese (zh)
Other versions
CN104809182B (en)
Inventor
杨鹏
袁志伟
刘旋
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
2015-04-17
Filing date
2015-04-17
Publication date
2015-07-29
Application filed by Southeast University
Priority to CN201510185467.2A
Publication of CN104809182A
Application granted
Publication of CN104809182B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a method for web crawler URL (uniform resource locator) deduplication based on a DSBF (dynamic splitting Bloom Filter). Unlike the fixed-structure Bloom Filters that evenly carry the URL storage load in the Internet Archive crawler and the Apoide crawler, the DSBF has a dynamically extensible structure that can be flexibly split into multiple layers as needed. The method has the following advantages: as the number of processed URLs keeps growing, the false positive rate of the Bloom Filter can still be kept within a set range, and the Bloom Filter has a flexible storage structure that is easy to distribute. The method is therefore well suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.

Description

Web crawler URL deduplication method based on a dynamically splittable Bloom Filter
Technical field
The present invention relates to a URL deduplication method for web crawlers. The method can be used to implement large-scale, distributed, high-performance web crawler applications. Specifically, it is a web crawler URL deduplication method based on a dynamically splittable Bloom Filter, and belongs to the field of Internet technology.
Background technology
A web crawler (Web Crawler) is an important component of many Internet information acquisition systems. Starting from the URLs of web pages, it automatically crawls pages on the Internet according to certain rules. Because the number of URLs on the Internet runs into the hundreds of millions, and different URLs link to one another, a web crawler must, in order to avoid repeatedly crawling the same URL, determine during crawling whether the URL it is about to crawl has already been crawled; this process is called URL deduplication. The key to URL deduplication is how to store the information about already-crawled URLs in a set while guaranteeing that this set has good query performance. Whether URL deduplication is efficient is a key factor affecting the efficiency of the information acquisition system.
Bloom Filter is an effective solution that supports URL deduplication. A classical Bloom Filter consists of one binary bit array and k hash mapping functions (denoted MapHash); its structure is shown in Fig. 1, and it can be used to quickly test whether an element belongs to a given set. Fig. 1 shows a classical Bloom Filter whose bit array has length 12 and which uses 3 hash mapping functions. When inserting x_i into this Bloom Filter, the 3 hash mapping functions are first applied to x_i to produce 3 hash values; assuming the results are 0, 3 and 6, it suffices to set positions 0, 3 and 6 of the bit array to 1. Inserting x_j proceeds in the same way. When querying x_i, the same 3 hash mapping functions are applied to x_i, and the corresponding positions of the bit array are checked to see whether they are all 1. As the figure shows, x_i is in this Bloom Filter, while x_k is not. If the bit array of a Bloom Filter has length m, the codomain of each of its hash mapping functions is S = {0, 1, 2, ..., m-1}; a hash mapping function can be implemented directly by taking a common hash function (such as MD5, SHA, MurmurHash or BKDRHash) modulo m. In general, when a data element is inserted into a data set, the k hash mapping functions of the Bloom Filter are applied to the element to produce k hash values, and all bits of the bit array whose indices equal these k hash values are set to 1. Accordingly, to query whether a data element is in the data set, it suffices to apply the k hash mapping functions to the element and then check the values of the k bits of the Bloom Filter's bit array corresponding to these k hash values; the element is judged to be in the set only if they are all 1. Overall, the Bloom Filter is a data structure with very high space efficiency and time efficiency. However, if the k bits corresponding to an element's k hash values have all been set to 1 by other elements inserted earlier, a false positive occurs when that element is queried. The false positive rate of a Bloom Filter depends mainly on factors such as the size of the set, the length of the bit array, and the number of hash mapping functions.
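To illustrate the classical structure just described (an illustrative sketch, not code from the patent; the class and method names are chosen here for clarity), a minimal Python Bloom Filter whose k mapping functions are derived from a common hash function taken modulo m might look as follows:

```python
import hashlib


class ClassicalBloomFilter:
    """Minimal classical Bloom Filter: one bit array of length m and k
    hash mapping functions (MapHash), each a salted base hash taken mod m."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _map_hash(self, item: str, i: int) -> int:
        # i-th MapHash: salt the input with the function index, hash, take mod m
        digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for i in range(self.k):
            self.bits[self._map_hash(item, i)] = 1

    def __contains__(self, item: str) -> bool:
        # True only if all k mapped bits are 1 (may still be a false positive)
        return all(self.bits[self._map_hash(item, i)] for i in range(self.k))


if __name__ == "__main__":
    bf = ClassicalBloomFilter(m=12, k=3)
    bf.add("http://example.com/a")
    print("http://example.com/a" in bf)  # True
    print("http://example.com/b" in bf)  # False (with high probability)
```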
At present, the existing URL deduplication schemes based on Bloom Filters have shortcomings in efficiency, scalability and performance. For example, the Internet Archive crawler uses a 32KB Bloom Filter to store all URLs of each website. The Apoide crawler uses an 8KB Bloom Filter to store all URLs of each website, and decides in which Bloom Filter a given URL is stored according to the hash value of the website's domain name. In general, the organization of the Bloom Filters in these schemes is relatively fixed. For some large portal websites (such as the Sina website or www.xinhuanet.com), the URLs under the domain name are so numerous that a fixed-size Bloom Filter of limited length is usually insufficient to store them; for some small websites, on the other hand, using an oversized Bloom Filter inevitably wastes storage space. In addition, some information acquisition systems with high performance requirements often deploy a large number of web crawlers in a distributed network environment to crawl pages in parallel. For the URL deduplication problem in this application scenario, the existing Bloom Filter and most of its improved variants are difficult to adapt.
Summary of the invention
Objects of the invention: in view of the problems and deficiencies of the prior art, the invention provides a web crawler URL deduplication method based on a dynamically splittable Bloom Filter. The method is built on a dynamically splittable Bloom Filter (abbreviated DSBF). Unlike the fixed-structure Bloom Filters that evenly carry the URL storage load in the Internet Archive crawler and the Apoide crawler, the DSBF has a dynamically extensible structure that can be flexibly split into multiple layers as needed. Implementing web crawler URL deduplication on the basis of a dynamically splittable Bloom Filter both ensures that, as the number of processed URLs keeps growing, the false positive rate of the Bloom Filter can still be kept within a given range, and gives the Bloom Filter a flexible storage structure that is easy to distribute. It is therefore better suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.
Technical scheme: a web crawler URL deduplication method based on a dynamically splittable Bloom Filter, comprising:
(1) The dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure. Two kinds of hash functions are used in the dynamically splittable Bloom Filter: one is the hash mapping function (MapHash) of the classical Bloom Filter, and the other is a newly introduced hash locating function (denoted LocHash). One hash locating function is stored in the tree root (denoted layer 0). In every layer except the root, each leaf node holds a classical Bloom Filter (called a leaf Bloom Filter), which records the URL information the web crawler has already crawled; each non-leaf node holds a hash locating function, which further routes a URL to the corresponding child node in the next layer down. In the dynamically splittable Bloom Filter, to support flexible splitting of leaf nodes, each leaf Bloom Filter is associated with a URL set whose elements are all URLs that have been inserted into that leaf Bloom Filter. The web crawler URL deduplication method based on the dynamically splittable Bloom Filter mainly involves four kinds of operations: initialization, insertion, splitting and querying.
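To make the tree-shaped structure concrete, the following minimal Python sketch (purely illustrative; the class and helper names are assumptions, not taken from the patent) models the two node types: a leaf node holds a Bloom Filter plus its associated URL set, and a non-leaf node holds a LocHash that routes each URL to one of its children.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Set, Union


def make_loc_hash(c: int, salt: str) -> Callable[[str], int]:
    """Hash locating function (LocHash) with codomain {0, ..., c-1}:
    a salted base hash taken modulo the number of child nodes."""
    def loc_hash(url: str) -> int:
        digest = hashlib.md5(f"{salt}:{url}".encode("utf-8")).hexdigest()
        return int(digest, 16) % c
    return loc_hash


@dataclass
class LeafNode:
    """A leaf Bloom Filter plus the associated URL set kept to allow later splitting."""
    bloom: object                       # e.g. the ClassicalBloomFilter sketched earlier
    url_set: Set[str] = field(default_factory=set)


@dataclass
class InternalNode:
    """A non-leaf node: a LocHash that routes each URL to one of its child nodes."""
    loc_hash: Callable[[str], int]
    children: List[Union["InternalNode", LeafNode]]


def init_dsbf(c: int, leaf_factory: Callable[[], object]) -> InternalNode:
    """Layer 0 (root) holds one LocHash; layer 1 holds c leaf Bloom Filters."""
    leaves = [LeafNode(leaf_factory()) for _ in range(c)]
    return InternalNode(loc_hash=make_loc_hash(c, salt="layer1"), children=leaves)
```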
(2) Initialization of the dynamically splittable Bloom Filter. First, the number of nodes in the first layer is determined from an estimate of the number of URLs to be processed. The calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f. From these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter. Then, a hash locating function (LocHash) with codomain {0, 1, 2, ..., c-1} is selected and stored in the root node of the tree. Finally, c leaf Bloom Filters are placed in the first layer of the tree.
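The concrete formulas for c, u and k are given as figures in the original publication and are not reproduced in this text. For orientation only, the sketch below computes the three parameters from the standard Bloom Filter false-positive analysis (f ≈ (1 − e^(−ku/m))^k with k close to (m/u)·ln 2); these expressions are an assumption of this sketch and may differ in detail from the patent's own formulas.

```python
import math


def dsbf_parameters(n: int, m: int, f: float):
    """Estimate DSBF initialization parameters from n, m and f using the
    standard Bloom Filter analysis (an assumption of this sketch).

    n: estimated number of URLs to store
    m: bit array length of each leaf Bloom Filter
    f: required upper bound on the false positive rate
    """
    k = math.ceil(math.log(1 / f, 2))                         # MapHash functions per leaf
    u = math.floor(m * (math.log(2) ** 2) / math.log(1 / f))  # max URLs per leaf at rate f
    c = math.ceil(n / u)                                      # first-layer leaf Bloom Filters
    return c, u, k


# Example: 1,000,000 URLs, leaves of 2*10**7 bits, false positive rate below 1/10000
print(dsbf_parameters(n=1_000_000, m=2 * 10**7, f=1e-4))
```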
(3) Insertion into the dynamically splittable Bloom Filter. When the web crawler inserts URL information into the dynamically splittable Bloom Filter, h hash locating computations must be performed on the URL (h >= 1; h is the length of the path in the tree from the root node to a leaf node) to determine into which leaf Bloom Filter this URL information is inserted. The concrete insertion process is as follows: first, the hash locating function of the root node is applied to the URL to determine in which child node of the root the URL information is to be stored. Then, it is checked whether that child node is a leaf node. If so, the URL is inserted into the URL set associated with that leaf Bloom Filter, and at the same time the URL information is inserted into the bit array of the leaf Bloom Filter via the hash mapping functions. If not, the hash locating function of that non-leaf node is used to continue the above process, until the URL is finally located in, and inserted into, a leaf Bloom Filter and its associated URL set.
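Continuing the illustrative sketch above (the InternalNode/LeafNode classes are assumed from that sketch, not from the patent), the insertion step could be written as:

```python
def dsbf_insert(root, url: str) -> None:
    """Insert a URL: walk down via the LocHash of each non-leaf node, then add
    the URL to the leaf's associated URL set and to its bit array (MapHash)."""
    node = root
    while isinstance(node, InternalNode):
        node = node.children[node.loc_hash(url)]  # one hash locating computation per layer
    node.url_set.add(url)   # remember the URL so the leaf can be split later
    node.bloom.add(url)     # set the k MapHash bits in the leaf Bloom Filter
```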
(4) Splitting of the dynamically splittable Bloom Filter. When the number of URLs stored by some leaf Bloom Filter of the dynamically splittable Bloom Filter (denote it node l) exceeds u, the false positive rate of that leaf Bloom Filter would exceed f, so leaf node l must be split. The concrete splitting process is as follows: first, c new leaf Bloom Filters are added in the layer below the node l to be split, and they become the child nodes of node l. Then a hash locating function is added to the node l to be split. Next, all URLs in the URL set associated with node l are hashed again with this hash locating function, so that each of these URLs is relocated and inserted into the corresponding child node (leaf Bloom Filter) of node l. Finally, the leaf Bloom Filter stored in node l and its associated URL set are deleted, and only the hash locating function in node l is retained.
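A corresponding splitting step, again continuing the same illustrative sketch (make_loc_hash, InternalNode and LeafNode are the assumed helpers defined there), might look like this:

```python
def dsbf_split(parent, index: int, c: int, leaf_factory, layer_salt: str) -> None:
    """Split the over-full leaf parent.children[index]: it becomes an internal
    node with its own LocHash and c fresh leaf Bloom Filters, and every URL in
    its associated URL set is re-hashed and re-inserted one layer down."""
    old_leaf = parent.children[index]
    new_node = InternalNode(
        loc_hash=make_loc_hash(c, salt=layer_salt),   # a different LocHash for the new layer
        children=[LeafNode(leaf_factory()) for _ in range(c)],
    )
    for url in old_leaf.url_set:                      # hash again and relocate every stored URL
        child = new_node.children[new_node.loc_hash(url)]
        child.url_set.add(url)
        child.bloom.add(url)
    parent.children[index] = new_node                 # old leaf and its URL set are discarded
```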
(5) Querying the dynamically splittable Bloom Filter. When the web crawler queries URL information in the dynamically splittable Bloom Filter, the procedure is similar to insertion. First, the dynamically splittable Bloom Filter performs h (h >= 1) hash locating computations on the input URL to be checked, to determine in which leaf Bloom Filter this URL information would be stored. Then a concrete query operation is performed on the URL in that leaf Bloom Filter according to the query method of the classical Bloom Filter.
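The query step, in the same illustrative setting, simply repeats the locating walk of insertion and then applies the classical membership test in the leaf:

```python
def dsbf_contains(root, url: str) -> bool:
    """Query a URL: locate the leaf exactly as insertion does, then apply the
    classical Bloom Filter membership test in that leaf."""
    node = root
    while isinstance(node, InternalNode):
        node = node.children[node.loc_hash(url)]
    return url in node.bloom   # may return a false positive, never a false negative
```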
(6) Choice of hash functions. The dynamically splittable Bloom Filter is organized as a tree, and it uses hash locating functions to determine in which node of the tree a URL is stored. To make the leaf Bloom Filters of the tree carry the storage load as evenly as possible, different hash locating functions should be chosen in different layers (they can be based on different hash functions such as MD5, SHA, MurmurHash or BKDRHash), so as to avoid overly similar hash results across layers; within the same layer, however, the hash locating functions of the non-leaf nodes can be identical. Similar to the implementation of the hash mapping functions, a hash locating function can also be implemented by taking a common hash function modulo c. Usually m is much larger than c (for example, when the required false positive rate is below 1/10000, storing 1,000,000 URLs requires m to be at least 2×10^7, while c generally does not exceed 100), so the hash locating function and the hash mapping function can be based on the same hash function, implemented by taking it modulo c and modulo m respectively.
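As a small illustration of this choice (an assumption of this sketch: MD5 serves as the common base hash), a LocHash and a MapHash derived from the same base hash by taking it modulo c and modulo m could be built as follows:

```python
import hashlib


def base_hash(value: str) -> int:
    """A common base hash (MD5 here) interpreted as an integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def make_functions(c: int, m: int):
    """Derive a LocHash (codomain {0..c-1}) and a MapHash (codomain {0..m-1})
    from the same base hash by taking it modulo c and modulo m respectively."""
    def loc_hash(url: str) -> int:
        return base_hash(url) % c

    def map_hash(url: str) -> int:
        return base_hash(url) % m

    return loc_hash, map_hash


loc_hash, map_hash = make_functions(c=5, m=2 * 10**7)
print(loc_hash("http://example.com/page"), map_hash("http://example.com/page"))
```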
Beneficial effects: the dynamically splittable Bloom Filter not only keeps the advantages of the classical Bloom Filter, with outstanding time efficiency and space efficiency, but its tree-shaped storage structure also allows it to split flexibly into multiple layers as needed and thus dynamically extend its storage capacity. By spreading the URL deduplication task evenly over multiple leaf Bloom Filters, the web crawler URL deduplication method based on the dynamically splittable Bloom Filter is better suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.
Brief description of the drawings
Fig. 1 shows a classical Bloom Filter;
Fig. 2 shows the tree-structured dynamically splittable Bloom Filter obtained after initialization;
Fig. 3 shows the dynamically splittable Bloom Filter during the splitting process, where the node being split is the 4th node of the first layer, denoted DSBF_(root,3); the node DSBF_(root,3) being split still keeps its original leaf Bloom Filter and at the same time gains a new hash locating function, denoted LocHash_(root,3), used to locate URLs among the 5 nodes of the second layer;
Fig. 4 shows the dynamically splittable Bloom Filter after the splitting process has finished; its node DSBF_(root,3) stores the hash locating function LocHash_(root,3), and in subsequent operations every URL located to node DSBF_(root,3) from the layer above is hashed again by LocHash_(root,3) and thereby directed on to the child nodes of DSBF_(root,3) in the next layer.
Embodiment
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are only intended to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms of the present invention by those skilled in the art all fall within the scope defined by the claims appended to this application.
A web crawler URL deduplication method based on a dynamically splittable Bloom Filter comprises:
(1) First, a dynamically splittable Bloom Filter is constructed, and the bit array of each of its leaf Bloom Filters is stored in a Redis database. Redis is an in-memory database with excellent read/write performance, but its performance drops sharply when the stored content approaches or exceeds the memory size. Therefore, according to the scale and characteristics of the web crawler application, the length m of the bit array of each leaf Bloom Filter is first determined so that it is smaller than the memory size of the computer running the web crawler. The dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure, as shown in Fig. 2 and Fig. 4.
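One possible realization of a Redis-backed leaf bit array, sketched here under the assumption that the redis-py client is used (the key names, host and parameters are illustrative, not specified in the patent):

```python
import hashlib
import redis


class RedisLeafBloomFilter:
    """A leaf Bloom Filter whose bit array lives in Redis, using SETBIT/GETBIT
    on a single key; m must stay well below the memory of the Redis host."""

    def __init__(self, client: redis.Redis, key: str, m: int, k: int):
        self.client, self.key, self.m, self.k = client, key, m, k

    def _map_hash(self, url: str, i: int) -> int:
        return int(hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest(), 16) % self.m

    def add(self, url: str) -> None:
        for i in range(self.k):
            self.client.setbit(self.key, self._map_hash(url, i), 1)

    def __contains__(self, url: str) -> bool:
        return all(self.client.getbit(self.key, self._map_hash(url, i))
                   for i in range(self.k))


# Hypothetical usage: one Redis key per leaf Bloom Filter.
r = redis.Redis(host="localhost", port=6379)
leaf = RedisLeafBloomFilter(r, key="dsbf:leaf:0", m=2 * 10**7, k=14)
leaf.add("http://example.com/")
print("http://example.com/" in leaf)
```

A leaf realized this way can stand in for the in-memory ClassicalBloomFilter in the node sketch given earlier.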
Fig. 2 shows the tree-structured dynamically splittable Bloom Filter obtained after initialization. The root node of the tree holds a hash locating function, and the first layer of the tree contains 5 leaf Bloom Filters, each of which is a classical Bloom Filter. When the web crawler inserts or queries a URL, it must first determine the concrete storage location of the URL according to the hash locating function in the root node. Fig. 4 shows the dynamically splittable Bloom Filter after the splitting process has finished; its node DSBF_(root,3) stores the hash locating function LocHash_(root,3). In subsequent operations, every URL located to node DSBF_(root,3) from the layer above is hashed again by LocHash_(root,3) and thereby directed on to the child nodes of DSBF_(root,3) in the next layer.
(2) Initialization of the dynamically splittable Bloom Filter. First, the number of nodes in the first layer is determined from an estimate of the number of URLs to be processed. The calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f. From these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter. Then, a hash locating function with codomain {0, 1, 2, ..., c-1} is selected and stored in the root node of the tree. Finally, c leaf Bloom Filters are placed in the first layer of the tree.
Combining the application requirements and crawling strategy of the web crawler, the organization rules of the websites and web pages to be crawled on the Internet are analyzed so as to estimate the number n of URLs the web crawler application needs to process. Then, from m, n and the false positive rate upper bound f, following the description of (2) in the technical scheme, the parameters of the dynamically splittable Bloom Filter are calculated: the number c of first-layer nodes, the number k of hash mapping functions required by each leaf Bloom Filter, and the maximum number u of URLs each leaf Bloom Filter can store.
The dynamically splittable Bloom Filter with c = 5 obtained by this initialization procedure is shown in Fig. 2.
(3) According to the value of the parameter m, k hash functions with codomain {0, 1, 2, ..., m-1} are chosen; they serve as the hash mapping functions of the c leaf Bloom Filters of the first layer. A hash mapping function can in turn be implemented by taking a common hash function (such as MD5, SHA, MurmurHash or BKDRHash) modulo m. At the same time, a hash locating function with codomain {0, 1, 2, ..., c-1} is chosen for the root node of the dynamically splittable Bloom Filter; it can in turn be implemented by taking a common hash function modulo c.
(4) From the parameters m, n, f, c, k and u above, and the k hash mapping functions and 1 hash locating function determined, a dynamically splittable Bloom Filter is obtained by initialization, and the URL set associated with each of its leaf Bloom Filters is realized with a database (such as MySQL).
(5) Based on the dynamically splittable Bloom Filter constructed by the preceding method, URL deduplication for the web crawler is realized. The concrete method is as follows: when the web crawler is about to crawl a URL, it first queries this URL in the dynamically splittable Bloom Filter to judge whether it has already been crawled. If so, this URL is skipped (i.e. the URL is deduplicated), and the next URL is processed instead. If not, the URL is a new URL not yet stored in the dynamically splittable Bloom Filter, and it must now be inserted into some leaf Bloom Filter. So it is first judged whether the number of URLs stored in that leaf Bloom Filter has reached the upper limit u. If so, that leaf Bloom Filter is first split according to the description of (4) in the technical scheme, and the URL is then inserted into the split dynamically splittable Bloom Filter according to the description of (3) in the technical scheme. If not, the URL is directly inserted into that leaf Bloom Filter according to the description of (3) in the technical scheme.
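Putting the earlier illustrative pieces together (dsbf_split, dsbf_insert, InternalNode and the leaf classes are the assumed helpers from the sketches above, not code from the patent), the crawl-time deduplication decision could be expressed as:

```python
def deduplicate(root, url: str, u: int, c: int, leaf_factory) -> bool:
    """Return True if the URL is new (and record it), False if it was crawled before.
    Locates the target leaf; if the leaf already holds u URLs it is split first."""
    parent, index, node = None, None, root
    while isinstance(node, InternalNode):
        parent, index = node, node.loc_hash(url)
        node = node.children[index]
    if url in node.bloom:                          # already crawled: skip this URL
        return False
    if len(node.url_set) >= u:                     # leaf full: split, then insert one layer down
        dsbf_split(parent, index, c, leaf_factory, layer_salt=f"split-{index}")
        dsbf_insert(parent.children[index], url)
    else:
        node.url_set.add(url)
        node.bloom.add(url)
    return True
```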
(6) When multiple web crawlers are used to collect web pages in parallel in a distributed network environment, it suffices to adopt a suitable distribution strategy for the dynamically splittable Bloom Filter, for example distributing the multiple leaf Bloom Filters (and their associated URL sets) over multiple computers, so that each web crawler can be deployed separately on one computer. When performing URL deduplication, each web crawler then only needs to be responsible for the operations on a few (usually one) leaf Bloom Filters and their associated URL sets (realized with a database); this greatly improves the efficiency of URL deduplication and lets each web crawler devote a higher proportion of its processing time to the actual crawling, collection and analysis of web pages.
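One conceivable distribution strategy, sketched purely as an assumption (the worker addresses and the routing rule are not specified in the patent), is to let the root LocHash double as the shard router so that each crawler owns one first-layer leaf:

```python
# Hypothetical shard map: first-layer leaf i lives on worker i's Redis instance.
WORKERS = ["redis://worker0:6379", "redis://worker1:6379", "redis://worker2:6379",
           "redis://worker3:6379", "redis://worker4:6379"]


def owner_of(url: str, loc_hash) -> str:
    """Route a URL to the worker that owns the leaf chosen by the root LocHash."""
    return WORKERS[loc_hash(url)]


# A crawler deployed on worker i only deduplicates URLs routed to WORKERS[i];
# URLs that hash to another leaf are forwarded to that leaf's crawler instead.
```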

Claims (8)

1. A web crawler URL deduplication method based on a dynamically splittable Bloom Filter, characterized in that it comprises the following steps:
(1) first, a dynamically splittable Bloom Filter is constructed, the bit array of each of its leaf Bloom Filters being stored in a Redis database; according to the scale and characteristics of the web crawler application, the length m of the bit array of each leaf Bloom Filter is determined so that it is smaller than the memory size of the computer running the web crawler;
(2) combining the application requirements and crawling strategy of the web crawler, the organization rules of the websites and web pages to be crawled on the Internet are analyzed so as to estimate the number n of URLs the web crawler application needs to process; then, from m, n and the false positive rate upper bound f, the number c of first-layer nodes of the dynamically splittable Bloom Filter, the number k of hash mapping functions required by each leaf Bloom Filter, and the maximum number u of URLs each leaf Bloom Filter can store are calculated;
(3) according to the value of the parameter m, k hash functions with codomain {0, 1, 2, ..., m-1} are chosen, and they serve as the hash mapping functions of the c leaf Bloom Filters of the first layer; at the same time, a hash locating function with codomain {0, 1, 2, ..., c-1} is chosen for the root node of the dynamically splittable Bloom Filter, which can in turn be implemented by taking a hash function modulo c;
(4) from the parameters m, n, f, c, k and u above, and the k hash mapping functions and 1 hash locating function determined, a dynamically splittable Bloom Filter is obtained by initialization, and the URL set associated with each of its leaf Bloom Filters is realized with a database;
(5) based on the dynamically splittable Bloom Filter, URL deduplication for the web crawler is realized;
(6) when multiple web crawlers are used to collect web pages in parallel in a distributed network environment, a distribution strategy is adopted for the dynamically splittable Bloom Filter so that each web crawler can be deployed separately on one computer; when performing URL deduplication, each web crawler is then only responsible for the operations on one or several leaf Bloom Filters and their associated URL sets.
2. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure and uses two kinds of hash functions: one is the hash mapping function of the Bloom Filter, and the other is the hash locating function; one hash locating function is stored in the tree root; in every layer except the root, each leaf node holds a Bloom Filter, called a leaf Bloom Filter, which records the URL information the web crawler has already crawled; each non-leaf node holds a hash locating function, which further routes a URL to the corresponding child node in the next layer down; in the dynamically splittable Bloom Filter, to support flexible splitting of leaf nodes, each leaf Bloom Filter is associated with a URL set whose elements are all URLs that have been inserted into that leaf Bloom Filter.
3. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that, in the initialization of the dynamically splittable Bloom Filter, the number of nodes in the first layer is first determined from an estimate of the number of URLs to be processed; the calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f; from these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter; then, a hash locating function is selected and stored in the root node of the tree; finally, c leaf Bloom Filters are placed in the first layer of the tree.
4. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the concrete method for realizing URL deduplication for the web crawler is as follows: when the web crawler is about to crawl a URL, it first queries this URL in the dynamically splittable Bloom Filter to judge whether it has already been crawled; if so, this URL is skipped (i.e. the URL is deduplicated), and the next URL is processed instead; if not, the URL is a new URL not yet stored in the dynamically splittable Bloom Filter and must now be inserted into some leaf Bloom Filter; so it is first judged whether the number of URLs stored in that leaf Bloom Filter has reached the upper limit u; if so, that leaf Bloom Filter is first split and the URL is then inserted into the split dynamically splittable Bloom Filter; if not, the URL is directly inserted into that leaf Bloom Filter.
5. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 4, characterized in that, when the web crawler inserts URL information into the dynamically splittable Bloom Filter, h hash locating computations must be performed on the URL to determine into which leaf Bloom Filter this URL information is inserted; the concrete insertion process is as follows: first, the hash locating function of the root node is applied to the URL to determine in which child node of the root the URL information is to be stored; then it is checked whether that child node is a leaf node; if so, the URL is inserted into the URL set associated with that leaf Bloom Filter, and at the same time the URL information is inserted into the bit array of the leaf Bloom Filter via the hash mapping functions; if not, the hash locating function of that non-leaf node is used to continue the above process, until the URL is finally located in, and inserted into, a leaf Bloom Filter and its associated URL set.
6. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 4, characterized in that, in the splitting process of the dynamically splittable Bloom Filter, when the number of URLs stored by some leaf Bloom Filter of the dynamically splittable Bloom Filter (denote it node l) exceeds u, the false positive rate of that leaf Bloom Filter would exceed f, so leaf node l must be split; the concrete splitting process is as follows: first, c new leaf Bloom Filters are added in the layer below the node l to be split, and they become the child nodes of node l; then a hash locating function is added to the node l to be split; next, all URLs in the URL set associated with node l are hashed again with this hash locating function, so that each of these URLs is relocated and inserted into the corresponding child node of node l; finally, the leaf Bloom Filter stored in node l and its associated URL set are deleted, and only the hash locating function in node l is retained.
7. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that, when the web crawler queries URL information in the dynamically splittable Bloom Filter, the procedure is similar to insertion; first, the dynamically splittable Bloom Filter performs h hash locating computations on the input URL to be checked, to determine in which leaf Bloom Filter this URL information would be stored; then a concrete query operation is performed on the URL in that leaf Bloom Filter according to the query method of the classical Bloom Filter.
8. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the dynamically splittable Bloom Filter is organized as a tree and uses hash locating functions to determine in which node of the tree a URL is stored; to make the leaf Bloom Filters of the tree carry the storage load as evenly as possible, different hash locating functions should be chosen in different layers, so as to avoid overly similar hash results across layers; within the same layer, however, the hash locating functions of the non-leaf nodes can be identical; similar to the implementation of the hash mapping functions, a hash locating function can also be implemented by taking a common hash function modulo c; usually m is much larger than c, so the hash locating function and the hash mapping function can be based on the same hash function, implemented by taking it modulo c and modulo m respectively.
CN201510185467.2A 2015-04-17 2015-04-17 Web crawler URL deduplication method based on a dynamically splittable Bloom Filter Active CN104809182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510185467.2A CN104809182B (en) 2015-04-17 2015-04-17 Web crawler URL deduplication method based on a dynamically splittable Bloom Filter


Publications (2)

Publication Number Publication Date
CN104809182A true CN104809182A (en) 2015-07-29
CN104809182B CN104809182B (en) 2016-08-17

Family

ID=53694004


Country Status (1)

Country Link
CN (1) CN104809182B (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐娜等 (Xu Na et al.): "基于 Bloom Filter 的网页去重算法" ("A Bloom Filter-based web page deduplication algorithm"), 《微电脑应用》 (Microcomputer Applications) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570025B (en) * 2015-10-10 2020-09-11 北京国双科技有限公司 Data filtering method and device
CN106570025A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Data filtering method and device
CN106096008B (en) * 2016-06-23 2021-01-05 北京工业大学 Web crawler method for financial warehouse receipt wind control
CN106096008A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of web crawlers method for finance warehouse receipt wind control
CN106886602A (en) * 2017-03-02 2017-06-23 上海斐讯数据通信技术有限公司 A kind of application crawler method and system
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN107844527A (en) * 2017-10-13 2018-03-27 平安科技(深圳)有限公司 Web page address De-weight method, electronic equipment and computer-readable recording medium
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
CN108804242A (en) * 2018-05-23 2018-11-13 武汉斗鱼网络科技有限公司 A kind of data counts De-weight method, system, server and storage medium
CN108804242B (en) * 2018-05-23 2022-03-22 武汉斗鱼网络科技有限公司 Data counting and duplicate removal method, system, server and storage medium
CN108874941A (en) * 2018-06-04 2018-11-23 成都知道创宇信息技术有限公司 Big data URL De-weight method based on convolution feature and multiple Hash mapping
CN108874941B (en) * 2018-06-04 2021-09-21 成都知道创宇信息技术有限公司 Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping
CN109635182A (en) * 2018-12-21 2019-04-16 全通教育集团(广东)股份有限公司 Parallelization data tracking method based on educational information theme
CN111666267A (en) * 2019-03-05 2020-09-15 国家计算机网络与信息安全管理中心 Data cleaning method and device and terminal equipment
CN111930923A (en) * 2020-07-02 2020-11-13 上海微亿智造科技有限公司 Bloom filter system and filtering method
CN111930923B (en) * 2020-07-02 2021-07-30 上海微亿智造科技有限公司 Bloom filter system and filtering method
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN113055829A (en) * 2021-03-16 2021-06-29 深圳职业技术学院 Privacy protection method and device for network broadcast information and readable storage medium
US11741258B2 (en) 2021-04-16 2023-08-29 International Business Machines Corporation Dynamic data dissemination under declarative data subject constraints

Also Published As

Publication number Publication date
CN104809182B (en) 2016-08-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant