CN104809182A - Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) - Google Patents

Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)

Info

Publication number
CN104809182A
Authority
CN
China
Legal status
Granted
Application number
CN201510185467.2A
Other languages
Chinese (zh)
Other versions
CN104809182B (en)
Inventor
杨鹏
袁志伟
刘旋
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
2015-04-17
Filing date
2015-04-17
Publication date
2015-07-29
Application filed by Southeast University
Priority to CN201510185467.2A
Publication of CN104809182A
Application granted
Publication of CN104809182B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a method for web crawler URL (uniform resource locator) deduplication based on a DSBF (dynamic splitting Bloom Filter). Unlike the fixed-structure Bloom Filters that evenly carry the URL storage load in the Internet Archive crawler and the Apoide crawler, the DSBF has a dynamically extensible structure that can be flexibly split into multiple layers as needed. The method has the following advantages: as the number of processed URLs keeps growing, the false positive rate of the Bloom Filter can still be kept within a set range, and the Bloom Filter has a flexible storage structure that is easy to distribute. The method is therefore well suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.

Description

Web crawler URL deduplication method based on a dynamically splittable Bloom Filter
Technical field
The present invention relates to a URL deduplication method for web crawlers. The method can be used to implement large-scale, distributed, high-performance web crawler applications. Specifically, it is a web crawler URL deduplication method based on a dynamically splittable Bloom Filter, and belongs to the field of Internet technology.
Background technology
A web crawler (Web Crawler) is an important component of many Internet information acquisition systems. Starting from the URLs of web pages, it automatically crawls pages on the Internet according to certain rules. Because the number of URLs on the Internet runs into the hundreds of millions, and different URLs link to one another, a web crawler must, in order to avoid repeatedly crawling the same URL, determine during crawling whether the URL it is about to crawl has already been crawled; this process is called URL deduplication. The key to URL deduplication is how to store the information about already-crawled URLs in a set while guaranteeing that this set has good query performance. Whether URL deduplication is efficient is a key factor affecting the efficiency of the information acquisition system.
Bloom Filter is an effective solution that supports URL deduplication. A classical Bloom Filter consists of one binary bit array and k hash mapping functions (denoted MapHash); its structure is shown in Fig. 1, and it can be used to quickly test whether an element belongs to a given set. Fig. 1 shows a classical Bloom Filter whose bit array has length 12 and which uses 3 hash mapping functions. When inserting x_i into this Bloom Filter, the 3 hash mapping functions are first applied to x_i to produce 3 hash values; assuming the results are 0, 3 and 6, it suffices to set positions 0, 3 and 6 of the bit array to 1. Inserting x_j proceeds in the same way. When querying x_i, the same 3 hash mapping functions are applied to x_i, and the corresponding positions of the bit array are checked to see whether they are all 1. As the figure shows, x_i is in this Bloom Filter, while x_k is not. If the bit array of a Bloom Filter has length m, the codomain of each of its hash mapping functions is S = {0, 1, 2, ..., m-1}; a hash mapping function can be implemented directly by taking a common hash function (such as MD5, SHA, MurmurHash or BKDRHash) modulo m. In general, when a data element is inserted into a data set, the k hash mapping functions of the Bloom Filter are applied to the element to produce k hash values, and all bits of the bit array whose indices equal these k hash values are set to 1. Accordingly, to query whether a data element is in the data set, it suffices to apply the k hash mapping functions to the element and then check the values of the k bits of the Bloom Filter's bit array corresponding to these k hash values; the element is judged to be in the set only if they are all 1. Overall, the Bloom Filter is a data structure with very high space efficiency and time efficiency. However, if the k bits corresponding to an element's k hash values have all been set to 1 by other elements inserted earlier, a false positive occurs when that element is queried. The false positive rate of a Bloom Filter depends mainly on factors such as the size of the set, the length of the bit array, and the number of hash mapping functions.
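To illustrate the classical structure just described (an illustrative sketch, not code from the patent; the class and method names are chosen here for clarity), a minimal Python Bloom Filter whose k mapping functions are derived from a common hash function taken modulo m might look as follows:

```python
import hashlib


class ClassicalBloomFilter:
    """Minimal classical Bloom Filter: one bit array of length m and k
    hash mapping functions (MapHash), each a salted base hash taken mod m."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _map_hash(self, item: str, i: int) -> int:
        # i-th MapHash: salt the input with the function index, hash, take mod m
        digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for i in range(self.k):
            self.bits[self._map_hash(item, i)] = 1

    def __contains__(self, item: str) -> bool:
        # True only if all k mapped bits are 1 (may still be a false positive)
        return all(self.bits[self._map_hash(item, i)] for i in range(self.k))


if __name__ == "__main__":
    bf = ClassicalBloomFilter(m=12, k=3)
    bf.add("http://example.com/a")
    print("http://example.com/a" in bf)  # True
    print("http://example.com/b" in bf)  # False (with high probability)
```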
At present, the existing URL deduplication schemes based on Bloom Filters have shortcomings in efficiency, scalability and performance. For example, the Internet Archive crawler uses a 32KB Bloom Filter to store all URLs of each website. The Apoide crawler uses an 8KB Bloom Filter to store all URLs of each website, and decides in which Bloom Filter a given URL is stored according to the hash value of the website's domain name. In general, the organization of the Bloom Filters in these schemes is relatively fixed. For some large portal websites (such as the Sina website or www.xinhuanet.com), the URLs under the domain name are so numerous that a fixed-size Bloom Filter of limited length is usually insufficient to store them; for some small websites, on the other hand, using an oversized Bloom Filter inevitably wastes storage space. In addition, some information acquisition systems with high performance requirements often deploy a large number of web crawlers in a distributed network environment to crawl pages in parallel. For the URL deduplication problem in this application scenario, the existing Bloom Filter and most of its improved variants are difficult to adapt.
Summary of the invention
Objects of the invention: in view of the problems and deficiencies of the prior art, the invention provides a web crawler URL deduplication method based on a dynamically splittable Bloom Filter. The method is built on a dynamically splittable Bloom Filter (abbreviated DSBF). Unlike the fixed-structure Bloom Filters that evenly carry the URL storage load in the Internet Archive crawler and the Apoide crawler, the DSBF has a dynamically extensible structure that can be flexibly split into multiple layers as needed. Implementing web crawler URL deduplication on the basis of a dynamically splittable Bloom Filter both ensures that, as the number of processed URLs keeps growing, the false positive rate of the Bloom Filter can still be kept within a given range, and gives the Bloom Filter a flexible storage structure that is easy to distribute. It is therefore better suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.
Technical scheme: a web crawler URL deduplication method based on a dynamically splittable Bloom Filter, comprising:
(1) The dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure. Two kinds of hash functions are used in the dynamically splittable Bloom Filter: one is the hash mapping function (MapHash) of the classical Bloom Filter, and the other is a newly introduced hash locating function (denoted LocHash). One hash locating function is stored in the tree root (denoted layer 0). In every layer except the root, each leaf node holds a classical Bloom Filter (called a leaf Bloom Filter), which records the URL information the web crawler has already crawled; each non-leaf node holds a hash locating function, which further routes a URL to the corresponding child node in the next layer down. In the dynamically splittable Bloom Filter, to support flexible splitting of leaf nodes, each leaf Bloom Filter is associated with a URL set whose elements are all URLs that have been inserted into that leaf Bloom Filter. The web crawler URL deduplication method based on the dynamically splittable Bloom Filter mainly involves four kinds of operations: initialization, insertion, splitting and querying.
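To make the tree-shaped structure concrete, the following minimal Python sketch (purely illustrative; the class and helper names are assumptions, not taken from the patent) models the two node types: a leaf node holds a Bloom Filter plus its associated URL set, and a non-leaf node holds a LocHash that routes each URL to one of its children.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Set, Union


def make_loc_hash(c: int, salt: str) -> Callable[[str], int]:
    """Hash locating function (LocHash) with codomain {0, ..., c-1}:
    a salted base hash taken modulo the number of child nodes."""
    def loc_hash(url: str) -> int:
        digest = hashlib.md5(f"{salt}:{url}".encode("utf-8")).hexdigest()
        return int(digest, 16) % c
    return loc_hash


@dataclass
class LeafNode:
    """A leaf Bloom Filter plus the associated URL set kept to allow later splitting."""
    bloom: object                       # e.g. the ClassicalBloomFilter sketched earlier
    url_set: Set[str] = field(default_factory=set)


@dataclass
class InternalNode:
    """A non-leaf node: a LocHash that routes each URL to one of its child nodes."""
    loc_hash: Callable[[str], int]
    children: List[Union["InternalNode", LeafNode]]


def init_dsbf(c: int, leaf_factory: Callable[[], object]) -> InternalNode:
    """Layer 0 (root) holds one LocHash; layer 1 holds c leaf Bloom Filters."""
    leaves = [LeafNode(leaf_factory()) for _ in range(c)]
    return InternalNode(loc_hash=make_loc_hash(c, salt="layer1"), children=leaves)
```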
(2) Initialization of the dynamically splittable Bloom Filter. First, the number of nodes in the first layer is determined from an estimate of the number of URLs to be processed. The calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f. From these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter. Then, a hash locating function (LocHash) with codomain {0, 1, 2, ..., c-1} is selected and stored in the root node of the tree. Finally, c leaf Bloom Filters are placed in the first layer of the tree.
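The concrete formulas for c, u and k are given as figures in the original publication and are not reproduced in this text. For orientation only, the sketch below computes the three parameters from the standard Bloom Filter false-positive analysis (f ≈ (1 − e^(−ku/m))^k with k close to (m/u)·ln 2); these expressions are an assumption of this sketch and may differ in detail from the patent's own formulas.

```python
import math


def dsbf_parameters(n: int, m: int, f: float):
    """Estimate DSBF initialization parameters from n, m and f using the
    standard Bloom Filter analysis (an assumption of this sketch).

    n: estimated number of URLs to store
    m: bit array length of each leaf Bloom Filter
    f: required upper bound on the false positive rate
    """
    k = math.ceil(math.log(1 / f, 2))                         # MapHash functions per leaf
    u = math.floor(m * (math.log(2) ** 2) / math.log(1 / f))  # max URLs per leaf at rate f
    c = math.ceil(n / u)                                      # first-layer leaf Bloom Filters
    return c, u, k


# Example: 1,000,000 URLs, leaves of 2*10**7 bits, false positive rate below 1/10000
print(dsbf_parameters(n=1_000_000, m=2 * 10**7, f=1e-4))
```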
(3) Insertion into the dynamically splittable Bloom Filter. When the web crawler inserts URL information into the dynamically splittable Bloom Filter, h hash locating computations must be performed on the URL (h >= 1; h is the length of the path in the tree from the root node to a leaf node) to determine into which leaf Bloom Filter this URL information is inserted. The concrete insertion process is as follows: first, the hash locating function of the root node is applied to the URL to determine in which child node of the root the URL information is to be stored. Then, it is checked whether that child node is a leaf node. If so, the URL is inserted into the URL set associated with that leaf Bloom Filter, and at the same time the URL information is inserted into the bit array of the leaf Bloom Filter via the hash mapping functions. If not, the hash locating function of that non-leaf node is used to continue the above process, until the URL is finally located in, and inserted into, a leaf Bloom Filter and its associated URL set.
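Continuing the illustrative sketch above (the InternalNode/LeafNode classes are assumed from that sketch, not from the patent), the insertion step could be written as:

```python
def dsbf_insert(root, url: str) -> None:
    """Insert a URL: walk down via the LocHash of each non-leaf node, then add
    the URL to the leaf's associated URL set and to its bit array (MapHash)."""
    node = root
    while isinstance(node, InternalNode):
        node = node.children[node.loc_hash(url)]  # one hash locating computation per layer
    node.url_set.add(url)   # remember the URL so the leaf can be split later
    node.bloom.add(url)     # set the k MapHash bits in the leaf Bloom Filter
```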
(4) Splitting of the dynamically splittable Bloom Filter. When the number of URLs stored by some leaf Bloom Filter of the dynamically splittable Bloom Filter (denote it node l) exceeds u, the false positive rate of that leaf Bloom Filter would exceed f, so leaf node l must be split. The concrete splitting process is as follows: first, c new leaf Bloom Filters are added in the layer below the node l to be split, and they become the child nodes of node l. Then a hash locating function is added to the node l to be split. Next, all URLs in the URL set associated with node l are hashed again with this hash locating function, so that each of these URLs is relocated and inserted into the corresponding child node (leaf Bloom Filter) of node l. Finally, the leaf Bloom Filter stored in node l and its associated URL set are deleted, and only the hash locating function in node l is retained.
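A corresponding splitting step, again continuing the same illustrative sketch (make_loc_hash, InternalNode and LeafNode are the assumed helpers defined there), might look like this:

```python
def dsbf_split(parent, index: int, c: int, leaf_factory, layer_salt: str) -> None:
    """Split the over-full leaf parent.children[index]: it becomes an internal
    node with its own LocHash and c fresh leaf Bloom Filters, and every URL in
    its associated URL set is re-hashed and re-inserted one layer down."""
    old_leaf = parent.children[index]
    new_node = InternalNode(
        loc_hash=make_loc_hash(c, salt=layer_salt),   # a different LocHash for the new layer
        children=[LeafNode(leaf_factory()) for _ in range(c)],
    )
    for url in old_leaf.url_set:                      # hash again and relocate every stored URL
        child = new_node.children[new_node.loc_hash(url)]
        child.url_set.add(url)
        child.bloom.add(url)
    parent.children[index] = new_node                 # old leaf and its URL set are discarded
```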
(5) Querying the dynamically splittable Bloom Filter. When the web crawler queries URL information in the dynamically splittable Bloom Filter, the procedure is similar to insertion. First, the dynamically splittable Bloom Filter performs h (h >= 1) hash locating computations on the input URL to be checked, to determine in which leaf Bloom Filter this URL information would be stored. Then a concrete query operation is performed on the URL in that leaf Bloom Filter according to the query method of the classical Bloom Filter.
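The query step, in the same illustrative setting, simply repeats the locating walk of insertion and then applies the classical membership test in the leaf:

```python
def dsbf_contains(root, url: str) -> bool:
    """Query a URL: locate the leaf exactly as insertion does, then apply the
    classical Bloom Filter membership test in that leaf."""
    node = root
    while isinstance(node, InternalNode):
        node = node.children[node.loc_hash(url)]
    return url in node.bloom   # may return a false positive, never a false negative
```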
(6) Choice of hash functions. The dynamically splittable Bloom Filter is organized as a tree, and it uses hash locating functions to determine in which node of the tree a URL is stored. To make the leaf Bloom Filters of the tree carry the storage load as evenly as possible, different hash locating functions should be chosen in different layers (they can be based on different hash functions such as MD5, SHA, MurmurHash or BKDRHash), so as to avoid overly similar hash results across layers; within the same layer, however, the hash locating functions of the non-leaf nodes can be identical. Similar to the implementation of the hash mapping functions, a hash locating function can also be implemented by taking a common hash function modulo c. Usually m is much larger than c (for example, when the required false positive rate is below 1/10000, storing 1,000,000 URLs requires m to be at least 2×10^7, while c generally does not exceed 100), so the hash locating function and the hash mapping function can be based on the same hash function, implemented by taking it modulo c and modulo m respectively.
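As a small illustration of this choice (an assumption of this sketch: MD5 serves as the common base hash), a LocHash and a MapHash derived from the same base hash by taking it modulo c and modulo m could be built as follows:

```python
import hashlib


def base_hash(value: str) -> int:
    """A common base hash (MD5 here) interpreted as an integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def make_functions(c: int, m: int):
    """Derive a LocHash (codomain {0..c-1}) and a MapHash (codomain {0..m-1})
    from the same base hash by taking it modulo c and modulo m respectively."""
    def loc_hash(url: str) -> int:
        return base_hash(url) % c

    def map_hash(url: str) -> int:
        return base_hash(url) % m

    return loc_hash, map_hash


loc_hash, map_hash = make_functions(c=5, m=2 * 10**7)
print(loc_hash("http://example.com/page"), map_hash("http://example.com/page"))
```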
Beneficial effects: the dynamically splittable Bloom Filter not only keeps the advantages of the classical Bloom Filter, with outstanding time efficiency and space efficiency, but its tree-shaped storage structure also allows it to split flexibly into multiple layers as needed and thus dynamically extend its storage capacity. By spreading the URL deduplication task evenly over multiple leaf Bloom Filters, the web crawler URL deduplication method based on the dynamically splittable Bloom Filter is better suited to building large-scale, distributed, multi-crawler parallel processing environments, and supports the efficient collection and processing of massive web page information on the Internet.
Brief description of the drawings
Fig. 1 shows a classical Bloom Filter;
Fig. 2 shows the tree-structured dynamically splittable Bloom Filter obtained after initialization;
Fig. 3 shows the dynamically splittable Bloom Filter during the splitting process, where the node being split is the 4th node of the first layer, denoted DSBF_(root,3); the node DSBF_(root,3) being split still keeps its original leaf Bloom Filter and at the same time gains a new hash locating function, denoted LocHash_(root,3), used to locate URLs among the 5 nodes of the second layer;
Fig. 4 shows the dynamically splittable Bloom Filter after the splitting process has finished; its node DSBF_(root,3) stores the hash locating function LocHash_(root,3), and in subsequent operations every URL located to node DSBF_(root,3) from the layer above is hashed again by LocHash_(root,3) and thereby directed on to the child nodes of DSBF_(root,3) in the next layer.
Embodiment
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are only intended to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms of the present invention by those skilled in the art all fall within the scope defined by the claims appended to this application.
A web crawler URL deduplication method based on a dynamically splittable Bloom Filter comprises:
(1) First, a dynamically splittable Bloom Filter is constructed, and the bit array of each of its leaf Bloom Filters is stored in a Redis database. Redis is an in-memory database with excellent read/write performance, but its performance drops sharply when the stored content approaches or exceeds the memory size. Therefore, according to the scale and characteristics of the web crawler application, the length m of the bit array of each leaf Bloom Filter is first determined so that it is smaller than the memory size of the computer running the web crawler. The dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure, as shown in Fig. 2 and Fig. 4.
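One possible realization of a Redis-backed leaf bit array, sketched here under the assumption that the redis-py client is used (the key names, host and parameters are illustrative, not specified in the patent):

```python
import hashlib
import redis


class RedisLeafBloomFilter:
    """A leaf Bloom Filter whose bit array lives in Redis, using SETBIT/GETBIT
    on a single key; m must stay well below the memory of the Redis host."""

    def __init__(self, client: redis.Redis, key: str, m: int, k: int):
        self.client, self.key, self.m, self.k = client, key, m, k

    def _map_hash(self, url: str, i: int) -> int:
        return int(hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest(), 16) % self.m

    def add(self, url: str) -> None:
        for i in range(self.k):
            self.client.setbit(self.key, self._map_hash(url, i), 1)

    def __contains__(self, url: str) -> bool:
        return all(self.client.getbit(self.key, self._map_hash(url, i))
                   for i in range(self.k))


# Hypothetical usage: one Redis key per leaf Bloom Filter.
r = redis.Redis(host="localhost", port=6379)
leaf = RedisLeafBloomFilter(r, key="dsbf:leaf:0", m=2 * 10**7, k=14)
leaf.add("http://example.com/")
print("http://example.com/" in leaf)
```

A leaf realized this way can stand in for the in-memory ClassicalBloomFilter in the node sketch given earlier.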
Fig. 2 shows the tree-structured dynamically splittable Bloom Filter obtained after initialization. The root node of the tree holds a hash locating function, and the first layer of the tree contains 5 leaf Bloom Filters, each of which is a classical Bloom Filter. When the web crawler inserts or queries a URL, it must first determine the concrete storage location of the URL according to the hash locating function in the root node. Fig. 4 shows the dynamically splittable Bloom Filter after the splitting process has finished; its node DSBF_(root,3) stores the hash locating function LocHash_(root,3). In subsequent operations, every URL located to node DSBF_(root,3) from the layer above is hashed again by LocHash_(root,3) and thereby directed on to the child nodes of DSBF_(root,3) in the next layer.
(2) Initialization of the dynamically splittable Bloom Filter. First, the number of nodes in the first layer is determined from an estimate of the number of URLs to be processed. The calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f. From these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter. Then, a hash locating function with codomain {0, 1, 2, ..., c-1} is selected and stored in the root node of the tree. Finally, c leaf Bloom Filters are placed in the first layer of the tree.
Combining the application requirements and crawling strategy of the web crawler, the organization rules of the websites and web pages to be crawled on the Internet are analyzed so as to estimate the number n of URLs the web crawler application needs to process. Then, from m, n and the false positive rate upper bound f, following the description of (2) in the technical scheme, the parameters of the dynamically splittable Bloom Filter are calculated: the number c of first-layer nodes, the number k of hash mapping functions required by each leaf Bloom Filter, and the maximum number u of URLs each leaf Bloom Filter can store.
The dynamically splittable Bloom Filter with c = 5 obtained by this initialization procedure is shown in Fig. 2.
(3) According to the value of the parameter m, k hash functions with codomain {0, 1, 2, ..., m-1} are chosen; they serve as the hash mapping functions of the c leaf Bloom Filters of the first layer. A hash mapping function can in turn be implemented by taking a common hash function (such as MD5, SHA, MurmurHash or BKDRHash) modulo m. At the same time, a hash locating function with codomain {0, 1, 2, ..., c-1} is chosen for the root node of the dynamically splittable Bloom Filter; it can in turn be implemented by taking a common hash function modulo c.
(4) From the parameters m, n, f, c, k and u above, and the k hash mapping functions and 1 hash locating function determined, a dynamically splittable Bloom Filter is obtained by initialization, and the URL set associated with each of its leaf Bloom Filters is realized with a database (such as MySQL).
(5) Based on the dynamically splittable Bloom Filter constructed by the preceding method, URL deduplication for the web crawler is realized. The concrete method is as follows: when the web crawler is about to crawl a URL, it first queries this URL in the dynamically splittable Bloom Filter to judge whether it has already been crawled. If so, this URL is skipped (i.e. the URL is deduplicated), and the next URL is processed instead. If not, the URL is a new URL not yet stored in the dynamically splittable Bloom Filter, and it must now be inserted into some leaf Bloom Filter. So it is first judged whether the number of URLs stored in that leaf Bloom Filter has reached the upper limit u. If so, that leaf Bloom Filter is first split according to the description of (4) in the technical scheme, and the URL is then inserted into the split dynamically splittable Bloom Filter according to the description of (3) in the technical scheme. If not, the URL is directly inserted into that leaf Bloom Filter according to the description of (3) in the technical scheme.
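Putting the earlier illustrative pieces together (dsbf_split, dsbf_insert, InternalNode and the leaf classes are the assumed helpers from the sketches above, not code from the patent), the crawl-time deduplication decision could be expressed as:

```python
def deduplicate(root, url: str, u: int, c: int, leaf_factory) -> bool:
    """Return True if the URL is new (and record it), False if it was crawled before.
    Locates the target leaf; if the leaf already holds u URLs it is split first."""
    parent, index, node = None, None, root
    while isinstance(node, InternalNode):
        parent, index = node, node.loc_hash(url)
        node = node.children[index]
    if url in node.bloom:                          # already crawled: skip this URL
        return False
    if len(node.url_set) >= u:                     # leaf full: split, then insert one layer down
        dsbf_split(parent, index, c, leaf_factory, layer_salt=f"split-{index}")
        dsbf_insert(parent.children[index], url)
    else:
        node.url_set.add(url)
        node.bloom.add(url)
    return True
```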
(6) When multiple web crawlers are used to collect web pages in parallel in a distributed network environment, it suffices to adopt a suitable distribution strategy for the dynamically splittable Bloom Filter, for example distributing the multiple leaf Bloom Filters (and their associated URL sets) over multiple computers, so that each web crawler can be deployed separately on one computer. When performing URL deduplication, each web crawler then only needs to be responsible for the operations on a few (usually one) leaf Bloom Filters and their associated URL sets (realized with a database); this greatly improves the efficiency of URL deduplication and lets each web crawler devote a higher proportion of its processing time to the actual crawling, collection and analysis of web pages.
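One conceivable distribution strategy, sketched purely as an assumption (the worker addresses and the routing rule are not specified in the patent), is to let the root LocHash double as the shard router so that each crawler owns one first-layer leaf:

```python
# Hypothetical shard map: first-layer leaf i lives on worker i's Redis instance.
WORKERS = ["redis://worker0:6379", "redis://worker1:6379", "redis://worker2:6379",
           "redis://worker3:6379", "redis://worker4:6379"]


def owner_of(url: str, loc_hash) -> str:
    """Route a URL to the worker that owns the leaf chosen by the root LocHash."""
    return WORKERS[loc_hash(url)]


# A crawler deployed on worker i only deduplicates URLs routed to WORKERS[i];
# URLs that hash to another leaf are forwarded to that leaf's crawler instead.
```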

Claims (8)

1. A web crawler URL deduplication method based on a dynamically splittable Bloom Filter, characterized in that it comprises the following steps:
(1) first, a dynamically splittable Bloom Filter is constructed, the bit array of each of its leaf Bloom Filters being stored in a Redis database; according to the scale and characteristics of the web crawler application, the length m of the bit array of each leaf Bloom Filter is determined so that it is smaller than the memory size of the computer running the web crawler;
(2) combining the application requirements and crawling strategy of the web crawler, the organization rules of the websites and web pages to be crawled on the Internet are analyzed so as to estimate the number n of URLs the web crawler application needs to process; then, from m, n and the false positive rate upper bound f, the number c of first-layer nodes of the dynamically splittable Bloom Filter, the number k of hash mapping functions required by each leaf Bloom Filter, and the maximum number u of URLs each leaf Bloom Filter can store are calculated;
(3) according to the value of the parameter m, k hash functions with codomain {0, 1, 2, ..., m-1} are chosen, and they serve as the hash mapping functions of the c leaf Bloom Filters of the first layer; at the same time, a hash locating function with codomain {0, 1, 2, ..., c-1} is chosen for the root node of the dynamically splittable Bloom Filter, which can in turn be implemented by taking a hash function modulo c;
(4) from the parameters m, n, f, c, k and u above, and the k hash mapping functions and 1 hash locating function determined, a dynamically splittable Bloom Filter is obtained by initialization, and the URL set associated with each of its leaf Bloom Filters is realized with a database;
(5) based on the dynamically splittable Bloom Filter, URL deduplication for the web crawler is realized;
(6) when multiple web crawlers are used to collect web pages in parallel in a distributed network environment, a distribution strategy is adopted for the dynamically splittable Bloom Filter so that each web crawler can be deployed separately on one computer; when performing URL deduplication, each web crawler is then only responsible for the operations on one or several leaf Bloom Filters and their associated URL sets.
2. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the dynamically splittable Bloom Filter adopts a tree-shaped hierarchical storage structure and uses two kinds of hash functions: one is the hash mapping function of the Bloom Filter, and the other is the hash locating function; one hash locating function is stored in the tree root; in every layer except the root, each leaf node holds a Bloom Filter, called a leaf Bloom Filter, which records the URL information the web crawler has already crawled; each non-leaf node holds a hash locating function, which further routes a URL to the corresponding child node in the next layer down; in the dynamically splittable Bloom Filter, to support flexible splitting of leaf nodes, each leaf Bloom Filter is associated with a URL set whose elements are all URLs that have been inserted into that leaf Bloom Filter.
3. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that, in the initialization of the dynamically splittable Bloom Filter, the number of nodes in the first layer is first determined from an estimate of the number of URLs to be processed; the calculation is as follows: suppose the number of URLs to be stored is n, the bit array length of each leaf Bloom Filter is m, and the false positive rate is required to be below f; from these, the number c of first-layer nodes of the dynamically splittable Bloom Filter is computed, together with the maximum number u of URLs each first-layer leaf Bloom Filter can store while its false positive rate does not exceed f, and the number k of hash mapping functions required by each leaf Bloom Filter; then, a hash locating function is selected and stored in the root node of the tree; finally, c leaf Bloom Filters are placed in the first layer of the tree.
4. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the concrete method for realizing URL deduplication for the web crawler is as follows: when the web crawler is about to crawl a URL, it first queries this URL in the dynamically splittable Bloom Filter to judge whether it has already been crawled; if so, this URL is skipped (i.e. the URL is deduplicated), and the next URL is processed instead; if not, the URL is a new URL not yet stored in the dynamically splittable Bloom Filter and must now be inserted into some leaf Bloom Filter; so it is first judged whether the number of URLs stored in that leaf Bloom Filter has reached the upper limit u; if so, that leaf Bloom Filter is first split and the URL is then inserted into the split dynamically splittable Bloom Filter; if not, the URL is directly inserted into that leaf Bloom Filter.
5. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 4, characterized in that, when the web crawler inserts URL information into the dynamically splittable Bloom Filter, h hash locating computations must be performed on the URL to determine into which leaf Bloom Filter this URL information is inserted; the concrete insertion process is as follows: first, the hash locating function of the root node is applied to the URL to determine in which child node of the root the URL information is to be stored; then it is checked whether that child node is a leaf node; if so, the URL is inserted into the URL set associated with that leaf Bloom Filter, and at the same time the URL information is inserted into the bit array of the leaf Bloom Filter via the hash mapping functions; if not, the hash locating function of that non-leaf node is used to continue the above process, until the URL is finally located in, and inserted into, a leaf Bloom Filter and its associated URL set.
6. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 4, characterized in that, in the splitting process of the dynamically splittable Bloom Filter, when the number of URLs stored by some leaf Bloom Filter of the dynamically splittable Bloom Filter (denote it node l) exceeds u, the false positive rate of that leaf Bloom Filter would exceed f, so leaf node l must be split; the concrete splitting process is as follows: first, c new leaf Bloom Filters are added in the layer below the node l to be split, and they become the child nodes of node l; then a hash locating function is added to the node l to be split; next, all URLs in the URL set associated with node l are hashed again with this hash locating function, so that each of these URLs is relocated and inserted into the corresponding child node of node l; finally, the leaf Bloom Filter stored in node l and its associated URL set are deleted, and only the hash locating function in node l is retained.
7. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that, when the web crawler queries URL information in the dynamically splittable Bloom Filter, the procedure is similar to insertion; first, the dynamically splittable Bloom Filter performs h hash locating computations on the input URL to be checked, to determine in which leaf Bloom Filter this URL information would be stored; then a concrete query operation is performed on the URL in that leaf Bloom Filter according to the query method of the classical Bloom Filter.
8. The web crawler URL deduplication method based on a dynamically splittable Bloom Filter as claimed in claim 1, characterized in that the dynamically splittable Bloom Filter is organized as a tree and uses hash locating functions to determine in which node of the tree a URL is stored; to make the leaf Bloom Filters of the tree carry the storage load as evenly as possible, different hash locating functions should be chosen in different layers, so as to avoid overly similar hash results across layers; within the same layer, however, the hash locating functions of the non-leaf nodes can be identical; similar to the implementation of the hash mapping functions, a hash locating function can also be implemented by taking a common hash function modulo c; usually m is much larger than c, so the hash locating function and the hash mapping function can be based on the same hash function, implemented by taking it modulo c and modulo m respectively.
CN201510185467.2A 2015-04-17 2015-04-17 Web crawler URL deduplication method based on a dynamically splittable Bloom Filter Active CN104809182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510185467.2A CN104809182B (en) 2015-04-17 2015-04-17 Web crawler URL deduplication method based on a dynamically splittable Bloom Filter


Publications (2)

Publication Number Publication Date
CN104809182A true CN104809182A (en) 2015-07-29
CN104809182B CN104809182B (en) 2016-08-17

Family

ID=53694004


Country Status (1)

Country Link
CN (1) CN104809182B (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐娜等 (Xu Na et al.): "基于 Bloom Filter 的网页去重算法" ("A Bloom Filter-based web page deduplication algorithm"), 《微电脑应用》 (Microcomputer Applications) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570025B (en) * 2015-10-10 2020-09-11 北京国双科技有限公司 Data filtering method and device
CN106570025A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Data filtering method and device
CN106096008B (en) * 2016-06-23 2021-01-05 北京工业大学 Web crawler method for financial warehouse receipt wind control
CN106096008A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of web crawlers method for finance warehouse receipt wind control
CN106886602A (en) * 2017-03-02 2017-06-23 上海斐讯数据通信技术有限公司 A kind of application crawler method and system
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN107844527A (en) * 2017-10-13 2018-03-27 平安科技(深圳)有限公司 Web page address De-weight method, electronic equipment and computer-readable recording medium
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
CN108804242A (en) * 2018-05-23 2018-11-13 武汉斗鱼网络科技有限公司 A kind of data counts De-weight method, system, server and storage medium
CN108804242B (en) * 2018-05-23 2022-03-22 武汉斗鱼网络科技有限公司 Data counting and duplicate removal method, system, server and storage medium
CN108874941A (en) * 2018-06-04 2018-11-23 成都知道创宇信息技术有限公司 Big data URL De-weight method based on convolution feature and multiple Hash mapping
CN108874941B (en) * 2018-06-04 2021-09-21 成都知道创宇信息技术有限公司 Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping
CN109635182A (en) * 2018-12-21 2019-04-16 全通教育集团(广东)股份有限公司 Parallelization data tracking method based on educational information theme
CN111666267A (en) * 2019-03-05 2020-09-15 国家计算机网络与信息安全管理中心 Data cleaning method and device and terminal equipment
CN111930923A (en) * 2020-07-02 2020-11-13 上海微亿智造科技有限公司 Bloom filter system and filtering method
CN111930923B (en) * 2020-07-02 2021-07-30 上海微亿智造科技有限公司 Bloom filter system and filtering method
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN113055829A (en) * 2021-03-16 2021-06-29 深圳职业技术学院 Privacy protection method and device for network broadcast information and readable storage medium
US11741258B2 (en) 2021-04-16 2023-08-29 International Business Machines Corporation Dynamic data dissemination under declarative data subject constraints

Also Published As

Publication number Publication date
CN104809182B (en) 2016-08-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant