CN107798106B - URL duplication removing method in distributed crawler system - Google Patents

URL duplication removing method in distributed crawler system

Info

Publication number
CN107798106B
CN107798106B (application CN201711047215.9A)
Authority
CN
China
Prior art keywords
bloom filter
url
hash
node
service node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711047215.9A
Other languages
Chinese (zh)
Other versions
CN107798106A (en)
Inventor
曾映方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Siyu Information Technology Co ltd
Original Assignee
Guangdong Siyu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Siyu Information Technology Co ltd filed Critical Guangdong Siyu Information Technology Co ltd
Priority to CN201711047215.9A priority Critical patent/CN107798106B/en
Publication of CN107798106A publication Critical patent/CN107798106A/en
Application granted granted Critical
Publication of CN107798106B publication Critical patent/CN107798106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a URL deduplication method for a distributed crawler system. Hash values are mapped into a continuous ring space of size 2^m; each node processes one segment of the Hash ring, and each node corresponds to a Bloom Filter structure. Deduplicating a URL consists of computing its Hash to find the corresponding server node, then judging from that node's Bloom Filter content whether the URL has already been seen. By combining consistent Hashing with Bloom Filters, the method can dynamically add Bloom Filter nodes as needed, keeps the false-positive rate of the Bloom Filters within a given bound as the number of URLs grows, and fully exploits the efficiency of Bloom Filters, making it suitable for building large-scale distributed web crawlers and supporting efficient capture of massive web content.

Description

URL duplication removing method in distributed crawler system
Technical Field
The invention relates to the technical field of networks, in particular to a URL duplicate removal method in a distributed crawler system.
Background
A web crawler is a program that automatically captures web content according to certain rules and is widely used across the Internet. A crawler starts downloading page content from designated URL addresses, extracts the URLs found there, and then continues downloading from those URLs. Since a newly extracted URL may already have been processed, processing it again causes repeated downloads and wastes bandwidth and computing resources. URL deduplication is therefore widely applied in network auditing systems and search engine systems. A distributed web crawler system must adopt a strategy for distributing URL tasks across multiple servers for parallel crawling, and the partitioning strategy must be efficient and easy to implement. In a distributed environment, a URL extracted on one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication mainly involves two concerns: URL storage space and URL matching speed. Storage space refers to the maximum number of distinct URLs that can be handled and the memory occupied per URL; matching speed is measured by the time it takes to determine whether a URL record is a duplicate.
The Bloom Filter is an efficient tool for URL deduplication. The main idea of the Bloom Filter deduplication scheme is as follows: a URL is mapped by several different Hash functions to different bits of one bit array, and the URL's acquisition state (whether it has been collected) is identified from the states of those bits. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to judge a URL's acquisition state; the URLs themselves need not be stored, so the occupied space is small and lookups are fast. However, when deciding whether an element belongs to a set, a Bloom Filter may misjudge an element that does not belong to the set as belonging to it. The Bloom Filter algorithm is therefore inexact and carries a certain error rate.
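The bit-array scheme just described can be sketched in a few lines (a minimal illustration, not the patent's implementation; the salted-MD5 hash construction and the parameter values are assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom Filter: k hash functions over an array of num_bits bits."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive k bit positions by salting one base hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # All k bits set -> "probably seen"; any clear bit -> definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

bf = BloomFilter(num_bits=1 << 20, num_hashes=7)
bf.add("http://example.com/page1")
assert "http://example.com/page1" in bf  # no false negatives are possible
```

Note that membership can only err in one direction: an added URL is always reported present, while a never-added URL is reported present only with the small false-positive probability discussed below.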
At present, existing Bloom-Filter-based URL deduplication schemes are deficient in efficiency, scalability, and performance. For example, the Internet Archive crawler uses a 32KB Bloom Filter per website to store all of that site's URLs. The amide crawler uses 8KB Bloom Filters per website and decides which Bloom Filter a URL is stored in according to the hash of the site's domain name. In general, the Bloom Filter organization in these schemes is fixed. For some large-scale web portals, the number of URLs under the domain name is so large that a fixed-size, limited-length Bloom Filter can hardly store them; for small websites, an oversized Bloom Filter wastes storage space. In addition, information acquisition systems with high performance requirements often deploy many web crawlers in a distributed environment to crawl pages in parallel, and for URL deduplication in such scenarios most existing Bloom Filters and their variants are ill-suited.
Consistent hashing is a special hash algorithm. With consistent hashing, a change in the number of hash-table slots requires remapping only K/n keys on average, where K is the number of keys and n the number of slots; in a conventional hash table, adding or deleting a slot remaps almost all keys. Consistent hashing maps each object to a point on the edge of a circle, and the system maps the available node machines onto different positions of the same circle. To find the machine for an object, the object's position on the circle is computed with the consistent hash function, and the circle is followed clockwise until a node machine is met; that machine is where the object is stored. When a node machine is removed, all objects stored on it move to the next machine on the circle; when a machine is added at some point on the circle, the next machine after that point hands over to the new machine the objects that now map to it. The distribution of objects across node machines can be tuned by adjusting the machines' positions.
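The clockwise lookup described above can be sketched as follows (a minimal illustration; SHA-1 as the ring hash, the 2^32 ring size, and the node names are assumptions):

```python
import bisect
import hashlib

RING_BITS = 32  # assumed ring size of 2**32 positions

def ring_hash(key: str) -> int:
    """Map a key (node name or URL) to a position on the ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << RING_BITS)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Place every node on the ring at the hash of its name.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        # Walk clockwise from the key's position to the first node, wrapping
        # around position zero if necessary.
        h = ring_hash(key)
        idx = bisect.bisect_right(self.points, (h, "\uffff"))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
owner = ring.lookup("http://example.com/page1")  # deterministic for a fixed ring
```

With this placement, adding a node takes over only part of one neighbour's segment, which is why only that neighbour's state needs migrating.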
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a URL deduplication method for a distributed crawler system that keeps the false-positive rate below a given threshold as the number of URLs keeps growing and fully exploits the efficiency of the Bloom Filter, making it well suited to building large-scale distributed web crawler systems.
In order to achieve the purpose, the invention provides the following technical scheme:
A URL duplicate removal method in a distributed crawler system comprises the following steps:
S1, taking the server cluster as a uniform resource pool, and mapping Hash values into a continuous ring space of size $2^m$; each service node is also placed on the Hash ring as an object, each service node corresponds to a Bloom Filter structure, and each service node handles the requests that fall in its range;
S2, each node initializes a Bloom Filter structure, namely an array with a length of n bits, with all bits initialized to 0;
S3, carrying out Hash calculation on the newly obtained URL to obtain H;
S4, obtaining the corresponding service node b according to the position at which H falls on the hash ring;
S5, the corresponding service node b calculates k Hash values of the URL with k Hash functions: H[0], H[1], …, H[k-1];
S6, looking up the bitmap in the Bloom Filter at the k Hash values; if the corresponding bits are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise to step S8;
S7, discarding the repeated URL, and returning to step S3;
S8, putting the URL into the crawler's queue to be processed;
S9, setting the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of service node b to 1;
S10, recording an insertion log whose content is H, H[0], H[1], …, H[k-1];
S11, judging whether the utilization rate of the Bloom Filter reaches a threshold value; if not, returning to step S3, otherwise proceeding to step S12;
S12, adding a node: a new node b+1 is generated and the corresponding Bloom Filter content is migrated, i.e. the write log of the node is replayed; for the records whose H values now belong to node b+1, the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of node b+1 are all set to 1;
S13, the corresponding service node b performs the operation analogous to step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3;
in the Bloom Filter structure, the false positive probability is

$$f = \left(1 - e^{-kc/a}\right)^{k}$$

where k is the number of hash functions, a is the total number of bits, and c is the number of inserted elements; given f, k, and a, the maximum number of elements allowed to be inserted, $c_{\max}$, is:

$$c_{\max} = -\frac{a}{k}\,\ln\!\left(1 - f^{1/k}\right)$$
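The threshold check in step S11 relies on the maximum insert count derived from this false-positive bound. A small sketch of that computation (the parameter values are illustrative assumptions):

```python
import math

def max_insertions(f: float, k: int, a: int) -> int:
    """Largest element count c with (1 - exp(-k*c/a))**k <= f,
    i.e. c = -(a/k) * ln(1 - f**(1/k))."""
    return int(-(a / k) * math.log(1.0 - f ** (1.0 / k)))

# Illustrative values: a 2**20-bit array, k = 7 hash functions,
# target false-positive rate f = 1%.
c_max = max_insertions(f=0.01, k=7, a=1 << 20)
node_split_threshold = int(0.9 * c_max)  # the 0.9 * c threshold named in the description
```

Doubling the bit array roughly doubles the admissible insert count, which is what makes splitting a full node onto a fresh Bloom Filter effective.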
compared with the prior art, the invention has the following beneficial effects:
The method combines consistent Hashing with Bloom Filters. Unlike a preset Bloom Filter, it can dynamically add Bloom Filter nodes as needed, keeps the Bloom Filter false-positive rate within a given bound as the number of URLs grows, and fully exploits the efficiency of the Bloom Filter, making it suitable for building large-scale distributed web crawlers and supporting efficient capture of massive web content.
Drawings
Fig. 1 is a flowchart of the URL deduplication method in a distributed crawler system according to the present invention.
Fig. 2 is a schematic diagram of the Bloom Filter structure obtained by initialization.
Fig. 3 is a schematic diagram of the case where the Hash value H of a URL falls on the AB segment of the hash ring and is processed by service node B.
Fig. 4 is a flowchart of the URL processing performed by service node B.
Fig. 5 is a schematic diagram of adding a service node.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
the technical scheme of the invention is further explained in detail by combining the attached drawings and the detailed implementation mode.
Referring to fig. 1, a URL deduplication method in a distributed crawler system comprises the main steps of initializing the consistent hash ring, deduplicating URLs, and adding service nodes. The service nodes are distributed on a hash ring, each responsible for a segment of hash values. For a given URL, its hash H is computed first and falls somewhere on the ring; the corresponding service node judges whether the URL is a duplicate, discarding it if so and otherwise inserting it into its Bloom Filter. When the number of records inserted on a service node reaches a threshold, a newly added service node shares the load.
1. According to the initially configured number N of service nodes, each service node corresponds to a Bloom Filter structure, and the Bloom Filters are initialized: a bit array of size n is allocated in memory with all bits set to 0. As shown in fig. 1, where N = 3, the large circle is the hash ring with value range 0 to $2^m - 1$. The service nodes A, B, and C fall on the hash ring at their respective Hash values; node A is responsible for the CA segment, node B for the AB segment, and node C for the BC segment (only three service nodes are shown for convenience of illustration; in practice the initial number of service nodes can be set as needed).
2. A hash is computed for the given URL, falling onto the hash ring, and the corresponding service node is found. As shown in fig. 3, the URL's hash H falls on the hash ring within the AB segment and is processed by service node B.
3. Service node B processes the URL: it computes H[0], H[1], …, H[k-1] with k Hash functions and checks whether the corresponding bits in its Bloom Filter structure are all 1. If they are all 1, the URL is reported as a duplicate and discarded; otherwise it is reported as non-duplicate and added to the crawler task queue, an insertion log is recorded, and the insert count is increased by 1.
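The per-node processing just described (steps S5 through S10) can be sketched as follows (a minimal illustration; the salted-MD5 hash construction and parameter names are assumptions):

```python
import hashlib

def bit_positions(url: str, k: int, num_bits: int):
    """k bit positions for a URL (salted-MD5 construction, an assumption)."""
    return [int(hashlib.md5(f"{i}:{url}".encode()).hexdigest(), 16) % num_bits
            for i in range(k)]

def process_url(url, bits, k, task_queue, insert_log):
    """Steps S5-S10 on one service node: duplicates are discarded; new URLs
    are enqueued, their bits set, and a log entry recorded for migration."""
    positions = bit_positions(url, k, len(bits))
    if all(bits[p] for p in positions):
        return False                         # S6/S7: duplicate, discard
    task_queue.append(url)                   # S8: enqueue for crawling
    for p in positions:                      # S9: set the k bits
        bits[p] = 1
    insert_log.append((url, positions))      # S10: insertion log
    return True

bits, queue, log = [0] * 4096, [], []
assert process_url("http://example.com/a", bits, 7, queue, log) is True
assert process_url("http://example.com/a", bits, 7, queue, log) is False  # duplicate
```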
4. Adding a service node. When the number of inserted records exceeds the threshold (set to 0.9 × c), a service node must be added. The position of the new node is computed and determined, the corresponding Bloom Filter bitmap information is imported, and the node is then added into the hash ring. As shown in fig. 5, since the insert record count of service node B has reached the threshold, a new node D is needed to share the load; after node D is inserted, node D handles the AD-segment requests and node B the DB-segment requests. Before starting service, node D must load the existing bitmap, i.e. re-import the write log of service node B: for each record whose H value belongs to the AD segment, the corresponding bits H[0], H[1], …, H[k-1] are set to 1. Node B can re-import its write records in the background, setting H[0], H[1], …, H[k-1] to 1 for the records whose H values belong to the DB segment, and replace the previous Bloom Filter once the import finishes. This completes the addition of a service node.
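The write-log replay that populates node D's Bloom Filter when it is inserted between A and B can be sketched as follows (the log-entry format and the half-open segment convention (start, end] are assumptions):

```python
def in_segment(h: int, start: int, end: int) -> bool:
    """True if h lies in the clockwise segment (start, end] of the ring."""
    if start < end:
        return start < h <= end
    return h > start or h <= end  # segment wraps past position zero

def split_write_log(write_log, pos_a, pos_d):
    """Replay node B's write log after node D is inserted between A and B.
    Each entry is (H, bit_positions); entries whose H falls in the AD
    segment move to the new node D, the rest stay with node B."""
    d_entries, b_entries = [], []
    for h, bit_pos in write_log:
        (d_entries if in_segment(h, pos_a, pos_d) else b_entries).append((h, bit_pos))
    return d_entries, b_entries

log = [(150, [1, 2, 3]), (250, [4, 5, 6])]
d_log, b_log = split_write_log(log, pos_a=100, pos_d=200)
# Each side then sets the listed bit positions to 1 in its own Bloom Filter.
```

Keeping the log of H values alongside bit positions is what makes this split possible: the bit array alone cannot tell which segment an entry came from.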

Claims (1)

1. A URL duplication removing method in a distributed crawler system is characterized by comprising the following steps:
S1, taking the server cluster as a uniform resource pool, and mapping Hash values into a continuous ring space of size $2^m$; each service node is also placed on the Hash ring as an object, each service node corresponds to a Bloom Filter structure, and each service node handles the requests that fall in its range;
S2, each node initializes a Bloom Filter structure, namely an array with a length of n bits, with all bits initialized to 0;
S3, carrying out Hash calculation on the newly acquired URL to obtain H;
S4, obtaining the corresponding service node b according to the position at which H falls on the hash ring;
S5, the corresponding service node b calculates k Hash values of the URL with k Hash functions: H[0], H[1], …, H[k-1];
S6, looking up the bitmap in the Bloom Filter at the k Hash values; if the corresponding bits are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise to step S8;
S7, discarding the repeated URL, and returning to step S3;
S8, putting the URL into the crawler's queue to be processed;
S9, setting the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of service node b to 1;
S10, recording an insertion log whose content is H, H[0], H[1], …, H[k-1];
S11, judging whether the utilization rate of the Bloom Filter reaches a threshold value; if not, returning to step S3, otherwise proceeding to step S12;
S12, adding a node: a new node b+1 is generated and the corresponding Bloom Filter content is migrated, i.e. the write log of the node is replayed; for the records whose H values now belong to node b+1, the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of node b+1 are all set to 1;
S13, the corresponding service node b performs the operation of step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3;
in the Bloom Filter structure, the false positive probability is

$$f = \left(1 - e^{-kc/a}\right)^{k}$$

where k is the number of hash functions, a is the total number of bits, and c is the number of inserted elements; given f, k, and a, the maximum number of elements allowed to be inserted, $C_{\max}$, is:

$$C_{\max} = -\frac{a}{k}\,\ln\!\left(1 - f^{1/k}\right)$$
CN201711047215.9A 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system Active CN107798106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Publications (2)

Publication Number Publication Date
CN107798106A CN107798106A (en) 2018-03-13
CN107798106B true CN107798106B (en) 2023-04-18

Family

ID=61547687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711047215.9A Active CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Country Status (1)

Country Link
CN (1) CN107798106B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959359B (en) * 2018-05-16 2022-10-11 顺丰科技有限公司 Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium
CN108804242B (en) * 2018-05-23 2022-03-22 武汉斗鱼网络科技有限公司 Data counting and duplicate removal method, system, server and storage medium
CN109933739A (en) * 2019-03-01 2019-06-25 重庆邮电大学移通学院 A kind of Web page sequencing method and system based on transition probability
CN110399546B (en) * 2019-07-23 2022-02-08 中南民族大学 Link duplicate removal method, device, equipment and storage medium based on web crawler
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN105653629A (en) * 2015-12-28 2016-06-08 湖南蚁坊软件有限公司 Hash ring-based distributed data filter method
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102663058B (en) * 2012-03-30 2013-12-18 华中科技大学 URL duplication removing method in distributed network crawler system
CN104809182B (en) * 2015-04-17 2016-08-17 东南大学 Based on the web crawlers URL De-weight method that dynamically can divide Bloom Filter
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database
CN107145556B (en) * 2017-04-28 2020-12-29 安徽博约信息科技股份有限公司 Universal distributed acquisition system

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN105653629A (en) * 2015-12-28 2016-06-08 湖南蚁坊软件有限公司 Hash ring-based distributed data filter method
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method

Also Published As

Publication number Publication date
CN107798106A (en) 2018-03-13

Similar Documents

Publication Publication Date Title
CN107798106B (en) URL duplication removing method in distributed crawler system
US11216443B2 (en) Processing device configured for data integrity testing utilizing signature-based multi-phase write operations
US10817385B2 (en) Storage system with backup control utilizing content-based signatures
CN104376053B (en) A kind of storage and retrieval method based on magnanimity meteorological data
US20140195551A1 (en) Optimizing snapshot lookups
CN109819039B (en) File acquisition method, file storage method, server and storage medium
CN103581331B (en) The online moving method of virtual machine and system
CN107562757B (en) Query and access method, device and system based on distributed file system
CN104239575A (en) Virtual machine mirror image file storage and distribution method and device
CN102467572B (en) Data block inquiring method for supporting data de-duplication program
US20170249218A1 (en) Data to be backed up in a backup system
CN106407207B (en) Real-time newly-added data updating method and device
JP6968876B2 (en) Expired backup processing method and backup server
CN110532201B (en) Metadata processing method and device
CN101944124A (en) Distributed file system management method, device and corresponding file system
CN111090618B (en) Data reading method, system and equipment
CN109033360B (en) Data query method, device, server and storage medium
CN103198097A (en) Massive geoscientific data parallel processing method based on distributed file system
CN104834648A (en) Log query method and device
CN112579595A (en) Data processing method and device, electronic equipment and readable storage medium
CN103369002A (en) A resource downloading method and system
CN102523301A (en) Method for caching data on client in cloud storage
CN106354587A (en) Mirror image server and method for exporting mirror image files of virtual machine
CN114969061A (en) Distributed storage method and device for industrial time sequence data
CN103049561A (en) Data compressing method, storage engine and storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A URL Deduplication Method in Distributed Crawler System

Effective date of registration: 20230625

Granted publication date: 20230418

Pledgee: Dongguan Kechuang Financing Guarantee Co.,Ltd.

Pledgor: GUANGDONG SIYU INFORMATION TECHNOLOGY CO.,LTD.

Registration number: Y2023980045419