CN107798106B - URL duplication removing method in distributed crawler system - Google Patents

URL duplication removing method in distributed crawler system

Info

Publication number
CN107798106B
CN107798106B (application CN201711047215.9A)
Authority
CN
China
Prior art keywords
bloom filter
url
hash
node
service node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711047215.9A
Other languages
Chinese (zh)
Other versions
CN107798106A (en)
Inventor
曾映方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Siyu Information Technology Co ltd
Original Assignee
Guangdong Siyu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Siyu Information Technology Co ltd filed Critical Guangdong Siyu Information Technology Co ltd
Priority to CN201711047215.9A priority Critical patent/CN107798106B/en
Publication of CN107798106A publication Critical patent/CN107798106A/en
Application granted granted Critical
Publication of CN107798106B publication Critical patent/CN107798106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a URL deduplication method for a distributed crawler system. Hash values are mapped into a continuous ring space of size 2^m; each node processes one segment of the Hash ring, and each node corresponds to a Bloom Filter structure. Deduplicating a URL consists of computing its Hash to find the corresponding server node, then judging from that node's Bloom Filter content whether the URL has already been seen. By combining consistent Hashing with Bloom Filters, the method can dynamically add Bloom Filter nodes as needed, keeps the false-positive rate of the Bloom Filters within a given bound as the number of URLs grows, and fully exploits the efficiency of Bloom Filters, making it suitable for building large-scale distributed web crawlers and supporting efficient capture of massive web content.

Description

URL duplication removing method in distributed crawler system
Technical Field
The invention relates to the technical field of networks, in particular to a URL duplicate removal method in a distributed crawler system.
Background
A web crawler is a program that automatically captures web content according to certain rules and is widely used across the Internet. A crawler starts downloading page content from designated URL addresses, extracts the URLs found there, and then continues downloading from those URLs. Since a newly extracted URL may already have been processed, processing it again causes repeated downloads and wastes bandwidth and computing resources. URL deduplication is therefore widely applied in network auditing systems and search engine systems. A distributed web crawler system must adopt a strategy for distributing URL tasks across multiple servers for parallel crawling, and the partitioning strategy must be efficient and easy to implement. In a distributed environment, a URL extracted on one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication mainly involves two concerns: URL storage space and URL matching speed. Storage space refers to the maximum number of distinct URLs that can be handled and the memory occupied per URL; matching speed is measured by the time it takes to determine whether a URL record is a duplicate.
The Bloom Filter is an efficient tool for URL deduplication. The main idea of the Bloom Filter deduplication scheme is as follows: a URL is mapped by several different Hash functions to different bits of one bit array, and the URL's acquisition state (whether it has been collected) is identified from the states of those bits. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to judge a URL's acquisition state; the URLs themselves need not be stored, so the occupied space is small and lookups are fast. However, when deciding whether an element belongs to a set, a Bloom Filter may misjudge an element that does not belong to the set as belonging to it. The Bloom Filter algorithm is therefore inexact and carries a certain error rate.
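The bit-array scheme just described can be sketched in a few lines (a minimal illustration, not the patent's implementation; the salted-MD5 hash construction and the parameter values are assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom Filter: k hash functions over an array of num_bits bits."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive k bit positions by salting one base hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # All k bits set -> "probably seen"; any clear bit -> definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

bf = BloomFilter(num_bits=1 << 20, num_hashes=7)
bf.add("http://example.com/page1")
assert "http://example.com/page1" in bf  # no false negatives are possible
```

Note that membership can only err in one direction: an added URL is always reported present, while a never-added URL is reported present only with the small false-positive probability discussed below.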
At present, existing Bloom-Filter-based URL deduplication schemes are deficient in efficiency, scalability, and performance. For example, the Internet Archive crawler uses a 32KB Bloom Filter per website to store all of that site's URLs. The amide crawler uses 8KB Bloom Filters per website and decides which Bloom Filter a URL is stored in according to the hash of the site's domain name. In general, the Bloom Filter organization in these schemes is fixed. For some large-scale web portals, the number of URLs under the domain name is so large that a fixed-size, limited-length Bloom Filter can hardly store them; for small websites, an oversized Bloom Filter wastes storage space. In addition, information acquisition systems with high performance requirements often deploy many web crawlers in a distributed environment to crawl pages in parallel, and for URL deduplication in such scenarios most existing Bloom Filters and their variants are ill-suited.
Consistent hashing is a special hash algorithm. With consistent hashing, a change in the number of hash-table slots requires remapping only K/n keys on average, where K is the number of keys and n the number of slots; in a conventional hash table, adding or deleting a slot remaps almost all keys. Consistent hashing maps each object to a point on the edge of a circle, and the system maps the available node machines onto different positions of the same circle. To find the machine for an object, the object's position on the circle is computed with the consistent hash function, and the circle is followed clockwise until a node machine is met; that machine is where the object is stored. When a node machine is removed, all objects stored on it move to the next machine on the circle; when a machine is added at some point on the circle, the next machine after that point hands over to the new machine the objects that now map to it. The distribution of objects across node machines can be tuned by adjusting the machines' positions.
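The clockwise lookup described above can be sketched as follows (a minimal illustration; SHA-1 as the ring hash, the 2^32 ring size, and the node names are assumptions):

```python
import bisect
import hashlib

RING_BITS = 32  # assumed ring size of 2**32 positions

def ring_hash(key: str) -> int:
    """Map a key (node name or URL) to a position on the ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << RING_BITS)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Place every node on the ring at the hash of its name.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        # Walk clockwise from the key's position to the first node, wrapping
        # around position zero if necessary.
        h = ring_hash(key)
        idx = bisect.bisect_right(self.points, (h, "\uffff"))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
owner = ring.lookup("http://example.com/page1")  # deterministic for a fixed ring
```

With this placement, adding a node takes over only part of one neighbour's segment, which is why only that neighbour's state needs migrating.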
Disclosure of Invention
Aiming at the problems and defects of the prior art, the invention provides a URL deduplication method for a distributed crawler system that keeps the false-positive rate below a given threshold as the number of URLs keeps growing and fully exploits the efficiency of the Bloom Filter, making it well suited to building large-scale distributed web crawler systems.
In order to achieve the purpose, the invention provides the following technical scheme:
A URL duplicate removal method in a distributed crawler system comprises the following steps:
S1, taking the server cluster as a uniform resource pool, and mapping Hash values into a continuous ring space of size $2^m$; each service node is also placed on the Hash ring as an object, each service node corresponds to a Bloom Filter structure, and each service node handles the requests that fall in its range;
S2, each node initializes a Bloom Filter structure, namely an array with a length of n bits, with all bits initialized to 0;
S3, carrying out Hash calculation on the newly obtained URL to obtain H;
S4, obtaining the corresponding service node b according to the position at which H falls on the hash ring;
S5, the corresponding service node b calculates k Hash values of the URL with k Hash functions: H[0], H[1], …, H[k-1];
S6, looking up the bitmap in the Bloom Filter at the k Hash values; if the corresponding bits are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise to step S8;
S7, discarding the repeated URL, and returning to step S3;
S8, putting the URL into the crawler's queue to be processed;
S9, setting the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of service node b to 1;
S10, recording an insertion log whose content is H, H[0], H[1], …, H[k-1];
S11, judging whether the utilization rate of the Bloom Filter reaches a threshold value; if not, returning to step S3, otherwise proceeding to step S12;
S12, adding a node: a new node b+1 is generated and the corresponding Bloom Filter content is migrated, i.e. the write log of the node is replayed; for the records whose H values now belong to node b+1, the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of node b+1 are all set to 1;
S13, the corresponding service node b performs the operation analogous to step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3;
in the Bloom Filter structure, the false positive probability is

$$f = \left(1 - e^{-kc/a}\right)^{k}$$

where k is the number of hash functions, a is the total number of bits, and c is the number of inserted elements; given f, k, and a, the maximum number of elements allowed to be inserted, $c_{\max}$, is:

$$c_{\max} = -\frac{a}{k}\,\ln\!\left(1 - f^{1/k}\right)$$
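The threshold check in step S11 relies on the maximum insert count derived from this false-positive bound. A small sketch of that computation (the parameter values are illustrative assumptions):

```python
import math

def max_insertions(f: float, k: int, a: int) -> int:
    """Largest element count c with (1 - exp(-k*c/a))**k <= f,
    i.e. c = -(a/k) * ln(1 - f**(1/k))."""
    return int(-(a / k) * math.log(1.0 - f ** (1.0 / k)))

# Illustrative values: a 2**20-bit array, k = 7 hash functions,
# target false-positive rate f = 1%.
c_max = max_insertions(f=0.01, k=7, a=1 << 20)
node_split_threshold = int(0.9 * c_max)  # the 0.9 * c threshold named in the description
```

Doubling the bit array roughly doubles the admissible insert count, which is what makes splitting a full node onto a fresh Bloom Filter effective.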
compared with the prior art, the invention has the following beneficial effects:
The method combines consistent Hashing with Bloom Filters. Unlike a preset Bloom Filter, it can dynamically add Bloom Filter nodes as needed, keeps the Bloom Filter false-positive rate within a given bound as the number of URLs grows, and fully exploits the efficiency of the Bloom Filter, making it suitable for building large-scale distributed web crawlers and supporting efficient capture of massive web content.
Drawings
Fig. 1 is a flowchart of the URL deduplication method in a distributed crawler system according to the present invention.
Fig. 2 is a schematic diagram of the Bloom Filter structure obtained by initialization.
Fig. 3 is a schematic diagram of the case where the Hash value H of a URL falls on the AB segment of the hash ring and is processed by service node B.
Fig. 4 is a flowchart of the URL processing performed by service node B.
Fig. 5 is a schematic diagram of adding a service node.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
the technical scheme of the invention is further explained in detail by combining the attached drawings and the detailed implementation mode.
Referring to fig. 1, a URL deduplication method in a distributed crawler system comprises the main steps of initializing the consistent hash ring, deduplicating URLs, and adding service nodes. The service nodes are distributed on a hash ring, each responsible for a segment of hash values. For a given URL, its hash H is computed first and falls somewhere on the ring; the corresponding service node judges whether the URL is a duplicate, discarding it if so and otherwise inserting it into its Bloom Filter. When the number of records inserted on a service node reaches a threshold, a newly added service node shares the load.
1. According to the initially configured number N of service nodes, each service node corresponds to a Bloom Filter structure, and the Bloom Filters are initialized: a bit array of size n is allocated in memory with all bits set to 0. As shown in fig. 1, where N = 3, the large circle is the hash ring with value range 0 to $2^m - 1$. The service nodes A, B, and C fall on the hash ring at their respective Hash values; node A is responsible for the CA segment, node B for the AB segment, and node C for the BC segment (only three service nodes are shown for convenience of illustration; in practice the initial number of service nodes can be set as needed).
2. A hash is computed for the given URL, falling onto the hash ring, and the corresponding service node is found. As shown in fig. 3, the URL's hash H falls on the hash ring within the AB segment and is processed by service node B.
3. Service node B processes the URL: it computes H[0], H[1], …, H[k-1] with k Hash functions and checks whether the corresponding bits in its Bloom Filter structure are all 1. If they are all 1, the URL is reported as a duplicate and discarded; otherwise it is reported as non-duplicate and added to the crawler task queue, an insertion log is recorded, and the insert count is increased by 1.
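The per-node processing just described (steps S5 through S10) can be sketched as follows (a minimal illustration; the salted-MD5 hash construction and parameter names are assumptions):

```python
import hashlib

def bit_positions(url: str, k: int, num_bits: int):
    """k bit positions for a URL (salted-MD5 construction, an assumption)."""
    return [int(hashlib.md5(f"{i}:{url}".encode()).hexdigest(), 16) % num_bits
            for i in range(k)]

def process_url(url, bits, k, task_queue, insert_log):
    """Steps S5-S10 on one service node: duplicates are discarded; new URLs
    are enqueued, their bits set, and a log entry recorded for migration."""
    positions = bit_positions(url, k, len(bits))
    if all(bits[p] for p in positions):
        return False                         # S6/S7: duplicate, discard
    task_queue.append(url)                   # S8: enqueue for crawling
    for p in positions:                      # S9: set the k bits
        bits[p] = 1
    insert_log.append((url, positions))      # S10: insertion log
    return True

bits, queue, log = [0] * 4096, [], []
assert process_url("http://example.com/a", bits, 7, queue, log) is True
assert process_url("http://example.com/a", bits, 7, queue, log) is False  # duplicate
```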
4. Adding a service node. When the number of inserted records exceeds the threshold (set to 0.9 × c), a service node must be added. The position of the new node is computed and determined, the corresponding Bloom Filter bitmap information is imported, and the node is then added into the hash ring. As shown in fig. 5, since the insert record count of service node B has reached the threshold, a new node D is needed to share the load; after node D is inserted, node D handles the AD-segment requests and node B the DB-segment requests. Before starting service, node D must load the existing bitmap, i.e. re-import the write log of service node B: for each record whose H value belongs to the AD segment, the corresponding bits H[0], H[1], …, H[k-1] are set to 1. Node B can re-import its write records in the background, setting H[0], H[1], …, H[k-1] to 1 for the records whose H values belong to the DB segment, and replace the previous Bloom Filter once the import finishes. This completes the addition of a service node.
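The write-log replay that populates node D's Bloom Filter when it is inserted between A and B can be sketched as follows (the log-entry format and the half-open segment convention (start, end] are assumptions):

```python
def in_segment(h: int, start: int, end: int) -> bool:
    """True if h lies in the clockwise segment (start, end] of the ring."""
    if start < end:
        return start < h <= end
    return h > start or h <= end  # segment wraps past position zero

def split_write_log(write_log, pos_a, pos_d):
    """Replay node B's write log after node D is inserted between A and B.
    Each entry is (H, bit_positions); entries whose H falls in the AD
    segment move to the new node D, the rest stay with node B."""
    d_entries, b_entries = [], []
    for h, bit_pos in write_log:
        (d_entries if in_segment(h, pos_a, pos_d) else b_entries).append((h, bit_pos))
    return d_entries, b_entries

log = [(150, [1, 2, 3]), (250, [4, 5, 6])]
d_log, b_log = split_write_log(log, pos_a=100, pos_d=200)
# Each side then sets the listed bit positions to 1 in its own Bloom Filter.
```

Keeping the log of H values alongside bit positions is what makes this split possible: the bit array alone cannot tell which segment an entry came from.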

Claims (1)

1. A URL duplication removing method in a distributed crawler system is characterized by comprising the following steps:
S1, taking the server cluster as a uniform resource pool, and mapping Hash values into a continuous ring space of size $2^m$; each service node is also placed on the Hash ring as an object, each service node corresponds to a Bloom Filter structure, and each service node handles the requests that fall in its range;
S2, each node initializes a Bloom Filter structure, namely an array with a length of n bits, with all bits initialized to 0;
S3, carrying out Hash calculation on the newly acquired URL to obtain H;
S4, obtaining the corresponding service node b according to the position at which H falls on the hash ring;
S5, the corresponding service node b calculates k Hash values of the URL with k Hash functions: H[0], H[1], …, H[k-1];
S6, looking up the bitmap in the Bloom Filter at the k Hash values; if the corresponding bits are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise to step S8;
S7, discarding the repeated URL, and returning to step S3;
S8, putting the URL into the crawler's queue to be processed;
S9, setting the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of service node b to 1;
S10, recording an insertion log whose content is H, H[0], H[1], …, H[k-1];
S11, judging whether the utilization rate of the Bloom Filter reaches a threshold value; if not, returning to step S3, otherwise proceeding to step S12;
S12, adding a node: a new node b+1 is generated and the corresponding Bloom Filter content is migrated, i.e. the write log of the node is replayed; for the records whose H values now belong to node b+1, the bit positions H[0], H[1], …, H[k-1] in the Bloom Filter of node b+1 are all set to 1;
S13, the corresponding service node b performs the operation of step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3;
in the Bloom Filter structure, the false positive probability is

$$f = \left(1 - e^{-kc/a}\right)^{k}$$

where k is the number of hash functions, a is the total number of bits, and c is the number of inserted elements; given f, k, and a, the maximum number of elements allowed to be inserted, $C_{\max}$, is:

$$C_{\max} = -\frac{a}{k}\,\ln\!\left(1 - f^{1/k}\right)$$
CN201711047215.9A 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system Active CN107798106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Publications (2)

Publication Number Publication Date
CN107798106A CN107798106A (en) 2018-03-13
CN107798106B true CN107798106B (en) 2023-04-18

Family

ID=61547687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711047215.9A Active CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Country Status (1)

Country Link
CN (1) CN107798106B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959359B (en) * 2018-05-16 2022-10-11 顺丰科技有限公司 Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium
CN108804242B (en) * 2018-05-23 2022-03-22 武汉斗鱼网络科技有限公司 Data counting and duplicate removal method, system, server and storage medium
CN109933739A (en) * 2019-03-01 2019-06-25 重庆邮电大学移通学院 A kind of Web page sequencing method and system based on transition probability
CN110399546B (en) * 2019-07-23 2022-02-08 中南民族大学 Link duplicate removal method, device, equipment and storage medium based on web crawler
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN105653629A (en) * 2015-12-28 2016-06-08 湖南蚁坊软件有限公司 Hash ring-based distributed data filter method
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102663058B (en) * 2012-03-30 2013-12-18 华中科技大学 URL duplication removing method in distributed network crawler system
CN104809182B (en) * 2015-04-17 2016-08-17 东南大学 Based on the web crawlers URL De-weight method that dynamically can divide Bloom Filter
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database
CN107145556B (en) * 2017-04-28 2020-12-29 安徽博约信息科技股份有限公司 Universal distributed acquisition system

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN105653629A (en) * 2015-12-28 2016-06-08 湖南蚁坊软件有限公司 Hash ring-based distributed data filter method
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method

Also Published As

Publication number Publication date
CN107798106A (en) 2018-03-13

Similar Documents

Publication Publication Date Title
CN107798106B (en) URL duplication removing method in distributed crawler system
US11216443B2 (en) Processing device configured for data integrity testing utilizing signature-based multi-phase write operations
US10817385B2 (en) Storage system with backup control utilizing content-based signatures
CN104376053B (en) A kind of storage and retrieval method based on magnanimity meteorological data
US20140195551A1 (en) Optimizing snapshot lookups
CN109819039B (en) File acquisition method, file storage method, server and storage medium
CN103581331B (en) The online moving method of virtual machine and system
CN107562757B (en) Query and access method, device and system based on distributed file system
CN104239575A (en) Virtual machine mirror image file storage and distribution method and device
CN102467572B (en) Data block inquiring method for supporting data de-duplication program
US20170249218A1 (en) Data to be backed up in a backup system
CN106407207B (en) Real-time newly-added data updating method and device
JP6968876B2 (en) Expired backup processing method and backup server
CN110532201B (en) Metadata processing method and device
CN101944124A (en) Distributed file system management method, device and corresponding file system
CN111090618B (en) Data reading method, system and equipment
CN109033360B (en) Data query method, device, server and storage medium
CN103198097A (en) Massive geoscientific data parallel processing method based on distributed file system
CN104834648A (en) Log query method and device
CN112579595A (en) Data processing method and device, electronic equipment and readable storage medium
CN103369002A (en) A resource downloading method and system
CN102523301A (en) Method for caching data on client in cloud storage
CN106354587A (en) Mirror image server and method for exporting mirror image files of virtual machine
CN114969061A (en) Distributed storage method and device for industrial time sequence data
CN103049561A (en) Data compressing method, storage engine and storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A URL Deduplication Method in Distributed Crawler System

Effective date of registration: 20230625

Granted publication date: 20230418

Pledgee: Dongguan Kechuang Financing Guarantee Co.,Ltd.

Pledgor: GUANGDONG SIYU INFORMATION TECHNOLOGY CO.,LTD.

Registration number: Y2023980045419