CN107193960B - Distributed crawler system and periodic incremental grabbing method

Info

Publication number: CN107193960B
Application number: CN201710372282.1A
Authority: CN
Inventors: 张雷; 韩建军; 张文哲; 谭龙海; 王崇骏
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2020-11-10
Anticipated expiration: 2037-05-24
Also published as: CN107193960A

Abstract

The invention discloses a distributed crawler system configured as three major parts: a ZooKeeper-based distributed service, system components, and a database. The system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger, and a basic crawler component Spider. The database comprises a Redis in-memory database, Redis being a key-value store, in which a distributed URL task queue and a distributed BloomFilter are stored. The invention also discloses a periodic incremental grabbing method based on the system, which comprises the following steps: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; and each Spider component sleeps or performs periodic incremental grabbing according to the current state of the distributed URL task queue. The system and the method solve the problem of how to effectively combine single-machine crawlers, realize a distributed crawler with high availability, high stability, and high throughput in a cluster environment, and realize periodic incremental grabbing.

Description

Distributed crawler system and periodic incremental grabbing method

Technical Field

The invention relates to the technical field of efficient data acquisition of internet big data, in particular to a distributed crawler system and a periodic incremental grabbing method.

Background

A web crawler starts from the URLs (Uniform Resource Locators) of one or more initial web pages and obtains the URLs found on those pages. While capturing web pages, it continuously extracts new URLs from the current page and puts them into a task queue according to its capturing strategy, until the stop condition of the system is met.

With the rapid development of the internet, network data is growing explosively and its sources are increasingly diverse. Faced with such huge and diverse internet data, it is important both to improve the capturing efficiency of web crawlers and to support customizable crawling strategies for different data sources.

Compared with a traditional stand-alone crawler, a distributed crawler can markedly improve crawling efficiency, but it also introduces new problems in the distributed environment: multi-node task distribution, load balancing, duplicate web pages, periodic incremental grabbing, and the like.

In summary, the main problem in the prior art is how to solve the series of problems introduced by distributed crawlers while effectively improving crawling speed, and, on that basis, how to realize periodic incremental grabbing by a distributed crawler.

Disclosure of Invention

The invention aims to provide a distributed crawler system and a periodic incremental grabbing method, which solve the problem of how to effectively combine single-machine crawlers together, realize the distributed crawlers with high availability, high stability and high throughput rate in a cluster environment, and realize the periodic incremental grabbing. The technical scheme adopted by the invention for solving the problems is as follows:

The invention discloses a distributed crawler system configured as three major parts: a ZooKeeper-based distributed service, system components, and a database. The system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger, and a basic crawler component Spider; the database comprises a Redis in-memory database, Redis being a key-value store, in which a distributed URL task queue and a distributed BloomFilter are stored. The ZooKeeper-based distributed service provides distributed coordination services for each system component; the system monitoring component Monitor is responsible for dynamic configuration and state monitoring of the system; the coordination component Coordinator is responsible for one or more of: importing the seed URLs into the Redis-based distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes; the log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster; the basic crawler component Spider is responsible for processing web page crawling tasks; the Redis-based distributed URL task queue is responsible for storing all task URLs to be crawled; the Redis-based distributed BloomFilter is responsible for serving the URL deduplication requests of all basic crawler components Spider in the cluster.

Further, the ZooKeeper-based distributed service provides one or more of distributed services including dynamic configuration, cluster node detection and management, Master election, distributed locks, and ID generation of global URLs for each system component through the mutual coordination work with each system component.

Furthermore, the system monitoring component Monitor has a Monitor interface through which a user can modify the system configuration parameters stored on ZooKeeper. The coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider in the cluster watch their corresponding data nodes on ZooKeeper, receive a notification when the content of a data node is modified, and then adjust themselves according to the modified configuration parameters.

Further, the Monitor interface can also display, in real time, the system state parameters and component state parameters stored on ZooKeeper.

Further, the basic crawler component Spider has multiple component kernels, and the crawling strategies of the component kernels are not all the same.

Further, the basic crawler component Spider is highly extensible, so that a new component kernel can conveniently be written for a new data source.

Further, the task distribution mode of the distributed URL task queue is a Pull mode, in which the basic crawler component Spider pulls tasks from the queue.

Furthermore, the distributed BloomFilter adopts a segmentation mechanism that splits the bit vector into segments stored on different keys of Redis, and realizes synchronization control over access by each basic crawler component Spider through a segmented optimistic lock.

The invention also discloses a periodic incremental grabbing method based on the distributed crawler system, which comprises the following steps: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; each Spider component sleeps or performs periodic incremental grabbing according to the current state of the distributed URL task queue. When there are no tasks to grab, the Spider component enters a dormant state; when a dormant Spider component is woken up by another Spider component or by the Coordinator, it continues grabbing tasks.

Further, the method comprises the following steps:

S1, the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components. That is, the Coordinator component of the system periodically imports tasks into the distributed URL queue, and after the tasks are imported, the Coordinator wakes up all dormant Spiders to start a new round of incremental grabbing. The grabbing task is executed periodically, and each period starts with the import of the seed tasks.

S2, the Spider component judges whether the periodic incremental grabbing of the system is finished; if so, S6 is executed, otherwise S3 is executed. That is, a grabbing thread in the Spider component checks the corresponding data node information in ZooKeeper, which is set by the Monitor. When it reads that the periodic incremental grabbing of the system is finished, the Spider component performs a series of cleanup and save operations and then ends its own process; otherwise, periodic incremental grabbing continues.

S3, judging whether the current distributed task queue is empty; if so, S4 is executed, otherwise S5 is executed. That is, a grabbing thread in the Spider component checks whether tasks to be grabbed still exist in the distributed task queue in Redis; if so, it acquires a task and enters the grabbing stage; otherwise, it enters the dormancy stage.

S4, entering the dormancy stage of the basic crawler component Spider, which mainly comprises: 1) blocking the grabbing thread or putting the Spider component into dormancy; 2) the blocked thread being woken up.

The stage specifically comprises the following steps: a) judging whether all grabbing threads of the current Spider component other than the current grabbing thread are blocked; if so, executing step b), otherwise executing step c); b) creating a dormancy marking node in ZooKeeper, which indicates that the current Spider component is dormant; when another component needs to wake up the Spider component, it only needs to delete this data node; c) blocking the grabbing thread; d) the grabbing thread remains blocked and waits to be woken up by another thread; e) the grabbing thread is woken up by another thread and executes S2.

The Spider component enters this stage when there are no tasks to grab; it sleeps to avoid wasting system resources on idling. When another Spider component adds new tasks to the task queue, or a new round of incremental grabbing starts, the dormant Spider component is woken up by the other Spider component or by the Coordinator component and continues grabbing tasks.

S5, entering the grabbing stage of the basic crawler component Spider, which mainly comprises: 1) acquiring a task from the distributed URL task queue; 2) the Spider executing the grabbing task; 3) waking up blocked grabbing threads or dormant Spider components.

The stage specifically comprises the following steps: a) acquiring a grabbing task from the distributed URL queue; b) capturing the corresponding web page according to the acquired task and storing the result; c) parsing the hyperlinks of the captured web page to obtain a new task set; d) sending the new tasks to the distributed BloomFilter for deduplication; e) adding the deduplicated new tasks to the distributed task queue; f) judging whether the current Spider component has any blocked grabbing thread; if so, executing step g), otherwise executing step h); g) waking up the blocked grabbing threads in the current Spider component; h) judging whether other Spider components in the current cluster are dormant; if so, waking up the corresponding dormant Spider components and then executing S2; otherwise, executing S2 directly.

S6, ending. That is, when each component detects that the system needs to stop working, it performs the necessary cleanup and then ends its own process.
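By way of illustration only, step S1 may be sketched as follows. The sketch assumes an in-memory `deque` in place of the Redis-based task queue and a Python `set` in place of the dormancy marking nodes on ZooKeeper; the function names `start_round` and `run_coordinator`, the `period_s` parameter, and the injectable `wait` callback are hypothetical and do not appear in the disclosure.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Sequence, Set


def start_round(seeds: Sequence[str], queue: Deque[str], dormant: Set[str]) -> List[str]:
    """One execution of S1: import the seed tasks, then wake dormant Spiders.

    Waking is modelled as deleting each dormancy marker (assumption: a set of
    Spider ids stands in for the marking nodes kept on ZooKeeper).
    """
    queue.extend(seeds)          # import the seed tasks for the new period
    woken = sorted(dormant)      # Spiders that were dormant before the import
    dormant.clear()              # deleting a marker is the wake-up signal
    return woken


def run_coordinator(
    seeds: Sequence[str],
    queue: Deque[str],
    dormant: Set[str],
    period_s: float,
    rounds: int,
    wait: Callable[[float], None] = time.sleep,
) -> None:
    """Repeat S1 once per period; `rounds` bounds the loop for demonstration."""
    for _ in range(rounds):
        start_round(seeds, queue, dormant)
        wait(period_s)
```

Passing a recording function as `wait` lets the loop be exercised without real delays; a deployment would presumably take the period from the seed import period configured through the Monitor.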

Compared with the prior art, and in the face of huge and diverse internet data, the distributed crawler system and the periodic incremental grabbing method have the following beneficial effects:

1) Simple implementation: the distributed crawler system is built on the open-source distributed coordination service ZooKeeper and the open-source distributed in-memory database Redis, and further development is carried out on top of these existing frameworks, so that the specific requirements are met and development cost is reduced.

2) High performance: the grabbing task adopts a multi-node, multi-thread working mode, which achieves high-performance web page grabbing and supports linear scaling of the Spider component.

3) High availability: based on ZooKeeper and Redis, all components of the system work in cluster mode, so that single points of failure are avoided and a highly available, highly stable web page grabbing service is provided externally.

4) Automatic periodic incremental grabbing: once the initial tasks and the related system parameters have been set, the system automatically performs periodic incremental grabbing without human intervention.

5) Customizable grabbing strategy: the Spider component comprises a plurality of component kernels, each corresponding to a different crawling strategy, and the Spider is designed as a highly extensible component, so that a new component kernel can conveniently be written for a new data source.

6) Good scalability: all components of the system are loosely coupled, any single node going online or offline has very little influence on the system, and linear scaling of each component is supported.

Therefore, the method has the advantages of reasonable design, simple architecture, high availability, high stability, high performance, good expansibility and the like.

Drawings

FIG. 1 is a diagram of the distributed crawler system architecture

FIG. 2 is a main flow chart of a periodic incremental capture method

FIG. 3 is a flow chart of the Spider grabbing phase

FIG. 4 is a flow chart of the Spider sleep stage

Detailed Description

To better explain the technical content of the invention, specific embodiments are described below with reference to the accompanying drawings.

FIG. 1 is a diagram of the distributed crawler system architecture of the present invention, which includes three major parts: a ZooKeeper-based distributed service, system components, and a database. The ZooKeeper-based distributed service provides distributed coordination services for each system component; the system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger, and a basic crawler component Spider; the database comprises a Redis in-memory database and other databases for storing the captured web pages, and a distributed URL task queue and a distributed BloomFilter are stored in the Redis in-memory database.

The ZooKeeper-based distributed service works in coordination with each system component to provide distributed coordination services such as dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation. ZooKeeper maintains in memory a tree data structure similar to a file system, and the ZooKeeper-based distributed services are realized by each component creating, querying, deleting, and watching its corresponding data nodes in this tree.
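As an illustration only, the tree of data nodes and the create, query, delete, and watch operations mentioned above can be modelled in memory as follows. This is not ZooKeeper's API: the class `ZNodeTree`, its method names, the event strings, and the persistent (rather than one-shot) watchers are all assumptions made for the sketch, and the sequence counter is kept per path prefix as a simplification. The `sequential` option shows one way a globally unique URL ID could be derived from node creation.

```python
from typing import Callable, Dict, List

Watcher = Callable[[str, str], None]  # called as watcher(event, path)


class ZNodeTree:
    """In-memory stand-in for a ZooKeeper-style tree of data nodes (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {"/": b""}
        self._watchers: Dict[str, List[Watcher]] = {}
        self._counters: Dict[str, int] = {}

    def create(self, path: str, data: bytes = b"", sequential: bool = False) -> str:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            # Append a monotonically increasing, zero-padded counter to the name.
            n = self._counters.get(path, 0)
            self._counters[path] = n + 1
            path = f"{path}{n:010d}"
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = data
        return path

    def exists(self, path: str) -> bool:
        return path in self._data

    def get(self, path: str) -> bytes:
        return self._data[path]

    def children(self, path: str) -> List[str]:
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self._data
            if len(p) > len(prefix) and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def watch(self, path: str, watcher: Watcher) -> None:
        self._watchers.setdefault(path, []).append(watcher)

    def set(self, path: str, data: bytes) -> None:
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data
        for watcher in list(self._watchers.get(path, [])):
            watcher("changed", path)

    def delete(self, path: str) -> None:
        del self._data[path]
        # Watchers of a deleted node are notified once and then dropped.
        for watcher in self._watchers.pop(path, []):
            watcher("deleted", path)
```

Listing the children of a membership node corresponds to cluster node detection, and a watcher on a single node corresponds to the change notifications used for dynamic configuration and for waking dormant Spiders.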

The system monitoring component Monitor is responsible for dynamic configuration and state monitoring of the system. Through the Monitor interface, the user can modify the system configuration parameters stored on ZooKeeper (for example, the parameters of the basic crawler component Spider). Each corresponding component in the cluster (including Spiders, Coordinators, and Loggers) watches its corresponding data node on ZooKeeper and, after the content of the data node is modified, receives the data-change notification sent by ZooKeeper and adjusts itself according to the modified configuration parameters. The Monitor interface can also display, in real time, the system state parameters and component state parameters stored on ZooKeeper, so that the user can monitor the system in real time, find problems promptly, and take corresponding remedial measures. The system configuration parameters mainly include the seed import period, regular-expression constraints, the number of grabbing threads, the grabbing depth, the maximum number of errors, and the like, together with many other fine-grained configuration parameters.
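The notification-driven reconfiguration described above can be illustrated with a minimal sketch. It assumes the configuration is stored as JSON in a single data node; the class names `SpiderConfig`, `ConfigNode`, and `ConfiguredSpider`, the field names, and the default values are hypothetical, and `ConfigNode` is an in-memory stand-in rather than a ZooKeeper client.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SpiderConfig:
    """Parameters named in the description; the default values are assumptions."""
    seed_import_period_s: int = 3600
    url_pattern: str = ".*"
    fetch_threads: int = 4
    max_depth: int = 3
    max_errors: int = 10


class ConfigNode:
    """Stand-in for one configuration data node that notifies its watchers."""

    def __init__(self, config: SpiderConfig) -> None:
        self._raw = json.dumps(asdict(config))
        self._listeners: List[Callable[[str], None]] = []

    def read(self) -> str:
        return self._raw

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def write(self, raw: str) -> None:
        # Called on behalf of the Monitor interface; every watcher is notified.
        self._raw = raw
        for listener in list(self._listeners):
            listener(raw)


class ConfiguredSpider:
    """A component that loads its configuration and follows later changes."""

    def __init__(self, node: ConfigNode) -> None:
        self.config = SpiderConfig(**json.loads(node.read()))
        node.subscribe(self._on_change)

    def _on_change(self, raw: str) -> None:
        # Replace the whole configuration atomically with the new snapshot.
        self.config = SpiderConfig(**json.loads(raw))
```

Replacing the configuration object as a whole, rather than mutating individual fields, keeps each grabbing thread's view consistent while an update arrives; that is a design choice of the sketch, not something the description specifies.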

The coordination component Coordinator is responsible for importing the URLs of the seed web pages into the distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes.

The log collection component Logger is responsible for collecting log data from each base crawler component Spider in the cluster for subsequent log analysis.

The basic crawler component Spider is responsible for the actual web page crawling tasks. The Spider component comprises multiple component kernels, each corresponding to a different crawling strategy, and the Spider is designed as a highly extensible component, so that a new component kernel can conveniently be written for a new data source. During crawling, the Spider component first initializes itself according to the system configuration and then repeatedly requests URLs from the distributed task queue. For each URL it switches to the corresponding crawling strategy, crawls the web page, extracts the page features and text, stores the extraction results, parses the page hyperlinks, deduplicates the newly acquired URLs through the distributed BloomFilter, and adds them to the distributed task queue, until the distributed task queue is empty.

The Redis-based distributed URL task queue is responsible for storing all task URLs to be crawled. Task distribution uses a Pull mode driven by the basic crawler component Spider: when a Spider finishes its current crawling task, it actively pulls a new task from the distributed queue to carry out the next round of work. It is worth noting that, for the current Redis-based distributed queue, the Pull mode is the best and simplest choice. Other modes, such as a combination of push and pull, can also be used, but they require additional implementation work, whereas the Pull mode does not.
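The load-balancing effect of the Pull mode can be shown with a small deterministic simulation. It is a sketch under stated assumptions: `TaskQueue` is an in-memory stand-in for the Redis-based queue, `simulate_pull` and the per-worker `cost_per_task` (in abstract ticks) are hypothetical, and the description does not name the Redis commands that a real implementation would use.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional


class TaskQueue:
    """In-memory stand-in for the distributed URL task queue."""

    def __init__(self, urls: Iterable[str] = ()) -> None:
        self._items: Deque[str] = deque(urls)

    def __len__(self) -> int:
        return len(self._items)

    def push(self, url: str) -> None:
        self._items.append(url)

    def pull(self) -> Optional[str]:
        # A worker asks for work only when it is idle; None means "nothing to do".
        return self._items.popleft() if self._items else None


def simulate_pull(queue: TaskQueue, cost_per_task: Dict[str, int]) -> Dict[str, List[str]]:
    """Advance a discrete clock; every idle worker pulls one task per tick."""
    done: Dict[str, List[str]] = {name: [] for name in cost_per_task}
    busy_until: Dict[str, int] = {name: 0 for name in cost_per_task}
    tick = 0
    while len(queue) > 0:
        for name, cost in cost_per_task.items():
            if busy_until[name] <= tick:
                url = queue.pull()
                if url is None:
                    break
                done[name].append(url)
                busy_until[name] = tick + cost
        tick += 1
    return done
```

Because each worker takes a task only when it is free, a worker that needs one tick per task ends up with more tasks than one that needs three, without any central scheduler tracking worker speed.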

The Redis-based distributed BloomFilter is responsible for serving the URL deduplication requests of all basic crawler components Spider in the cluster. Redis is a key-value store; the distributed BloomFilter adopts a segmentation mechanism that splits the bit vector into segments stored on different keys of Redis, and realizes synchronization control over access by each Spider through a segmented optimistic lock. The segmented optimistic lock works as follows: for each deduplication request, the Spider first computes the keys of all bit-vector segments it needs to access, then watches those keys, and then initiates a Redis transaction containing the update requests for those segments. When the transaction is executed, Redis first checks whether the bit vector under any watched key has changed since the watch was set. If it has, the transaction is aborted and the deduplication request is automatically retried; otherwise, the update succeeds and the URL is added to the BloomFilter. A distributed BloomFilter based on the segmentation mechanism and the optimistic lock can not only serve deduplication requests at high throughput, but can also scale with the linear expansion of the Redis cluster, so that it has no capacity limit.
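The segmented optimistic lock can be illustrated with the following minimal sketch. It does not talk to Redis: `VersionedStore` is a hypothetical in-memory stand-in whose `read` and `commit` methods imitate watching keys and executing a transaction that aborts if a watched key changed. The class `SegmentedBloomFilter`, the key prefix `bloom:`, the filter sizes, and the double-hashing scheme based on SHA-256 are likewise assumptions, not details taken from the description.

```python
import hashlib
from typing import Dict, List, Tuple


class VersionedStore:
    """Stand-in for a key-value store with watch-then-commit transactions."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}
        self._versions: Dict[str, int] = {}

    def read(self, key: str) -> Tuple[int, int]:
        """Return (value, version); the version plays the role of a watch."""
        return self._values.get(key, 0), self._versions.get(key, 0)

    def commit(self, watched: Dict[str, int], writes: Dict[str, int]) -> bool:
        """Apply all writes only if no watched key changed since it was read."""
        if any(self._versions.get(k, 0) != v for k, v in watched.items()):
            return False  # transaction aborted, caller should retry
        for key, value in writes.items():
            self._values[key] = value
            self._versions[key] = self._versions.get(key, 0) + 1
        return True


class SegmentedBloomFilter:
    def __init__(self, store: VersionedStore, total_bits: int = 1 << 20,
                 segment_bits: int = 1 << 12, hashes: int = 5) -> None:
        self.store = store
        self.total_bits = total_bits
        self.segment_bits = segment_bits
        self.hashes = hashes

    def _positions(self, url: str) -> List[int]:
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.total_bits for i in range(self.hashes)]

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was added, False if it was (probably) seen."""
        masks: Dict[str, int] = {}  # segment key -> bits to set in that segment
        for pos in self._positions(url):
            key = f"bloom:{pos // self.segment_bits}"
            masks[key] = masks.get(key, 0) | (1 << (pos % self.segment_bits))
        while True:
            snapshot = {key: self.store.read(key) for key in masks}
            if all(snapshot[k][0] & m == m for k, m in masks.items()):
                return False
            watched = {k: version for k, (_, version) in snapshot.items()}
            writes = {k: snapshot[k][0] | m for k, m in masks.items()}
            if self.store.commit(watched, writes):
                return True
            # A watched segment changed in the meantime: recompute and retry.
```

Only the segments touched by a given URL are watched, so two Spiders conflict only when their URLs hash into a common segment; this is what lets unrelated deduplication requests proceed in parallel. As with any Bloom filter, a `False` result may occasionally be a false positive.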

The embodiment also discloses a periodic increment grabbing method based on the distributed crawler system, and the method is described in detail with reference to fig. 2 to 4.

Fig. 2 is a main flowchart of the periodic increment capture method in the embodiment, which is specifically introduced as follows:

Step 1-0, initial state of the periodic incremental grabbing method;

Step 1-1, the coordination component Coordinator periodically imports tasks into the distributed URL task queue;

Step 1-2, judging whether the periodic incremental grabbing of the system is finished: if so, proceeding to step 1-9, otherwise executing step 1-3;

Step 1-3, judging whether the current distributed task queue is empty: if so, entering the Spider dormancy stage and executing the corresponding steps 1-4 and 1-5, otherwise entering the Spider grabbing stage and executing the corresponding steps 1-6, 1-7, and 1-8;

Step 1-4, blocking the grabbing thread or putting the Spider component into dormancy;

Step 1-5, the blocked thread is woken up, and step 1-2 is executed;

Step 1-6, acquiring a task from the distributed queue;

Step 1-7, the Spider executes the specific grabbing task;

Step 1-8, waking up blocked grabbing threads or dormant Spider components, and executing step 1-2;

Step 1-9, end state.
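The control flow of steps 1-2 to 1-9 can be summarized in a short single-threaded sketch. The function `spider_main_loop` and its callback parameters are hypothetical; the queue is an in-memory `deque`, and blocking is delegated to a callback so that the loop itself stays easy to follow.

```python
from collections import deque
from typing import Callable, Deque, List


def spider_main_loop(
    queue: Deque[str],
    should_stop: Callable[[], bool],
    grab: Callable[[str], None],
    sleep_until_woken: Callable[[], None],
) -> List[str]:
    """Run the main flow for one grabbing thread and return a trace of states."""
    trace: List[str] = []
    while True:
        if should_stop():                 # step 1-2: has grabbing been ended?
            trace.append("end")           # step 1-9
            return trace
        if not queue:                     # step 1-3: is the task queue empty?
            trace.append("sleep")
            sleep_until_woken()           # steps 1-4 and 1-5
            continue                      # back to step 1-2
        url = queue.popleft()             # step 1-6
        trace.append(f"grab {url}")
        grab(url)                         # step 1-7 (wake-ups of step 1-8 happen inside)
```

Note that the stop condition is re-checked after every wake-up, which is why a thread woken for shutdown terminates instead of looking for tasks again.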

Fig. 3 is a flowchart of the Spider grabbing stage in the embodiment, which is specifically introduced as follows:

Step 2-0, start state of the Spider grabbing stage; this step immediately follows step 1-3;

Step 2-1, acquiring a task from the distributed queue;

Step 2-2, capturing the corresponding web page according to the acquired task and storing the result;

Step 2-3, parsing the hyperlinks of the captured web page and obtaining a new task set;

Step 2-4, sending the new tasks to the distributed BloomFilter for deduplication;

Step 2-5, adding the deduplicated new tasks to the distributed task queue;

Step 2-6, judging whether the Spider component has any blocked grabbing thread; if so, executing step 2-7, otherwise executing step 2-8;

Step 2-7, waking up the blocked grabbing threads of the Spider component;

Step 2-8, judging whether other Spider components in the current cluster are dormant; if so, executing step 2-9, otherwise executing step 2-10;

Step 2-9, waking up the dormant Spiders;

Step 2-10, end state of the Spider grabbing stage, after which step 1-2 is executed.
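Steps 2-1 to 2-5 can be sketched as a single function. The sketch makes several assumptions: pages are obtained through a caller-supplied `fetch` function rather than real HTTP requests, a Python `set` stands in for the distributed BloomFilter, a `deque` stands in for the distributed task queue, and the names `LinkParser` and `grab_once` are hypothetical. The wake-up steps 2-6 to 2-9 are omitted.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Deque, Dict, List, Set
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def grab_once(
    queue: Deque[str],
    seen: Set[str],
    fetch: Callable[[str], str],
    results: Dict[str, str],
) -> int:
    """Process one task and return how many new tasks were enqueued."""
    url = queue.popleft()              # step 2-1: acquire a task
    html = fetch(url)                  # step 2-2: capture the page ...
    results[url] = html                # ... and store the result
    parser = LinkParser(url)
    parser.feed(html)                  # step 2-3: parse hyperlinks
    added = 0
    for link in parser.links:
        if link not in seen:           # step 2-4: deduplicate
            seen.add(link)
            queue.append(link)         # step 2-5: enqueue the new task
            added += 1
    return added
```

Relative links are resolved against the URL of the page they were found on, so that the same target reached through different relative paths is deduplicated as one task.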

Fig. 4 is a flowchart of the Spider dormancy stage in the embodiment, which is specifically introduced as follows:

Step 3-0, start state of the Spider dormancy stage; this step immediately follows step 1-3;

Step 3-1, judging whether all grabbing threads of the Spider component other than the current grabbing thread are blocked; if so, executing step 3-2, otherwise executing step 3-3;

Step 3-2, creating a dormancy marking node in ZooKeeper, which indicates that the corresponding Spider component is dormant; when another component needs to wake up the Spider component, it only needs to delete this data node;

Step 3-3, blocking the grabbing thread;

Step 3-4, the grabbing thread remains blocked and waits to be woken up by another thread;

Step 3-5, the grabbing thread is woken up by another thread;

Step 3-6, end state of the Spider dormancy stage, after which step 1-2 is executed.
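The dormancy stage can be illustrated with Python's `threading.Condition`. This is a sketch under stated assumptions: the class `SleepGate` and its methods are hypothetical, a shared `set` of Spider ids stands in for the dormancy marking nodes on ZooKeeper, and `wake` both deletes the marker and notifies the threads, whereas in the described system the deletion of the node would be observed through ZooKeeper.

```python
import threading
from typing import Set


class SleepGate:
    """Blocks grabbing threads and marks the Spider dormant when all are blocked."""

    def __init__(self, spider_id: str, thread_count: int, markers: Set[str]) -> None:
        self.spider_id = spider_id
        self.thread_count = thread_count
        self.markers = markers          # stand-in for dormancy marking nodes
        self._cond = threading.Condition()
        self._blocked = 0
        self._generation = 0            # incremented on every wake-up

    def sleep(self) -> None:
        with self._cond:
            # Step 3-1: are all other grabbing threads already blocked?
            if self._blocked == self.thread_count - 1:
                self.markers.add(self.spider_id)       # step 3-2
            self._blocked += 1                         # step 3-3
            self._cond.notify_all()    # lets wait_for_blocked() see the new count
            generation = self._generation
            while generation == self._generation:      # step 3-4
                self._cond.wait()
            self._blocked -= 1                         # step 3-5

    def wake(self) -> None:
        with self._cond:
            self.markers.discard(self.spider_id)       # delete the marker
            self._generation += 1
            self._cond.notify_all()

    def wait_for_blocked(self, count: int, timeout: float = 5.0) -> bool:
        with self._cond:
            return self._cond.wait_for(lambda: self._blocked >= count, timeout)

    def blocked_count(self) -> int:
        with self._cond:
            return self._blocked
```

The generation counter guards against spurious wake-ups: a thread resumes only after `wake` has actually been called since it went to sleep. Only the last thread to block creates the marker, so other components treat the Spider as dormant only when none of its threads is still working.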

Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims

1. A distributed crawler system is characterized in that the system is configured to be distributed service based on ZooKeeper, system components and a database, wherein the system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger and a basic crawler component Spider, the database comprises a Redis memory database, and a distributed URL task queue and a distributed BloomFilter are stored in the Redis memory database; wherein, the ZooKeeper-based distributed service provides distributed coordination service for each system component,

the system monitoring component Monitor is responsible for dynamic configuration of the system and status monitoring of the system,

the coordination component Coordinator is responsible for one or more of: importing the seed URLs into the Redis-based distributed task queue, periodically summarizing the state of each node to ZooKeeper, dynamically allocating log sources to the log collection component Logger, and detecting and managing cluster nodes,

the log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster, the basic crawler component Spider is responsible for processing the crawling task of the web pages,

the Redis-based distributed URL task queue is responsible for storing all task URLs to be crawled,

the distributed BloomFilter based on the Redis is responsible for URL (Uniform resource locator) deduplication requests of all basic crawler components Spiders in the cluster; the ZooKeeper-based distributed service provides one or more of distributed services including dynamic configuration, cluster node detection and management, Master election, distributed locks and ID generation of global URLs for each system component through the mutual coordination work with each system component.

2. The distributed crawler system according to claim 1, wherein the system monitoring component Monitor has a Monitor interface, a user can modify the system configuration parameters stored on ZooKeeper through the Monitor interface, and the coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider in the cluster watch the corresponding data nodes on ZooKeeper, receive corresponding notifications after the contents of the data nodes are modified, and then make corresponding adjustments according to the modified configuration parameters.

3. The distributed crawler system of claim 2, wherein the Monitor interface is further capable of displaying in real time the system state parameters and the component state parameters that exist on the ZooKeeper.

4. A distributed crawler system according to claim 1, wherein said base crawler component Spider component has multiple component kernels, and the crawling strategies of the component kernels are not completely consistent.

5. The distributed crawler system of claim 1, wherein the base crawler component Spider component is highly extensible to facilitate writing new component kernels for new data sources.

6. The distributed crawler system of claim 1, wherein the task distribution of the distributed URL task queue is a pull of a base crawler component Spider.

7. The distributed crawler system according to claim 1, wherein the distributed BloomFilter adopts a segmentation mechanism to split the bit vector into segments stored on different keys of Redis, and realizes synchronization control over access by each basic crawler component Spider through a segmented optimistic lock.

8. A periodic incremental grabbing method, based on the distributed crawler system of any one of claims 1 to 7, comprising: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components; each Spider component sleeps or performs periodic incremental grabbing according to the current state of the distributed URL task queue; when there are no tasks to grab, the Spider component enters a dormant state, and when a dormant Spider component is woken up by another basic crawler component or by the Coordinator component, it continues grabbing tasks.

9. The method of periodic incremental grabbing according to claim 8, comprising the steps of:

S1, the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up dormant Spider components;

S2, judging whether the periodic incremental grabbing of the system is finished; if so, jumping to S6, otherwise executing S3;

S3, judging whether the current distributed task queue is empty; if so, executing S4, otherwise jumping to S5;

S4, entering the dormancy stage of the basic crawler component Spider, comprising the following steps: a) judging whether all grabbing threads of the current Spider component other than the current grabbing thread are blocked; if so, executing step b), otherwise executing step c); b) creating a dormancy marking node in ZooKeeper, which indicates that the current Spider component is dormant, so that when another component needs to wake up the Spider component, it only needs to delete this data node; c) blocking the grabbing thread; d) the grabbing thread remains blocked and waits to be woken up by another thread; e) the grabbing thread is woken up by another thread and executes S2;

S5, entering the grabbing stage of the basic crawler component Spider, comprising: a) acquiring a grabbing task from the distributed URL queue; b) capturing the corresponding web page according to the acquired task and storing the result; c) parsing the hyperlinks of the captured web page and obtaining a new task set; d) sending the new tasks to the distributed BloomFilter for deduplication; e) adding the deduplicated new tasks to the distributed task queue; f) judging whether the current Spider component has any blocked grabbing thread; if so, executing step g), otherwise executing step h); g) waking up the blocked grabbing threads in the current Spider component; h) judging whether other Spider components in the current cluster are dormant; if so, waking up the corresponding dormant Spiders, otherwise executing S2;

S6, ending.