CN111522654A

CN111522654A - Scheduling processing method, device and equipment for distributed crawler

Info

Publication number: CN111522654A
Application number: CN202010190446.0A
Authority: CN
Inventors: 杨绍琛
Original assignee: Dazhu Hangzhou Technology Co ltd
Current assignee: Dazhu Hangzhou Technology Co ltd
Priority date: 2020-03-18
Filing date: 2020-03-18
Publication date: 2020-08-11

Abstract

The application discloses a scheduling processing method, apparatus, and device for distributed crawlers, and relates to the technical field of the Internet. The method comprises: first, allocating the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it; then, while the data source websites are being crawled, acquiring the dynamic acquisition speed of each sub-cluster; and updating the number of servers and the proxy pool bandwidth of each sub-cluster according to the dynamic acquisition speed. In this way, the most reasonable cluster resource configuration can be calculated dynamically from the acquisition speeds of the different data sources, which yields an efficient distributed crawler solution and makes better use of hardware resources to improve the acquisition efficiency of the whole cluster. The method and apparatus are suitable for scheduling distributed crawlers.

Description

Scheduling processing method, device and equipment for distributed crawler

Technical Field

The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a device for scheduling distributed crawlers.

Background

In the field of mass data acquisition, distributed crawlers are widely used because they support high scalability and high concurrency. In practice, a distributed crawler that handles a large number of data acquisition tasks often involves allocating multiple software and hardware resources, including servers, databases, and proxy pools.

At present, most existing distributed crawler systems are designed in a master-slave mode, which can be described as follows: the master node manages task scheduling, including generation and deduplication of the queue of Uniform Resource Locators (URLs) to be crawled, data storage allocation, and the like. The child nodes obtain crawling tasks through sockets or message queues, notify the master node when data acquisition is complete, and fetch new tasks until all crawling tasks are finished. Distributed crawler systems of this kind, in which the child nodes initiate the task requests, have become popular in the industry.

However, in day-to-day data development projects, crawlers developed collaboratively by several groups for different data source websites are deployed in the same cluster, and the conventional distributed crawler scheduling method can waste hardware resources and reduce data acquisition efficiency. For example, suppose crawlers are divided into types according to data source website: crawlers of type A are deployed on cluster 1, with 10 machines scheduled to collect type A data, and crawlers of type B are deployed on cluster 2, with 5 machines scheduled to collect type B data. After some time, type A data collection is complete while type B data collection is still in progress. Even if types A and B share a task queue, cluster 1, which is responsible for collecting type A data, cannot go on to collect type B data, so time and resources are wasted.

Disclosure of Invention

In view of this, the present application provides a scheduling processing method, an apparatus, and a device for a distributed crawler, and mainly aims to solve the technical problems that hardware resources are wasted and data acquisition efficiency is reduced in a conventional distributed crawler scheduling method.

According to an aspect of the present application, a scheduling processing method for a distributed crawler is provided, the method including:

allocating the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it;

acquiring the dynamic acquisition speed of each sub-cluster while the data source websites are being crawled;

and updating the number of servers and the proxy pool bandwidth of each sub-cluster according to the dynamic acquisition speed.

According to another aspect of the present application, there is provided a scheduling processing apparatus for a distributed crawler, the apparatus including:

an allocation module, configured to allocate the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it;

the acquisition module is used for acquiring the dynamic acquisition speed of each sub-cluster in the process of crawling the data source website;

and the updating module is used for updating the number of the servers of each sub-cluster and the consumed bandwidth of the agent pool according to the dynamic acquisition speed.

According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the scheduling processing method of the distributed crawler described above.

According to still another aspect of the present application, a scheduling processing apparatus for a distributed crawler is provided, which includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, and when the processor executes the program, the processor implements the scheduling processing method for the distributed crawler.

By means of the above technical solution, the scheduling processing method, apparatus, and device for a distributed crawler provided by the present application allocate in advance the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it. Compared with the conventional distributed crawler scheduling method, the most reasonable cluster resource configuration can be calculated dynamically from the dynamic acquisition speed of each sub-cluster while the data source websites are being crawled, and the number of servers and the proxy pool bandwidth of each sub-cluster are updated accordingly. This balances the data acquisition efficiency of the sub-clusters, makes maximum use of the resources of the whole cluster, reduces the waste of hardware resources, and improves data acquisition efficiency.

The foregoing is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flowchart illustrating a scheduling processing method for a distributed crawler according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating another scheduling processing method for a distributed crawler according to an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating an application example provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a scheduling processing apparatus for a distributed crawler according to an embodiment of the present application.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

To solve the technical problems that hardware resources are wasted and data acquisition efficiency is reduced in the conventional distributed crawler scheduling method, this embodiment provides a scheduling processing method for a distributed crawler. As shown in FIG. 1, the method includes:

101. Allocate the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters.

The whole cluster can be divided into a plurality of sub-clusters (the child nodes of the distributed crawler system). Each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it; that is, each sub-cluster performs a single type of crawling task. In this embodiment, the crawling task type can be determined flexibly according to the data source and the development mode; common types include static web pages, dynamic web pages, REST interfaces, and the like.

The execution subject of this embodiment may be an apparatus or a device for scheduling distributed crawlers, and may specifically be configured on the master node side of the distributed crawler system. The master node calculates in advance the resource requirements and crawling time of each type of crawling task and, as the tasks change, dynamically allocates them to the child nodes in the cluster to complete data acquisition, specifically by executing the processes shown in steps 102 to 103.

102. While the data source websites are being crawled, acquire the dynamic acquisition speed of each sub-cluster.

For example, each sub-cluster starts collection for its specific data source website, and its dynamic acquisition speed is monitored at a fixed time interval; the speed can be determined from parameters such as the total number of requests sent by the sub-cluster and the elapsed time. The dynamic acquisition speed of each sub-cluster is then sent to the master node.

103. Update the number of servers and the proxy pool bandwidth of each sub-cluster according to the dynamic acquisition speed of each sub-cluster.

For example, following the example given in the background, crawlers of type A are deployed on sub-cluster 1, with 10 machines scheduled to collect type A data (data source website A), and crawlers of type B are deployed on sub-cluster 2, with 5 machines scheduled to collect type B data (data source website B). Under the conventional distributed crawler scheduling method, after a period of time type A data collection is complete while type B data collection is still in progress. Even if types A and B share a task queue, sub-cluster 1, which is responsible for collecting type A data, cannot go on to collect type B data, so time and resources are wasted. With the scheduling processing method of this embodiment, if the acquisition speed of sub-cluster 1, which is responsible for data source website A, exceeds expectations, the master node can promptly reduce the number of servers in sub-cluster 1, and the freed resources can be allocated to other sub-clusters to improve overall acquisition efficiency. The required proxy bandwidth can likewise be allocated in real time based on the number of servers.

With the scheduling processing method for a distributed crawler described above, the to-be-crawled URL tasks corresponding to each data source website can be allocated to the sub-clusters in advance, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it. Compared with the conventional distributed crawler scheduling method, the most reasonable cluster resource configuration can be calculated dynamically from the dynamic acquisition speed of each sub-cluster while the data source websites are being crawled, and the number of servers and the proxy pool bandwidth of each sub-cluster are updated accordingly. This balances the data acquisition efficiency of the sub-clusters, makes maximum use of the resources of the whole cluster, reduces the waste of hardware resources, and improves data acquisition efficiency.

Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully describe the implementation of this embodiment, this embodiment further provides another scheduling processing method for a distributed crawler, as shown in fig. 2, where the method includes:

201. Acquire the estimated acquisition speed and the number of to-be-crawled URLs corresponding to each data source website.

In this embodiment, in order to more reasonably allocate resources of the distributed crawler system before crawling the data source websites, corresponding allocation may be performed according to the estimated acquisition speed of each data source website and the number of URLs to be crawled, and the processes shown in steps 202 to 203 may be specifically performed.

As an optional manner, taking one of the data source websites (i.e., a target data source website) as an example, acquiring the estimated acquisition speed corresponding to the target data source website may specifically include: first, sending acquisition test requests to the target data source website; and then determining the estimated acquisition speed of the target data source website according to the total number of test requests and the test duration. For example, the initial acquisition speed may be derived from testing as the total number of test requests divided by the test duration.

In this way, the estimated acquisition speed corresponding to each data source website can be obtained accurately, so that when it is used as a reference for allocating the resources of the distributed crawler system, the initial resource allocation is more reasonable, subsequent dynamic updates are less frequent, and the resources spent on dynamic adjustment are saved.
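The test-based estimate described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `estimate_acquisition_speed`, the caller-supplied `send_request` callable, and the injectable `clock` are all hypothetical, and the speed is taken to be requests per second.

```python
import time
from typing import Callable, Iterable


def estimate_acquisition_speed(
    send_request: Callable[[str], None],
    test_urls: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Return the requests per second observed while sending test requests."""
    start = clock()
    total_requests = 0
    for url in test_urls:
        send_request(url)  # hypothetical fetch function supplied by the caller
        total_requests += 1
    duration = clock() - start
    if duration <= 0:
        raise ValueError("test duration must be positive")
    # Estimated speed = total number of test requests / test duration.
    return total_requests / duration
```

Passing the clock as a parameter keeps the sketch easy to exercise without real network traffic; a real test run would pass an HTTP client call as `send_request`.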

202. Calculate the number of servers and the proxy pool bandwidth required to crawl each data source website according to the acquired estimated acquisition speed and the number of to-be-crawled URLs.

As an optional manner, taking one of the data source websites (i.e., the target data source website) as an example, step 202 may specifically include: first, determining the estimated total number of requests of the collection task of the target data source website according to the number of to-be-crawled URLs of the target data source website; then, calculating the number of target servers required to crawl the target data source website according to the estimated total number of requests of the collection task and the estimated acquisition speed of the target data source website, in combination with the planned acquisition time; and finally, allocating the proxy pool bandwidth required to crawl the target data source website according to the number of target servers.

For example, once the number of servers has been calculated, the proxy pool bandwidth required to crawl the target data source website is allocated according to that number: the more servers, the more proxy pool bandwidth is allocated to the target data source website; the fewer servers, the less proxy pool bandwidth is allocated to it. In this way, the number of servers and the proxy pool bandwidth required to crawl each data source website can be calculated accurately, so that when they are used as a reference for allocating the resources of the distributed crawler system, the initial resource allocation is more reasonable, subsequent dynamic updates are less frequent, and the resources spent on dynamic adjustment are saved.
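As a rough illustration of step 202, the sketch below sizes a sub-cluster from the URL count, the estimated speed, and the planned acquisition time. The text does not reproduce the exact formula, so everything here is an assumption: the estimated speed is treated as requests per second per server, `requests_per_url` is a hypothetical factor that converts URLs into requests, and the proxy bandwidth is assumed to grow linearly with the server count (the text only says that more servers receive more bandwidth).

```python
import math


def required_servers(url_count, requests_per_url, speed_per_server, planned_seconds):
    """Servers needed to finish within the planned time (assumed formula)."""
    total_requests = url_count * requests_per_url  # estimated total requests
    capacity_per_server = speed_per_server * planned_seconds
    return max(1, math.ceil(total_requests / capacity_per_server))


def proxy_bandwidth(server_count, bandwidth_per_server_mbps):
    """Proxy pool bandwidth, assumed proportional to the number of servers."""
    return server_count * bandwidth_per_server_mbps
```

For instance, one million URLs at one request per URL, 5 requests per second per server, and a one-hour plan would call for 56 servers under these assumptions.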

203. Deploy the corresponding number of servers to form each sub-cluster according to the calculated number of servers, and allocate the proxy pool bandwidth of each sub-cluster.

Each sub-cluster is equivalent to a child node in the distributed crawler system; each child node is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it.

By creating the child nodes in this way, the resources of the distributed crawler system are allocated more reasonably before the data source websites are crawled, subsequent dynamic updates are less frequent, and the resources spent on dynamic adjustment are saved.
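Forming the sub-clusters in step 203 amounts to carving a pool of servers into per-website groups. The following sketch assumes the server demand per website has already been calculated; `SubCluster`, `form_sub_clusters`, and the per-server bandwidth parameter are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SubCluster:
    site: str
    servers: list = field(default_factory=list)
    proxy_bandwidth_mbps: float = 0.0


def form_sub_clusters(idle_servers, server_demand, bandwidth_per_server_mbps):
    """Assign servers to one sub-cluster per site; return (clusters, leftover)."""
    pool = list(idle_servers)
    clusters = {}
    for site, count in server_demand.items():
        if count > len(pool):
            raise ValueError(f"not enough servers for {site}")
        assigned, pool = pool[:count], pool[count:]
        # Each sub-cluster only gets the proxy bandwidth allocated to it.
        clusters[site] = SubCluster(site, assigned, count * bandwidth_per_server_mbps)
    return clusters, pool
```

Servers that are not needed remain in the returned leftover list, so they are available for later dynamic updates.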

204. Allocate the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters.

And each sub-cluster executes the distributed URL task to be crawled, namely, the sub-cluster starts to crawl the data source website appointed by each sub-cluster.
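One way to picture step 204 is a separate task queue per sub-cluster, keyed by data source website. The sketch below uses in-process `queue.Queue` objects as a stand-in for a real message queue; grouping URLs by hostname and deduplicating them on the master side are assumptions made for the example, and `dispatch_url_tasks` is a hypothetical name.

```python
from collections import defaultdict
from queue import Queue
from urllib.parse import urlsplit


def dispatch_url_tasks(urls, site_of=lambda url: urlsplit(url).hostname):
    """Route each to-be-crawled URL to the queue of its data source website."""
    queues = defaultdict(Queue)
    seen = set()
    for url in urls:
        if url in seen:  # skip duplicates before they reach a sub-cluster
            continue
        seen.add(url)
        queues[site_of(url)].put(url)
    return dict(queues)
```

Each sub-cluster would then consume only the queue of the website it is dedicated to.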

205. While the data source websites are being crawled, acquire the dynamic acquisition speed of each sub-cluster.

As an optional manner, taking one of the sub-clusters (i.e., the target sub-cluster) as an example, acquiring the dynamic acquisition speed of the target sub-cluster may specifically include: first, acquiring the total number of requests that the target sub-cluster has sent to its corresponding data source website and the elapsed time; and then determining the dynamic acquisition speed of the target sub-cluster according to the total number of requests sent and the elapsed time.

For example, the dynamic acquisition speed may be computed as the total number of requests sent divided by the elapsed time. The real-time dynamic acquisition speed of the target sub-cluster is fed back to the master node at a fixed time interval so that the master node can use it as a reference for updating. In this way, the dynamic acquisition speed of each sub-cluster can be obtained accurately, so that the resources of each child node in the distributed crawler system can subsequently be updated accurately.
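The periodic speed feedback of step 205 can be sketched as a small counter kept by each sub-cluster. The class and method names below are hypothetical, and the report format (a plain dictionary) is an assumption; the only rule taken from the text is that the speed is derived from the total requests sent and the elapsed time.

```python
class SubClusterMonitor:
    """Tracks requests sent and elapsed time for one sub-cluster."""

    def __init__(self, site):
        self.site = site
        self.total_requests = 0
        self.elapsed_seconds = 0.0

    def record_period(self, requests_sent, period_seconds):
        # Called once per reporting period with that period's counters.
        self.total_requests += requests_sent
        self.elapsed_seconds += period_seconds

    def dynamic_speed(self):
        # Dynamic speed = total requests sent / elapsed time.
        if self.elapsed_seconds == 0:
            return 0.0
        return self.total_requests / self.elapsed_seconds

    def report(self):
        # Payload that would be fed back to the master node.
        return {"site": self.site, "requests_per_second": self.dynamic_speed()}
```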

206. Update the number of servers and the proxy pool bandwidth of each sub-cluster according to the acquired dynamic acquisition speed.

As an optional manner, step 206 may specifically include: if the dynamic acquisition speed of a first sub-cluster is greater than its estimated acquisition speed, reducing the number of servers of the first sub-cluster according to the acquisition speed difference and a corresponding preset reduction ratio (preset according to actual requirements), and reallocating the proxy pool bandwidth of the first sub-cluster according to the reduced number of servers; and if the dynamic acquisition speed of a second sub-cluster is less than its estimated acquisition speed, increasing the number of servers of the second sub-cluster, using the servers removed from the first sub-cluster, according to the acquisition speed difference and a corresponding preset increase ratio (preset according to actual requirements), and reallocating the proxy pool bandwidth of the second sub-cluster according to the increased number of servers.

For example, if the dynamic acquisition speed of sub-cluster a is greater than its estimated acquisition speed, the number of servers of sub-cluster a is reduced according to the acquisition speed difference and the corresponding reduction ratio, and the proxy pool bandwidth of sub-cluster a is reallocated according to the reduced number of servers. If the dynamic acquisition speed of sub-cluster b is less than its estimated acquisition speed, the number of servers of sub-cluster b is increased according to the acquisition speed difference and the corresponding increase ratio, using the servers removed from sub-cluster a, servers removed from other sub-clusters during the update, other idle servers in the distributed crawler system, and the like, and the proxy pool bandwidth of sub-cluster b is reallocated according to the increased number of servers.
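The update rule of step 206 could look roughly like the sketch below. The text does not say how the speed difference and the preset ratios combine, so the formula used here (relative speed gap times the preset ratio times the current server count) is an assumption, as are the function name, the default ratios, and the linear bandwidth rule. At least one server is always kept in a sub-cluster that is still running.

```python
import math


def rebalance(servers, estimated, dynamic, reduce_ratio=0.5, increase_ratio=0.5,
              idle=0, bandwidth_per_server_mbps=1.0):
    """Return (server counts, proxy bandwidth, idle servers) after one update."""
    updated = dict(servers)
    # Pass 1: sub-clusters running faster than estimated give up servers.
    for name, count in servers.items():
        gap = (dynamic[name] - estimated[name]) / estimated[name]
        if gap > 0:
            release = min(max(0, count - 1), math.floor(count * gap * reduce_ratio))
            updated[name] = count - release
            idle += release
    # Pass 2: sub-clusters running slower than estimated receive freed servers.
    for name, count in servers.items():
        gap = (estimated[name] - dynamic[name]) / estimated[name]
        if gap > 0:
            grant = min(idle, math.ceil(count * gap * increase_ratio))
            updated[name] = count + grant
            idle -= grant
    # Proxy bandwidth is reallocated from the new server counts.
    bandwidth = {n: c * bandwidth_per_server_mbps for n, c in updated.items()}
    return updated, bandwidth, idle
```

With sub-cluster a at 10 servers running 50% faster than estimated and sub-cluster b at 5 servers running 50% slower, this rule moves two servers from a to b.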

Through the dynamic updating mode, the most reasonable cluster resource configuration is dynamically calculated according to the acquisition speeds of different acquisition sources, an efficient distributed crawler solution is realized, and hardware resources are more reasonably utilized to improve the acquisition efficiency of the whole cluster.

Further, in order to fully utilize resources of the distributed crawler system, optionally, after determining that the URL tasks to be crawled corresponding to the sub-clusters are completely executed, if the URL tasks to be crawled of the new data source websites exist, allocating the URL tasks to be crawled of the new data source websites to the sub-clusters where the tasks are completely executed, and continuing to execute until all the URL tasks to be crawled are completely executed. By the method, resources of the distributed crawler system can be efficiently utilized, and the efficiency of crawling the data source website is maximized.
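Reusing finished sub-clusters for new data source websites can be sketched as below. The data shapes are assumptions: `pending_tasks` maps a sub-cluster name to its count of remaining URL tasks, and `new_sites` is a queue of newly arrived data source websites waiting for capacity. Both names, and the function name, are hypothetical.

```python
from collections import deque


def reassign_finished(pending_tasks, new_sites):
    """Give each finished sub-cluster the next waiting data source website."""
    assignments = {}
    for cluster, remaining in pending_tasks.items():
        if remaining == 0 and new_sites:
            # The sub-cluster has no tasks left, so it takes over a new site.
            assignments[cluster] = new_sites.popleft()
    return assignments
```

Sub-clusters that still have tasks are left untouched, and any new websites that could not be placed stay in the queue for the next check.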

To illustrate the specific implementation of the above embodiments in light of the problems in the prior art, the following application example is given by way of illustration and not limitation:

In the scheduling mode of existing distributed crawler systems, crawlers developed collaboratively by several groups for different data sources are deployed in one cluster, so the conventional design can lead to business conflicts or wasted hardware resources. To make full use of server resources, this embodiment adopts a design in which the master node actively allocates tasks. The resources consumed by crawling each data source are calculated in advance, and instructions and tasks are dynamically updated to the child nodes responsible for crawling. The whole system is scheduled uniformly by the master node, so that the resources of the whole cluster are used to the maximum extent and data acquisition efficiency is improved. Specifically, the flow shown in FIG. 3 may be executed:

(1) The master node calculates the scale of each sub-cluster and the related resource consumption, such as the number of servers and the proxy pool bandwidth required to crawl each website, according to the estimated acquisition speed of each data source website and the number of to-be-crawled URLs.

(2) The master node deploys the corresponding number of servers according to the calculation result of step (1) to form the sub-clusters; each sub-cluster is dedicated to crawling one data source and consumes only the proxy bandwidth allocated to it.

(3) The master node distributes the to-be-crawled URLs of each data source to the corresponding sub-cluster through a message queue.

(4) Each sub-cluster starts collection for its specific data source and feeds its acquisition speed back to the master node at a fixed time interval.

(5) The master node updates the number of servers and the proxy pool bandwidth of each sub-cluster according to the dynamically collected acquisition speed of each sub-cluster. If all collection tasks are finished, the flow continues to step (6); otherwise, it returns to step (2).

(6) Once all collection tasks are complete, the master node releases the allocation of all sub-clusters, receives a new round of data sources, and returns to step (1) to repeat the process.
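The six steps above can be tied together in a toy simulation. This is only a sketch under stated assumptions: speeds are requests per second per server, each URL costs one request, and the update in step (5) is simplified to "finished sub-clusters release their servers, and idle servers go to the sub-cluster projected to finish last", which is not the ratio-based rule of step 206. The function name and the input dictionary layout are hypothetical.

```python
import math


def run_round(sites, total_servers, planned_seconds, period_seconds=60, rebalance=True):
    """Simulate one round and return the number of reporting periods it takes.

    sites maps a site name to {"urls": int, "estimated": float, "actual": float}.
    """
    # (1) Size each sub-cluster from the estimated speed and the URL count.
    servers = {
        name: max(1, math.ceil(s["urls"] / (s["estimated"] * planned_seconds)))
        for name, s in sites.items()
    }
    # (2) Form the sub-clusters; whatever is left over stays idle.
    idle = total_servers - sum(servers.values())
    if idle < 0:
        raise ValueError("not enough servers for the planned acquisition time")
    # (3) Hand every sub-cluster its URL tasks (tracked here as a counter).
    remaining = {name: s["urls"] for name, s in sites.items()}

    periods = 0
    while any(remaining.values()):
        periods += 1
        # (4) Each sub-cluster crawls for one period at its actual speed.
        for name, s in sites.items():
            done = servers[name] * s["actual"] * period_seconds
            remaining[name] = max(0, remaining[name] - done)
        if not rebalance:
            continue
        # (5) Finished sub-clusters release their servers ...
        for name in sites:
            if remaining[name] == 0:
                idle += servers[name]
                servers[name] = 0
        # ... and idle servers go to the sub-cluster that would finish last.
        active = [name for name in sites if remaining[name] > 0]
        if active and idle:
            slowest = max(
                active,
                key=lambda n: remaining[n] / (servers[n] * sites[n]["actual"]),
            )
            servers[slowest] += idle
            idle = 0
    # (6) All tasks are done; the caller may start a new round.
    return periods
```

Running the simulation with `rebalance=False` approximates the conventional fixed allocation, which makes the effect of the master node's updates easy to compare.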

Distributed crawlers are powerful tools for mass data collection in the big data era. With a reasonable task allocation and resource scheduling mechanism, an elastically scalable distributed cluster achieves higher data acquisition efficiency. This embodiment provides a distributed crawler system that dynamically allocates crawling tasks; it is scalable and highly concurrent, and it uses hardware resources more reasonably to improve the acquisition efficiency of the whole cluster. In other words, it is a distributed crawler solution that makes more efficient use of cluster resources.

Further, as a specific implementation of the method shown in fig. 1 and fig. 2, this embodiment further provides a scheduling processing apparatus for distributed crawlers, as shown in fig. 4, the apparatus includes: an allocation module 31, an acquisition module 32, and an update module 33.

The allocating module 31 may be configured to allocate the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it;

the acquisition module 32 is configured to acquire a dynamic acquisition speed of each sub-cluster in a process of crawling the data source website;

and the updating module 33 is configured to update the number of servers of each sub-cluster and the bandwidth consumed by the proxy pool according to the dynamic acquisition speed.
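The three-module structure can be pictured as three small interfaces wired together by the apparatus. The sketch below is only one possible arrangement: the `Protocol` classes, their method names, and the `SchedulingApparatus` wrapper are hypothetical and merely mirror the responsibilities of modules 31, 32, and 33.

```python
from typing import Protocol


class AllocationModule(Protocol):
    def allocate(self, url_tasks: dict) -> None: ...


class AcquisitionModule(Protocol):
    def dynamic_speeds(self) -> dict: ...


class UpdateModule(Protocol):
    def update(self, speeds: dict) -> None: ...


class SchedulingApparatus:
    """Wires the three modules together; all names here are hypothetical."""

    def __init__(self, allocation, acquisition, update):
        self.allocation = allocation
        self.acquisition = acquisition
        self.update = update

    def start(self, url_tasks):
        # Module 31: hand each sub-cluster its to-be-crawled URL tasks.
        self.allocation.allocate(url_tasks)

    def tick(self):
        # Module 32 reads the dynamic speeds; module 33 applies the update.
        speeds = self.acquisition.dynamic_speeds()
        self.update.update(speeds)
        return speeds
```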

In a specific application scenario, the apparatus further comprises: a calculation module and a deployment module;

the obtaining module 32 may be further configured to obtain estimated acquisition speeds and numbers of URLs to be crawled corresponding to the data source websites before distributing the URL tasks to be crawled corresponding to the data source websites to the respective sub-clusters;

the calculation module is used for calculating the number of servers and the consumed bandwidth of the agent pool required by crawling the data source website according to the estimated acquisition speed and the number of the URLs to be crawled;

and the deployment module is configured to deploy the corresponding number of servers to form each sub-cluster according to the calculated number of servers, and to allocate the proxy pool bandwidth of each sub-cluster.

In a specific application scenario, the updating module 33 is specifically configured to reduce the number of servers of the first sub-cluster according to a corresponding preset reduction ratio according to a collection speed difference value if the dynamic collection speed of the first sub-cluster is greater than the corresponding estimated collection speed, and reallocate the bandwidth consumed by the proxy pool of the first sub-cluster according to the reduced number of servers; and if the dynamic acquisition speed of the second sub-cluster is smaller than the corresponding pre-estimated acquisition speed, increasing the number of the servers of the second sub-cluster by using the removed servers in the first sub-cluster according to the corresponding preset increase proportion according to the acquisition speed difference, and reallocating the consumed bandwidth of the proxy pool of the second sub-cluster according to the increased number of the servers.

In a specific application scenario, the obtaining module 32 may be specifically configured to send an acquisition test request to a target data source website; and determining the estimated acquisition speed of the target data source website according to the total test request amount and the test duration.

In a specific application scenario, the calculation module is specifically used for determining the total estimated request amount of the collection task of the target data source website according to the number of URLs to be crawled of the target data source website; calculating the number of target servers required for crawling the target data source website according to the estimated request total amount of the acquisition tasks and the estimated acquisition speed of the target data source website and by combining with the planned acquisition time; and distributing the consumed bandwidth of the agent pool required by crawling the target data source website according to the number of the target servers.

In a specific application scenario, the obtaining module 32 may be specifically configured to obtain the total number of requests that the target sub-cluster has sent to its corresponding data source website and the elapsed time; and to determine the dynamic acquisition speed of the target sub-cluster according to the total number of requests sent and the elapsed time.

In a specific application scenario, the allocating module 31 may be further configured to, after determining that the to-be-crawled URL task corresponding to the sub-cluster is completely executed, allocate the to-be-crawled URL task of the new data source website to the sub-cluster in which the task is completely executed to continue executing if the to-be-crawled URL task of the new data source website exists.

It should be noted that other corresponding descriptions of the functional units related to the scheduling processing apparatus for a distributed crawler provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.

Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the scheduling processing method for the distributed crawler shown in fig. 1 and fig. 2.

Based on such understanding, the technical solution of the present embodiment may be embodied in the form of a software product, where the software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the embodiments of the present application.

Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 4, in order to achieve the above object, this embodiment further provides a scheduling processing device for a distributed crawler, which may specifically be a personal computer, a server, a tablet computer, a smart phone, a smart watch, a smart bracelet, or other network devices, and the device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the scheduling processing method of the distributed crawler as shown in fig. 1 and 2.

Optionally, the physical device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.

It will be understood by those skilled in the art that the physical device structure described in the present embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-described physical device and supports the operation of the information processing program as well as other software and/or programs. The network communication module is used to implement communication among the components in the storage medium, as well as communication with other hardware and software in the physical device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical solution of the present embodiment, the URL tasks to be crawled corresponding to the data source websites can be distributed to the sub-clusters in advance, where each sub-cluster is dedicated to crawling one data source website and consumes only the agent pool bandwidth allocated to it. Compared with the traditional distributed crawler scheduling mode, during crawling of the data source websites, the most reasonable cluster resource configuration can be dynamically calculated according to the dynamic acquisition speed of each sub-cluster, and the number of servers and the agent pool consumption bandwidth of each sub-cluster are updated accordingly. This balances the data acquisition efficiency of the sub-clusters, makes maximum use of the resources of the whole cluster, reduces wasted hardware resources, and improves data acquisition efficiency.
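The dynamic update summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not the formula of the present application: the text states only that servers are removed or added according to the acquisition speed difference and a preset proportion, so the relative-difference formula, the default ratios, the `bandwidth_per_server` parameter, and the names `SubClusterState` and `rebalance` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SubClusterState:
    """Hypothetical snapshot of one sub-cluster (speeds in requests per second)."""
    name: str
    servers: int
    estimated_speed: float  # assumed positive, obtained before crawling
    dynamic_speed: float    # measured while crawling
    bandwidth: float = 0.0  # agent pool bandwidth assigned to this sub-cluster


def rebalance(
    clusters: List[SubClusterState],
    reduce_ratio: float = 0.5,          # assumed preset reduction proportion
    increase_ratio: float = 0.5,        # assumed preset increase proportion
    bandwidth_per_server: float = 1.0,  # assumed bandwidth share per server
) -> int:
    """Move servers from faster-than-estimated sub-clusters to slower ones.

    Returns the number of freed servers that were not reassigned.
    """
    freed = 0
    # Pass 1: sub-clusters running faster than estimated give up servers.
    for c in clusters:
        if c.dynamic_speed > c.estimated_speed:
            surplus = (c.dynamic_speed - c.estimated_speed) / c.estimated_speed
            remove = math.floor(c.servers * surplus * reduce_ratio)
            remove = max(0, min(remove, c.servers - 1))  # keep at least one server
            c.servers -= remove
            freed += remove
    # Pass 2: sub-clusters running slower than estimated receive freed servers.
    for c in clusters:
        if c.dynamic_speed < c.estimated_speed and freed > 0:
            deficit = (c.estimated_speed - c.dynamic_speed) / c.estimated_speed
            add = min(freed, math.ceil(c.servers * deficit * increase_ratio))
            c.servers += add
            freed -= add
    # Reallocate agent pool bandwidth in proportion to the new server counts.
    for c in clusters:
        c.bandwidth = c.servers * bandwidth_per_server
    return freed
```

Running the two passes in this order ensures that the servers added to a slow sub-cluster come only from servers removed from a fast one, which matches the description that removed servers are reused rather than new hardware being introduced.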

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The serial numbers used in the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation that those skilled in the art can conceive of is intended to fall within the scope of protection of the present application.

Claims

1. A scheduling processing method of a distributed crawler is characterized by comprising the following steps:

2. The method according to claim 1, wherein before distributing the URL tasks to be crawled corresponding to the data source websites to the respective sub-clusters, the method further comprises:

acquiring the estimated acquisition speed and the number of URLs to be crawled corresponding to the data source websites;

according to the estimated acquisition speed and the number of the URLs to be crawled, calculating the number of servers and the consumed bandwidth of an agent pool required by crawling of the data source website;

and allocating the corresponding number of servers to form each sub-cluster according to the calculated number of the servers, and allocating the bandwidth consumed by the agent pool of each sub-cluster.

3. The method according to claim 2, wherein updating the number of servers and the agent pool consumption bandwidth of each of the sub-clusters according to the dynamic acquisition speed specifically comprises:

if the dynamic acquisition speed of the first sub-cluster is greater than the corresponding estimated acquisition speed, reducing the number of servers of the first sub-cluster according to the acquisition speed difference value and the corresponding preset reduction proportion, and reallocating the agent pool consumption bandwidth of the first sub-cluster according to the reduced number of servers;

and if the dynamic acquisition speed of the second sub-cluster is less than the corresponding estimated acquisition speed, increasing the number of servers of the second sub-cluster, using the servers removed from the first sub-cluster, according to the acquisition speed difference and the corresponding preset increase proportion, and reallocating the agent pool consumption bandwidth of the second sub-cluster according to the increased number of servers.

4. The method according to claim 2, wherein the obtaining of the estimated acquisition speed corresponding to each of the data source websites specifically comprises:

sending a collection test request to a target data source website;

and determining the estimated acquisition speed of the target data source website according to the total test request amount and the test duration.

5. The method according to claim 2, wherein the calculating, according to the estimated acquisition speed and the number of URLs to be crawled, the number of servers and the consumed bandwidth of the agent pool required for crawling the data source websites respectively comprises:

determining the estimated total number of acquisition task requests for a target data source website according to the number of URLs to be crawled of the target data source website;

calculating the number of target servers required for crawling the target data source website according to the estimated total number of acquisition task requests and the estimated acquisition speed of the target data source website, in combination with the planned acquisition time;

and distributing the consumed bandwidth of the agent pool required by crawling the target data source website according to the number of the target servers.

6. The method according to any one of claims 1 to 5, wherein the obtaining the dynamic acquisition speed of each of the sub-clusters specifically comprises:

acquiring the total amount of requests sent by a target sub-cluster to the corresponding data source website and the time consumed;

and determining the dynamic acquisition speed of the target sub-cluster according to the total amount of the sent requests and the consumed time.

7. The method according to any one of claims 1 to 6, further comprising:

after determining that the URL tasks to be crawled corresponding to a sub-cluster have all been executed, if URL tasks to be crawled of a new data source website exist, allocating the URL tasks to be crawled of the new data source website to the sub-cluster whose tasks have been executed, for continued execution.

8. A scheduling processing apparatus for a distributed crawler, comprising:

9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the scheduling processing method of a distributed crawler according to any one of claims 1 to 7.

10. A scheduling processing device for a distributed crawler, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the scheduling processing method for a distributed crawler according to any one of claims 1 to 7 when executing the program.
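For illustration only, the computations recited in claims 4 to 6 might be sketched as follows. The claims do not fix exact formulas, so this sketch assumes that an acquisition speed is the request total divided by the elapsed time, that the estimated speed applies per server, and that each URL requires a fixed number of requests. The names `acquisition_speed`, `required_servers`, and `requests_per_url` are hypothetical.

```python
import math


def acquisition_speed(total_requests: int, duration_seconds: float) -> float:
    """Requests per second over a test run (claim 4) or a live crawl (claim 6)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return total_requests / duration_seconds


def required_servers(
    url_count: int,
    requests_per_url: float,   # assumption: fixed request count per URL
    speed_per_server: float,   # assumption: estimated speed applies per server
    planned_seconds: float,    # planned acquisition time
) -> int:
    """Servers needed to finish within the planned acquisition time (claim 5)."""
    if speed_per_server <= 0 or planned_seconds <= 0:
        raise ValueError("speed_per_server and planned_seconds must be positive")
    total_requests = url_count * requests_per_url
    capacity_per_server = speed_per_server * planned_seconds
    # Round up so the plan is not under-provisioned; always use at least one server.
    return max(1, math.ceil(total_requests / capacity_per_server))
```

Under these assumptions, a test run of 600 requests in 60 seconds gives a speed of 10 requests per second, and crawling 1,000,000 URLs at one request each within one hour would call for 28 servers.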