CN111522654A - Scheduling processing method, device and equipment for distributed crawler - Google Patents

Scheduling processing method, device and equipment for distributed crawler

Info

Publication number
CN111522654A
Authority
CN
China
Prior art keywords
sub
data source
cluster
servers
acquisition speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010190446.0A
Other languages
Chinese (zh)
Inventor
杨绍琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dazhu Hangzhou Technology Co ltd
Original Assignee
Dazhu Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dazhu Hangzhou Technology Co ltd
Priority to CN202010190446.0A
Publication of CN111522654A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a scheduling processing method, apparatus, and device for distributed crawlers, and relates to the technical field of the Internet. The method comprises: first, distributing the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes only the agent pool bandwidth allocated to it; then, while the data source websites are being crawled, acquiring the dynamic acquisition speed of each sub-cluster; and updating the number of servers and the agent pool bandwidth of each sub-cluster according to the dynamic acquisition speed. In this way, the most reasonable cluster resource configuration can be calculated dynamically from the acquisition speeds of the different data sources, yielding an efficient distributed crawler solution that uses hardware resources more reasonably and improves the acquisition efficiency of the whole cluster. The method and apparatus are suitable for scheduling distributed crawlers.

Description

Scheduling processing method, device and equipment for distributed crawler
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a device for scheduling distributed crawlers.
Background
In the field of mass data acquisition, distributed crawlers are widely used because they support high scalability and high concurrency. In practice, a distributed crawler that supports a large number of data acquisition tasks often involves allocating multiple software and hardware resources, including servers, databases, and agent pools.
At present, most existing distributed crawler systems use a master-slave design, which can be described as follows: the master node manages task scheduling, including generating and deduplicating the queue of Uniform Resource Locators (URLs) to be crawled, allocating data storage, and so on. The child nodes obtain crawling tasks through sockets or message queues; after finishing data acquisition, each child node notifies the master node and obtains a new task, until all crawling tasks are completed. Distributed crawler systems of this kind, in which the child nodes initiate the task requests, are popular in the industry.
However, in day-to-day data development projects, multiple crawlers developed collaboratively by a team for different data source websites are deployed in one cluster, and the conventional distributed crawler scheduling method can then waste hardware resources and reduce data acquisition efficiency. For example, suppose crawlers are divided into types by data source website: type A crawlers are deployed on cluster 1, with 10 machines scheduled to collect type A data, and type B crawlers are deployed on cluster 2, with 5 machines scheduled to collect type B data. After some time, type A data collection is complete while type B data collection is still in progress. Even if types A and B share a task queue, cluster 1, which is responsible for collecting type A data, cannot go on to collect type B data, so time and resources are wasted.
Disclosure of Invention
In view of this, the present application provides a scheduling processing method, an apparatus, and a device for a distributed crawler, and mainly aims to solve the technical problems that hardware resources are wasted and data acquisition efficiency is reduced in a conventional distributed crawler scheduling method.
According to an aspect of the present application, a scheduling processing method for a distributed crawler is provided, the method including:
distributing the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes the agent pool bandwidth allocated to it;
acquiring the dynamic acquisition speed of each sub-cluster in the process of crawling the data source website;
and updating the number of servers of each sub-cluster and the consumed bandwidth of the agent pool according to the dynamic acquisition speed.
According to another aspect of the present application, there is provided a scheduling processing apparatus for a distributed crawler, the apparatus including:
an allocation module, configured to allocate the to-be-crawled URL tasks corresponding to each data source website to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes the agent pool bandwidth allocated to it;
the acquisition module is used for acquiring the dynamic acquisition speed of each sub-cluster in the process of crawling the data source website;
and the updating module is used for updating the number of the servers of each sub-cluster and the consumed bandwidth of the agent pool according to the dynamic acquisition speed.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the scheduling processing method of the distributed crawler described above.
According to still another aspect of the present application, a scheduling processing apparatus for a distributed crawler is provided, which includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, and when the processor executes the program, the processor implements the scheduling processing method for the distributed crawler.
With the above technical solution, the scheduling processing method, apparatus, and device for a distributed crawler provided by the present application can allocate the to-be-crawled URL tasks corresponding to each data source website to sub-clusters in advance, where each sub-cluster is dedicated to crawling one data source website and consumes only the agent pool bandwidth allocated to it. Compared with the conventional distributed crawler scheduling method, the present application can, while the data source websites are being crawled, dynamically calculate the most reasonable cluster resource configuration from the dynamic acquisition speed of each sub-cluster and update the number of servers and the agent pool bandwidth of each sub-cluster. This balances the data acquisition efficiency of the sub-clusters, utilizes the resources of the whole cluster to the maximum extent, reduces hardware resource waste, and improves data acquisition efficiency.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flowchart illustrating a scheduling processing method for a distributed crawler according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating another scheduling processing method for a distributed crawler according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an application example provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a scheduling processing apparatus for a distributed crawler according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
To solve the technical problems that the conventional distributed crawler scheduling method wastes hardware resources and reduces data acquisition efficiency, this embodiment provides a scheduling processing method for a distributed crawler. As shown in FIG. 1, the method includes:
101. Distribute the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters.
The whole cluster can be divided into a plurality of sub-clusters (the child nodes of the distributed crawler system). Each sub-cluster is dedicated to crawling one data source website and consumes the agent pool bandwidth allocated to it; that is, each sub-cluster performs a single type of crawling task. In this embodiment, the crawling task type can be determined flexibly according to the data source and the development mode; common types include static web pages, dynamic web pages, REST interfaces, and the like.
The execution subject of this embodiment may be an apparatus or a device for scheduling processing of the distributed crawler, and may specifically be configured on the master node side of the distributed crawler system. The master node can calculate in advance the resource requirements and crawling time of each type of crawling task and, as the tasks change, dynamically allocate resources to the child nodes in the cluster to complete data acquisition, specifically by executing steps 102 to 103.
102. While the data source websites are being crawled, acquire the dynamic acquisition speed of each sub-cluster.
For example, each sub-cluster starts acquisition for its specific data source website, and the dynamic acquisition speed of each sub-cluster is monitored at regular intervals; the dynamic acquisition speed can be determined from parameters such as the total number of requests sent by the sub-cluster and the elapsed time. The acquired dynamic acquisition speed of each sub-cluster is then sent to the master node side.
103. Update the number of servers and the agent pool bandwidth of each sub-cluster according to the dynamic acquisition speed of each sub-cluster.
For example, following the example in the Background, type A crawlers are deployed on sub-cluster 1, with 10 machines scheduled to collect type A data (data source website A), and type B crawlers are deployed on sub-cluster 2, with 5 machines scheduled to collect type B data (data source website B). Under the conventional distributed crawler scheduling method, after a period of time type A data collection is complete while type B data collection is still in progress. Even if types A and B share a task queue, sub-cluster 1, which is responsible for collecting type A data, cannot go on to collect type B data, so time and resources are wasted. With the scheduling processing method of this embodiment, if the acquisition speed of sub-cluster 1, which is responsible for data source website A, exceeds expectations, the master node can promptly reduce the number of servers in sub-cluster 1, and the freed resources can be allocated to other sub-clusters to improve overall acquisition efficiency. The required agent bandwidth can likewise be allocated in real time based on the number of servers.
With the scheduling processing method for a distributed crawler provided by this embodiment, the to-be-crawled URL tasks corresponding to each data source website can be allocated to sub-clusters in advance, where each sub-cluster is dedicated to crawling one data source website and consumes only the agent pool bandwidth allocated to it. Compared with the conventional distributed crawler scheduling method, while the data source websites are being crawled, the most reasonable cluster resource configuration can be calculated dynamically from the dynamic acquisition speed of each sub-cluster, and the number of servers and the agent pool bandwidth of each sub-cluster can be updated. This balances the data acquisition efficiency of the sub-clusters, utilizes the resources of the whole cluster to the maximum extent, reduces hardware resource waste, and improves data acquisition efficiency.
Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully describe the implementation of this embodiment, this embodiment further provides another scheduling processing method for a distributed crawler, as shown in fig. 2, where the method includes:
201. Acquire the estimated acquisition speed and the number of URLs to be crawled for each data source website.
In this embodiment, so that the resources of the distributed crawler system are allocated more reasonably before the data source websites are crawled, the allocation may be made according to the estimated acquisition speed of each data source website and its number of URLs to be crawled, specifically by executing steps 202 to 203.
Optionally, taking one of the data source websites (i.e., the target data source website) as an example, acquiring the estimated acquisition speed corresponding to the target data source website may specifically include: first, sending acquisition test requests to the target data source website; then, determining the estimated acquisition speed of the target data source website from the total number of test requests and the test duration. For example, the initial acquisition speed may be derived from testing:
\[ \text{estimated acquisition speed} = \frac{\text{total number of test requests}}{\text{test duration}} \]
In this way, the estimated acquisition speed of each data source website can be obtained accurately and used as a reference when the resources of the distributed crawler system are allocated, so that the initial resource allocation is more reasonable, subsequent dynamic updates are less frequent, and the resources spent on dynamic adjustment are saved.
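The test-based estimate in step 201 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: `estimate_acquisition_speed`, `send_test_request`, and the injectable `clock` are hypothetical names, and the requests are assumed to be sent sequentially from a single machine.

```python
import time
from typing import Callable


def estimate_acquisition_speed(
    send_test_request: Callable[[], None],
    total_test_requests: int,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Return requests per second measured over a batch of test requests.

    Assumption: speed = total number of test requests / test duration.
    """
    if total_test_requests <= 0:
        raise ValueError("total_test_requests must be positive")
    start = clock()
    for _ in range(total_test_requests):
        send_test_request()  # hypothetical callable that issues one request
    duration = clock() - start
    if duration <= 0:
        raise ValueError("test duration must be positive")
    return total_test_requests / duration
```

Passing the clock as a parameter keeps the sketch easy to exercise without real network traffic.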
202. Calculate the number of servers and the agent pool bandwidth required to crawl each data source website from the acquired estimated acquisition speed and number of URLs to be crawled.
Optionally, taking one of the data source websites (i.e., the target data source website) as an example, step 202 may specifically include: first, determining the estimated total number of requests of the collection task for the target data source website from its number of URLs to be crawled; then, calculating the number of target servers required to crawl the target data source website from the estimated total number of requests and the estimated acquisition speed of the target data source website, in combination with the planned acquisition time; and finally, allocating the agent pool bandwidth required to crawl the target data source website according to the number of target servers.
For example,
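\[ \text{number of servers} = \frac{\text{estimated total number of requests}}{\text{estimated acquisition speed} \times \text{planned acquisition time}} \]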
The agent pool bandwidth required to crawl the target data source website is then allocated according to the calculated number of servers: the more servers, the more agent pool bandwidth is allocated to the target data source website; the fewer servers, the less. In this way, the number of servers and the agent pool bandwidth required to crawl each data source website can be calculated accurately and used as a reference when the resources of the distributed crawler system are allocated, so that the initial resource allocation is more reasonable, subsequent dynamic updates are less frequent, and the resources spent on dynamic adjustment are saved.
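One way to read step 202 as arithmetic is sketched below. The text does not fix the exact formulas, so several things here are assumptions: the estimated speed is taken to be per server, `requests_per_url` is a hypothetical factor for turning a URL count into a request count, and bandwidth is assumed to grow linearly with the number of servers (the text only says more servers means more bandwidth).

```python
import math


def required_servers(
    urls_to_crawl: int,
    requests_per_url: float,         # assumption: average requests needed per URL
    est_speed_per_server: float,     # assumption: requests/second for one server
    planned_seconds: float,          # planned acquisition time
) -> int:
    """Servers needed to finish the estimated request volume in the planned time."""
    if est_speed_per_server <= 0 or planned_seconds <= 0:
        raise ValueError("speed and planned time must be positive")
    total_requests = urls_to_crawl * requests_per_url
    return math.ceil(total_requests / (est_speed_per_server * planned_seconds))


def proxy_bandwidth(servers: int, bandwidth_per_server_mbps: float) -> float:
    """Assumed linear rule: bandwidth scales with the number of servers."""
    return servers * bandwidth_per_server_mbps
```

For instance, 100,000 URLs at one request each, 5 requests per second per server, and a one-hour plan round up to 6 servers.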
203. Deploy the corresponding number of servers to form each sub-cluster according to the calculated number of servers, and allocate the agent pool bandwidth of each sub-cluster.
Each sub-cluster is equivalent to a child node in the distributed crawler system; each child node is dedicated to crawling one data source website and consumes only the agent pool bandwidth allocated to it.
Through the mode of creating the child nodes, resources of the distributed crawler system are more reasonably distributed before the data source website is crawled, the frequency of subsequent dynamic updating is reduced, and dynamic adjustment resources are saved.
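Forming the sub-clusters in step 203 amounts to partitioning a pool of servers by the per-site counts computed earlier. The sketch below assumes servers are interchangeable and identified by plain strings; `form_sub_clusters` and its arguments are hypothetical names.

```python
from typing import Dict, List, Tuple


def form_sub_clusters(
    idle_servers: List[str], demand: Dict[str, int]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split idle servers into one sub-cluster per data source website.

    Returns the sub-clusters and whatever servers remain idle.
    """
    if sum(demand.values()) > len(idle_servers):
        raise ValueError("not enough servers for the requested sub-clusters")
    pool = list(idle_servers)
    clusters: Dict[str, List[str]] = {}
    for site, count in demand.items():
        # Take the next `count` servers for this site; the rest stay in the pool.
        clusters[site], pool = pool[:count], pool[count:]
    return clusters, pool
```

Servers left over after the split remain available for later rebalancing.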
204. Distribute the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters.
Each sub-cluster executes the to-be-crawled URL tasks distributed to it; that is, each sub-cluster starts crawling its designated data source website.
205. While the data source websites are being crawled, acquire the dynamic acquisition speed of each sub-cluster.
Optionally, taking one of the sub-clusters (i.e., the target sub-cluster) as an example, acquiring the dynamic acquisition speed of the target sub-cluster may specifically include: first, acquiring the total number of requests the target sub-cluster has sent to its corresponding data source website and the elapsed time; then, determining the dynamic acquisition speed of the target sub-cluster from the total number of requests sent and the elapsed time.
For example,
\[ \text{dynamic acquisition speed} = \frac{\text{total number of requests sent}}{\text{elapsed time}} \]
The real-time dynamic acquisition speed of the target sub-cluster is fed back to the master node at regular intervals, so that the master node can use it as a reference for updating. In this way, the dynamic acquisition speed of each sub-cluster can be acquired accurately, so that the resources of each child node in the distributed crawler system can subsequently be updated accurately.
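The per-sub-cluster measurement in step 205 can be kept in a small accumulator, sketched below. `SpeedMonitor` is a hypothetical helper; it assumes each sub-cluster reports how many requests it sent in each reporting interval and how long that interval lasted.

```python
from dataclasses import dataclass


@dataclass
class SpeedMonitor:
    """Accumulates request counts and elapsed time for one sub-cluster."""

    requests_sent: int = 0
    elapsed_seconds: float = 0.0

    def record(self, requests: int, seconds: float) -> None:
        """Add one reporting interval's worth of work."""
        self.requests_sent += requests
        self.elapsed_seconds += seconds

    def dynamic_speed(self) -> float:
        """Total requests sent divided by elapsed time (0.0 before any report)."""
        if self.elapsed_seconds == 0:
            return 0.0
        return self.requests_sent / self.elapsed_seconds
```

The value returned by `dynamic_speed()` is what would be fed back to the master node.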
206. Update the number of servers and the agent pool bandwidth of each sub-cluster according to the acquired dynamic acquisition speed.
To illustrate the specific implementation process of step 206, as an alternative, step 206 may specifically include: if the dynamic acquisition speed of the first sub-cluster is greater than the corresponding estimated acquisition speed, reducing the number of servers of the first sub-cluster according to the acquisition speed difference value and the corresponding preset reduction proportion (preset according to actual requirements), and reallocating the agent pool consumption bandwidth of the first sub-cluster according to the reduced number of servers; and if the dynamic acquisition speed of the second sub-cluster is smaller than the corresponding pre-estimated acquisition speed, increasing the number of the servers of the second sub-cluster by using the removed servers in the first sub-cluster according to the corresponding preset increase proportion (preset according to actual requirements) according to the acquisition speed difference, and reallocating the consumed bandwidth of the agent pool of the second sub-cluster according to the increased number of the servers.
For example, if the dynamic acquisition speed of the sub-cluster a is greater than the corresponding estimated acquisition speed, the number of servers of the sub-cluster a is reduced according to the acquisition speed difference value and the corresponding reduction proportion, and the bandwidth consumed by the agent pool of the sub-cluster a is redistributed according to the reduced number of servers. And if the dynamic acquisition speed of the sub-cluster b is smaller than the corresponding estimated acquisition speed, increasing the number of the servers of the sub-cluster b by utilizing the removed servers in the sub-cluster a, the updated removed servers of other sub-clusters, other idle servers in the distributed crawler system and the like according to the corresponding proportion of the acquisition speed difference, and redistributing the agent pool consumption bandwidth of the sub-cluster b according to the increased number of the servers.
Through the dynamic updating mode, the most reasonable cluster resource configuration is dynamically calculated according to the acquisition speeds of different acquisition sources, an efficient distributed crawler solution is realized, and hardware resources are more reasonably utilized to improve the acquisition efficiency of the whole cluster.
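The update in step 206 can be sketched as a two-pass rebalance. The text says only that servers are removed or added according to the speed difference and a preset proportion, so the concrete rule here is an assumption: the change is the current server count times the relative speed gap times the preset ratio, and every sub-cluster keeps at least one server. The names `rebalance`, `reduce_ratio`, and `increase_ratio` are hypothetical.

```python
import math
from typing import Dict, Tuple


def rebalance(
    servers: Dict[str, int],
    estimated: Dict[str, float],
    dynamic: Dict[str, float],
    reduce_ratio: float = 1.0,    # preset reduction proportion (assumed form)
    increase_ratio: float = 1.0,  # preset increase proportion (assumed form)
    idle: int = 0,                # idle servers already available
) -> Tuple[Dict[str, int], int]:
    """Return updated server counts per sub-cluster and the remaining idle servers."""
    updated = dict(servers)
    # Pass 1: sub-clusters running faster than estimated give servers back.
    for name, n in servers.items():
        gap = (dynamic[name] - estimated[name]) / estimated[name]
        if gap > 0:
            removed = min(max(0, n - 1), math.floor(n * gap * reduce_ratio))
            updated[name] = n - removed
            idle += removed
    # Pass 2: sub-clusters running slower than estimated take servers from the pool.
    for name, n in servers.items():
        gap = (estimated[name] - dynamic[name]) / estimated[name]
        if gap > 0 and idle > 0:
            added = min(idle, math.ceil(n * gap * increase_ratio))
            updated[name] = n + added
            idle -= added
    return updated, idle
```

After the counts change, the agent pool bandwidth of each affected sub-cluster would be recomputed from its new server count.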
Further, in order to fully utilize resources of the distributed crawler system, optionally, after determining that the URL tasks to be crawled corresponding to the sub-clusters are completely executed, if the URL tasks to be crawled of the new data source websites exist, allocating the URL tasks to be crawled of the new data source websites to the sub-clusters where the tasks are completely executed, and continuing to execute until all the URL tasks to be crawled are completely executed. By the method, resources of the distributed crawler system can be efficiently utilized, and the efficiency of crawling the data source website is maximized.
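Reusing a finished sub-cluster for a new data source website can be sketched as below. This is an illustration under assumptions: `assignments` maps each sub-cluster to its current website, `remaining` holds the number of URLs left per website, and `pending_sites` is a queue of new websites waiting for a sub-cluster. All three names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional


def reassign_finished(
    assignments: Dict[str, Optional[str]],
    remaining: Dict[str, int],
    pending_sites: Deque[str],
) -> Dict[str, Optional[str]]:
    """Give each finished sub-cluster the next pending website, if any."""
    updated: Dict[str, Optional[str]] = {}
    for cluster, site in assignments.items():
        finished = site is None or remaining.get(site, 0) == 0
        if finished:
            # No pending website left: the sub-cluster stays unassigned (None).
            site = pending_sites.popleft() if pending_sites else None
        updated[cluster] = site
    return updated
```

A sub-cluster left with `None` has nothing to crawl and its servers can be returned to the idle pool.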
To illustrate the specific implementation of the above embodiments in light of the problems in the prior art, the following non-limiting application example is given:
In the scheduling mode of existing distributed crawler systems, multiple crawlers developed collaboratively by a team for different data sources are deployed in one cluster, and the conventional design can lead to business conflicts or wasted hardware resources. To make full use of server resources, this embodiment adopts a design in which the master node actively allocates tasks. The resources consumed by crawling each data source are calculated in advance, and instructions and tasks are dynamically pushed to the child nodes responsible for crawling. The whole system is scheduled uniformly by the master node, so that the resources of the whole cluster are utilized to the maximum extent and data acquisition efficiency is improved. Specifically, the flow shown in FIG. 3 may be executed:
(1) The master node calculates the scale of each sub-cluster and the related resource consumption, such as the number of servers and the agent pool bandwidth required to crawl each website, from the estimated acquisition speed of each data source website and its number of URLs to be crawled.
(2) The master node allocates the corresponding number of servers according to the result of step (1) to form the sub-clusters; each sub-cluster focuses on crawling one data source and consumes only the agent bandwidth allocated to it.
(3) The master node distributes the URLs to be crawled for each data source to the sub-clusters through a message queue.
(4) Each sub-cluster starts acquisition for its specific data source and feeds its acquisition speed back to the master node at regular intervals.
(5) The master node updates the number of servers and the agent pool bandwidth of each sub-cluster according to the dynamically collected acquisition speed of each sub-cluster. If all acquisition tasks are finished, the flow continues to step (6); otherwise, it returns to step (2).
(6) Once all acquisition tasks are complete, the master node releases the allocation of all sub-clusters, receives a new round of data sources, and returns to step (1) to repeat the process.
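The loop formed by steps (1) to (5) can be simulated in a few lines. This is a toy model, not the patented scheduler: it substitutes a simple greedy rule (give each spare server to the site with the most remaining work per server) for the speed-difference rule in the text, omits the message queue, and assumes every server on a site crawls at that site's fixed per-server speed. `allocate` and `run_schedule` are hypothetical names.

```python
from typing import Dict


def allocate(
    remaining: Dict[str, float],
    speed_per_server: Dict[str, float],
    total_servers: int,
) -> Dict[str, int]:
    """Steps (1)-(2): size one sub-cluster per unfinished site."""
    active = {s: r for s, r in remaining.items() if r > 0}
    if not active:
        return {}
    if len(active) > total_servers:
        raise ValueError("fewer servers than unfinished sites")
    # Remaining work in server-seconds for each site.
    work = {s: r / speed_per_server[s] for s, r in active.items()}
    alloc = {s: 1 for s in active}
    for _ in range(total_servers - len(active)):
        # Greedy: the site with the most remaining work per server gets the next one.
        busiest = max(alloc, key=lambda s: work[s] / alloc[s])
        alloc[busiest] += 1
    return alloc


def run_schedule(
    urls: Dict[str, float],
    speed_per_server: Dict[str, float],
    total_servers: int,
    period: float = 60.0,
) -> int:
    """Steps (3)-(5): crawl in feedback periods; return how many were needed."""
    remaining = dict(urls)
    periods = 0
    while any(r > 0 for r in remaining.values()):
        alloc = allocate(remaining, speed_per_server, total_servers)
        for site, n in alloc.items():
            done = n * speed_per_server[site] * period
            remaining[site] = max(0.0, remaining[site] - done)
        periods += 1
    return periods
```

In this model, once one site finishes, its servers move to the unfinished sites in the next period, which is the behavior the flow above is designed to achieve.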
Distributed crawlers are a powerful tool for mass data collection in the big data era. With a reasonable task allocation and resource scheduling mechanism, an elastically scalable distributed cluster achieves higher data acquisition efficiency. This embodiment provides a distributed crawler system that dynamically allocates crawling tasks; it is scalable and highly concurrent, and it uses hardware resources more reasonably to improve the acquisition efficiency of the whole cluster. In other words, it proposes a distributed crawler solution that uses cluster resources more efficiently.
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, this embodiment further provides a scheduling processing apparatus for distributed crawlers, as shown in fig. 4, the apparatus includes: an allocation module 31, an acquisition module 32, and an update module 33.
The allocation module 31 may be configured to allocate the to-be-crawled URL tasks corresponding to each data source website to the sub-clusters, where each sub-cluster is dedicated to crawling one data source website and consumes the agent pool bandwidth allocated to it;
the acquisition module 32 is configured to acquire a dynamic acquisition speed of each sub-cluster in a process of crawling the data source website;
and the updating module 33 is configured to update the number of servers of each sub-cluster and the bandwidth consumed by the proxy pool according to the dynamic acquisition speed.
In a specific application scenario, the apparatus further comprises: a calculation module and a deployment module;
the acquisition module 32 may be further configured to acquire the estimated acquisition speed and the number of URLs to be crawled for each data source website before the to-be-crawled URL tasks corresponding to the data source websites are distributed to the sub-clusters;
the calculation module is configured to calculate the number of servers and the agent pool bandwidth required to crawl each data source website from the estimated acquisition speed and the number of URLs to be crawled;
and the deployment module is configured to deploy the corresponding number of servers to form each sub-cluster according to the calculated number of servers, and to allocate the agent pool bandwidth of each sub-cluster.
In a specific application scenario, the updating module 33 is specifically configured to reduce the number of servers of the first sub-cluster according to a corresponding preset reduction ratio according to a collection speed difference value if the dynamic collection speed of the first sub-cluster is greater than the corresponding estimated collection speed, and reallocate the bandwidth consumed by the proxy pool of the first sub-cluster according to the reduced number of servers; and if the dynamic acquisition speed of the second sub-cluster is smaller than the corresponding pre-estimated acquisition speed, increasing the number of the servers of the second sub-cluster by using the removed servers in the first sub-cluster according to the corresponding preset increase proportion according to the acquisition speed difference, and reallocating the consumed bandwidth of the proxy pool of the second sub-cluster according to the increased number of the servers.
In a specific application scenario, the acquisition module 32 may be specifically configured to send acquisition test requests to a target data source website, and to determine the estimated acquisition speed of the target data source website from the total number of test requests and the test duration.
In a specific application scenario, the calculation module is specifically configured to: determine the estimated total number of requests for the acquisition task of the target data source website according to its number of URLs to be crawled; calculate the number of target servers required to crawl the target data source website according to that estimated total, the estimated acquisition speed of the target data source website, and the planned acquisition time; and allocate the proxy pool bandwidth required to crawl the target data source website according to the number of target servers.
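One way to read this calculation is sketched below. Several points are assumptions, not statements from the text: that the estimated acquisition speed is a per-server rate, that each URL costs a fixed average number of requests (`requests_per_url`), and that proxy pool bandwidth is a fixed amount per server (`bandwidth_per_server`).

```python
import math
from typing import Tuple


def plan_sub_cluster(
    url_count: int,
    requests_per_url: float,
    speed_per_server: float,
    planned_seconds: float,
    bandwidth_per_server: float,
) -> Tuple[int, float]:
    """Return (server count, proxy pool bandwidth) for one data source website."""
    # Estimated total number of requests for the acquisition task.
    total_requests = url_count * requests_per_url
    # Requests one server can complete within the planned acquisition time.
    capacity_per_server = speed_per_server * planned_seconds
    servers = max(1, math.ceil(total_requests / capacity_per_server))
    return servers, servers * bandwidth_per_server
```

With 1,000,000 URLs at 1.5 requests each, 5 requests per second per server, and a 10-hour plan, the sketch yields 9 servers, because 1,500,000 / (5 × 36,000) ≈ 8.33 is rounded up.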
In a specific application scenario, the obtaining module 32 may be specifically configured to obtain the total number of requests that the target sub-cluster has sent to its corresponding data source website and the time consumed, and to determine the dynamic acquisition speed of the target sub-cluster according to the total number of requests sent and the time consumed.
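The measurement described here could look like the following sketch, which assumes the dynamic acquisition speed is the running total of sent requests divided by the time elapsed since the sub-cluster started crawling. The `SpeedTracker` class and its method names are hypothetical.

```python
import time
from typing import Callable


class SpeedTracker:
    """Tracks requests sent by one sub-cluster and reports requests per second."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()   # moment the sub-cluster began crawling
        self._sent = 0

    def record(self, count: int = 1) -> None:
        """Add requests that the sub-cluster has sent to its data source website."""
        self._sent += count

    def dynamic_speed(self) -> float:
        """Total requests sent divided by time consumed so far."""
        elapsed = self._clock() - self._start
        return self._sent / elapsed if elapsed > 0 else 0.0
```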
In a specific application scenario, the allocating module 31 may be further configured to, after determining that a sub-cluster has finished executing its URL tasks to be crawled, allocate the URL tasks to be crawled of a new data source website, if any exist, to that sub-cluster so that it continues executing.
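A small sketch of this reassignment, assuming pending data source websites wait in a first-in, first-out queue. The function name and the three data structures (`remaining`, `current_site`, `pending`) are hypothetical bookkeeping, not part of the described apparatus.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def reassign_finished(
    remaining: Dict[str, int],
    current_site: Dict[str, str],
    pending: Deque[Tuple[str, int]],
) -> List[str]:
    """Give each finished sub-cluster the next pending website; return the reassigned names."""
    reassigned = []
    for name in remaining:
        # A sub-cluster is finished when it has no URLs left to crawl.
        if remaining[name] == 0 and pending:
            site, url_count = pending.popleft()
            current_site[name] = site
            remaining[name] = url_count
            reassigned.append(name)
    return reassigned
```

Sub-clusters that are still crawling keep their current website, and a finished sub-cluster stays idle when no new website is pending.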
It should be noted that other corresponding descriptions of the functional units related to the scheduling processing apparatus for a distributed crawler provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the scheduling processing method for the distributed crawler shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present embodiment may be embodied in the form of a software product, where the software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 4, in order to achieve the above object, this embodiment further provides a scheduling processing device for a distributed crawler, which may specifically be a personal computer, a server, a tablet computer, a smart phone, a smart watch, a smart bracelet, or other network devices, and the device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the scheduling processing method of the distributed crawler as shown in fig. 1 and 2.
Optionally, the physical device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), etc.
It will be understood by those skilled in the art that the physical device structure provided in the present embodiment does not limit the physical device, which may include more or fewer components, combine some components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-described physical devices, and supports the operation of the information processing program as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the information processing entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. By applying the technical solution of this embodiment, the URL tasks to be crawled corresponding to the data source websites can be distributed to the sub-clusters in advance, where each sub-cluster is dedicated to crawling one data source website and consumes only the proxy pool bandwidth allocated to it. Compared with the traditional distributed crawler scheduling mode, during crawling of the data source websites the most reasonable cluster resource configuration can be calculated dynamically from the dynamic acquisition speed of each sub-cluster, and the number of servers and the proxy pool bandwidth of each sub-cluster can be updated accordingly. This balances the data acquisition efficiency of the sub-clusters, makes maximum use of the resources of the whole cluster, reduces wasted hardware resources, and improves data acquisition efficiency.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The serial numbers used in the present application are for description only and do not indicate the relative merit of the implementation scenarios. The above disclosure covers only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that those skilled in the art can conceive of is intended to fall within the scope of the present application.

Claims (10)

1. A scheduling processing method of a distributed crawler is characterized by comprising the following steps:
distributing URL tasks to be crawled corresponding to data source websites to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes the proxy pool bandwidth allocated to it;
acquiring the dynamic acquisition speed of each sub-cluster in the process of crawling the data source website;
and updating the number of servers of each sub-cluster and the consumed bandwidth of the agent pool according to the dynamic acquisition speed.
2. The method according to claim 1, wherein before distributing the URL tasks to be crawled corresponding to the data source websites to the respective sub-clusters, the method further comprises:
acquiring the estimated acquisition speed and the number of URLs to be crawled corresponding to the data source websites;
calculating, according to the estimated acquisition speed and the number of URLs to be crawled, the number of servers and the proxy pool bandwidth required for crawling each data source website;
and allocating the calculated number of servers to form each sub-cluster, and allocating the proxy pool bandwidth of each sub-cluster.
3. The method according to claim 2, wherein updating the number of servers and the bandwidth consumed by the proxy pool of each of the sub-clusters according to the dynamic collection speed specifically comprises:
if the dynamic acquisition speed of a first sub-cluster is greater than its estimated acquisition speed, reducing the number of servers of the first sub-cluster by the preset reduction ratio corresponding to the acquisition speed difference, and reallocating the proxy pool bandwidth of the first sub-cluster according to the reduced number of servers;
and if the dynamic acquisition speed of a second sub-cluster is less than its estimated acquisition speed, increasing the number of servers of the second sub-cluster by the preset increase ratio corresponding to the acquisition speed difference, using the servers removed from the first sub-cluster, and reallocating the proxy pool bandwidth of the second sub-cluster according to the increased number of servers.
4. The method according to claim 2, wherein the obtaining of the estimated acquisition speed corresponding to each of the data source websites specifically comprises:
sending a collection test request to a target data source website;
and determining the estimated acquisition speed of the target data source website according to the total test request amount and the test duration.
5. The method according to claim 2, wherein the calculating, according to the estimated acquisition speed and the number of URLs to be crawled, the number of servers and the consumed bandwidth of the agent pool required for crawling the data source websites respectively comprises:
determining the estimated total number of requests for the acquisition task of a target data source website according to the number of URLs to be crawled of the target data source website;
calculating the number of target servers required for crawling the target data source website according to the estimated total number of requests for the acquisition task, the estimated acquisition speed of the target data source website, and the planned acquisition time;
and allocating the proxy pool bandwidth required for crawling the target data source website according to the number of target servers.
6. The method according to any one of claims 1 to 5, wherein the obtaining the dynamic acquisition speed of each of the sub-clusters specifically comprises:
acquiring the total number of requests sent by a target sub-cluster to the corresponding data source website and the time consumed;
and determining the dynamic acquisition speed of the target sub-cluster according to the total amount of the sent requests and the consumed time.
7. The method according to any one of claims 1 to 6, further comprising:
after determining that a sub-cluster has finished executing its URL tasks to be crawled, if URL tasks to be crawled of a new data source website exist, allocating the URL tasks to be crawled of the new data source website to the sub-cluster that has finished, for continued execution.
8. A scheduling processing apparatus for a distributed crawler, comprising:
an allocation module, configured to distribute URL tasks to be crawled corresponding to data source websites to sub-clusters, wherein each sub-cluster is dedicated to crawling one data source website and consumes the proxy pool bandwidth allocated to it;
an acquisition module, configured to acquire the dynamic acquisition speed of each sub-cluster in the process of crawling the data source websites;
and an updating module, configured to update the number of servers and the proxy pool bandwidth of each sub-cluster according to the dynamic acquisition speed.
9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the scheduling processing method of a distributed crawler according to any one of claims 1 to 7.
10. A scheduling processing device for a distributed crawler, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the scheduling processing method for a distributed crawler according to any one of claims 1 to 7 when executing the program.
CN202010190446.0A 2020-03-18 2020-03-18 Scheduling processing method, device and equipment for distributed crawler Pending CN111522654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010190446.0A CN111522654A (en) 2020-03-18 2020-03-18 Scheduling processing method, device and equipment for distributed crawler

Publications (1)

Publication Number Publication Date
CN111522654A true CN111522654A (en) 2020-08-11

Family

ID=71901839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010190446.0A Pending CN111522654A (en) 2020-03-18 2020-03-18 Scheduling processing method, device and equipment for distributed crawler

Country Status (1)

Country Link
CN (1) CN111522654A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080304411A1 (en) * 2007-06-05 2008-12-11 Oki Electric Industry Co., Ltd. Bandwidth control system and method capable of reducing traffic congestion on content servers
CN102902669A (en) * 2011-07-22 2013-01-30 同程网络科技股份有限公司 Distribution information capturing method based on internet system
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN106021608A (en) * 2016-06-22 2016-10-12 广东亿迅科技有限公司 Distributed crawler system and implementing method thereof
CN107071009A (en) * 2017-03-28 2017-08-18 江苏飞搏软件股份有限公司 A kind of distributed big data crawler system of load balancing
CN107291824A (en) * 2017-05-25 2017-10-24 北京小度信息科技有限公司 Data grab method and device
CN107562541A (en) * 2017-09-05 2018-01-09 广东科杰通信息科技有限公司 A kind of distributed reptile method of load balancing, crawler system
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition
CN110147271A (en) * 2019-05-15 2019-08-20 重庆八戒传媒有限公司 Promote the method, apparatus and computer readable storage medium of crawler agent quality
CN110457555A (en) * 2019-06-24 2019-11-15 平安国际智慧城市科技股份有限公司 Collecting method, device and computer equipment, storage medium based on Docker
CN110516139A (en) * 2019-09-05 2019-11-29 上海携程商务有限公司 Crawler system and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506654A (en) * 2020-12-07 2021-03-16 中国船舶重工集团公司第七一六研究所 Industrial robot distributed collaborative debugging method and system
CN112506654B (en) * 2020-12-07 2023-07-18 中国船舶集团有限公司第七一六研究所 Distributed collaborative debugging method and system for industrial robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination