WO2019061384A1 - Method and system for electing a task manager in a distributed crawler system - Google Patents

Method and system for electing a task manager in a distributed crawler system

Info

Publication number
WO2019061384A1
WO2019061384A1 PCT/CN2017/104724 CN2017104724W WO2019061384A1 WO 2019061384 A1 WO2019061384 A1 WO 2019061384A1 CN 2017104724 W CN2017104724 W CN 2017104724W WO 2019061384 A1 WO2019061384 A1 WO 2019061384A1
Authority
WO
WIPO (PCT)
Prior art keywords
distributed
task manager
task
message
crawler
Prior art date
Application number
PCT/CN2017/104724
Other languages
English (en)
French (fr)
Inventor
马岩
Original Assignee
麦格创科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-30
Filing date
2017-09-30
Publication date
2019-04-04
Application filed by 麦格创科技(深圳)有限公司
Priority to PCT/CN2017/104724
Publication of WO2019061384A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of data processing, and in particular to a method and system for electing a task manager in a distributed crawler system.
  • Web crawlers (also known as web spiders or web robots, and more often called web chasers in the FOAF community) are programs or scripts that automatically fetch World Wide Web information according to certain rules.
  • Other, less frequently used names are ants, automatic indexers, emulators, or worms.
  • A web crawler is in effect an application for fetching network information.
  • Existing web crawlers fetch large volumes of data, and the task manager that assigns tasks is chosen at random, which can reduce the efficiency of task assignment and therefore of the crawler.
  • The application provides a method for electing a task manager in a distributed crawler system, which addresses the low efficiency of prior-art technical solutions.
  • In a first aspect, a distributed crawler task assignment method is provided, comprising the following steps:
  • A distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system; the distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters; the distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and the distributed device with the most votes is determined as the task manager; if the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  • Optionally, the method further includes:
  • The distributed device with the second-highest number of votes is determined as the standby task manager, and the task processing threshold of the standby task manager is lowered.
  • Optionally, the method further includes:
  • If the task manager fails, the standby task manager is started as the task manager of the distributed system.
  • In a second aspect, a distributed crawler task assignment system is provided, comprising a plurality of distributed devices;
  • A distributed device is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system; to broadcast its device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; and to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager;
  • If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  • Optionally, the distributed device is further configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
  • Optionally, the distributed device is further configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
  • In a third aspect, a distributed device is provided, including a processor, a wireless transceiver, a memory, and a bus, wherein the processor, the wireless transceiver, and the memory are connected by the bus.
  • The wireless transceiver is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system;
  • The processor is configured to broadcast the device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager; and, if the distributed device is the task manager, to assign the crawler tasks it was processing locally to other distributed devices.
  • Optionally, the processor is configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
  • Optionally, the processor is configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
  • In a fourth aspect, a computer-readable storage medium is provided, storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided by the first aspect.
  • The technical solution provided by the present invention uses an election to choose the device with the optimal parameters from among multiple distributed devices as the task manager. Once a device becomes the task manager, it no longer processes crawler tasks and assigns the crawler tasks it was processing locally to the other distributed devices, which allows crawler tasks to be allocated quickly and improves efficiency.
  • FIG. 1 is a flowchart of a method for electing a task manager in a distributed crawler system according to a first preferred embodiment of the present invention.
  • FIG. 2 is a structural diagram of a system for electing a task manager in a distributed crawler system according to a second preferred embodiment of the present invention.
  • FIG. 3 is a hardware structural diagram of a distributed device according to a second preferred embodiment of the present invention.
  • Referring to FIG. 1, FIG. 1 shows a method for electing a task manager in a distributed crawler system according to a first preferred embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S101: The distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system.
  • Step S102: The distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters.
  • The device parameters may include hardware parameters, such as memory, CPU, and storage parameters, and may also include variable parameters, such as the number of crawler tasks, memory usage, and CPU usage.
  • Step S103: The distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and the distributed device with the most votes is determined as the task manager.
  • Step S104: If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  • The technical solution provided by the present invention uses an election to choose the device with the optimal parameters from among multiple distributed devices as the task manager. Once a device becomes the task manager, it no longer processes crawler tasks and assigns the crawler tasks it was processing locally to the other distributed devices, which allows crawler tasks to be allocated quickly and improves efficiency.
  • Optionally, the foregoing method may further include:
  • The distributed device with the second-highest number of votes is determined as the standby task manager, and the task processing threshold of the standby task manager is lowered.
  • Optionally, the foregoing method may further include:
  • If the task manager fails, the standby task manager is started as the task manager of the distributed system. This avoids repeated elections disrupting processing.
  • Optionally, the foregoing method may further include:
  • Devices directly connected to the task manager are determined as the first device group, and crawler tasks are allocated to the first device group using a first load balancing algorithm; devices indirectly connected to the task manager are determined as the second device group, and crawler tasks are allocated to the second device group using a second load balancing algorithm, where the task threshold of the second load balancing algorithm is smaller than that of the first load balancing algorithm.
  • The reasoning is that a directly connected device is only one hop away, so the distance is short, network delay is low, and communication failures with the task manager are unlikely; such devices form the core first device group, and a load balancing algorithm with a larger task threshold is used to allocate their tasks.
  • A distant device is more hops away, so the distance is long, network delay is high, and communication failures are more likely; it should therefore be assigned fewer tasks.
  • Optionally, the foregoing method may further include:
  • The heartbeat messages between the devices and the task manager are monitored. If a first heartbeat message from a first device is not received within a set time, the crawler tasks not completed by the first device are determined as crawler tasks to be allocated.
  • Referring to FIG. 2, FIG. 2 shows a distributed crawler implementation system according to a second preferred embodiment of the present invention.
  • As shown in FIG. 2, the system includes a plurality of distributed devices 201, and the task manager is connected to the devices;
  • A distributed device is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system; to broadcast its device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; and to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager;
  • If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  • Optionally, the distributed device is further configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
  • Optionally, the distributed device is further configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
  • Referring to FIG. 3, FIG. 3 shows a distributed device 30, including a processor 301, a wireless transceiver 302, a memory 303, and a bus 304.
  • The wireless transceiver 302 is configured to send data to and receive data from external devices.
  • The number of processors 301 may be one or more.
  • In some embodiments of the present application, the processor 301, the memory 303, and the wireless transceiver 302 may be connected by the bus 304 or by other means.
  • The distributed device 30 can be used to perform the steps of FIG. 1. For the meaning and examples of the terms used in this embodiment, refer to the embodiment corresponding to FIG. 1; they are not repeated here.
  • The wireless transceiver 302 is configured to acquire the crawler tasks and to obtain the distance of each device connected to the task manager and its number of crawler tasks.
  • Program code is stored in the memory 303.
  • The processor 301 is configured to call the program code stored in the memory 303 to perform the following operations:
  • The processor 301 is configured to allocate crawler tasks to a device according to the distance and the number of crawler tasks.
  • The processor 301 here may be a single processing element or a collective term for multiple processing elements.
  • For example, the processing element may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more microprocessors (digital signal processors, DSPs) or one or more field-programmable gate arrays (FPGAs).
  • The memory 303 may be a single storage device or a collective term for multiple storage elements, and is used to store executable program code or the parameters, data, and the like that the application needs to run. The memory 303 may include random access memory (RAM) and may also include non-volatile memory, such as disk storage or flash memory.
  • The bus 304 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus.
  • The device may further include an input/output apparatus connected to the bus 304, so as to connect to the processor 301 and other parts through the bus.
  • The input/output apparatus may provide an input interface through which an operator can select control items, or it may be another interface through which other devices can be connected externally.
  • The program may be stored in a computer-readable storage medium, which may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science
  • Databases & Information Systems
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Mobile Radio Communication Systems

Abstract

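A method for electing a task manager in a distributed crawler system, the method comprising: a distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system (101); the distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters (102); the distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determines the distributed device with the most votes as the task manager (103); if the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices (104). The method can elect a task manager in a distributed system with relatively high efficiency.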

Description

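Method and system for electing a task manager in a distributed crawler system
Technical Field
The present invention relates to the field of data processing, and in particular to a method and system for electing a task manager in a distributed crawler system.
Background Art
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches World Wide Web information according to certain rules. Other, less frequently used names are ant, automatic indexer, emulator, or worm.
A web crawler is in effect an application for fetching network information. Existing web crawlers fetch large volumes of data, and the task manager that assigns tasks is chosen at random, which can reduce the efficiency of task assignment and therefore of the crawler.
Technical Problem
The present application provides a method for electing a task manager in a distributed crawler system, which addresses the low efficiency of prior-art technical solutions.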
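Technical Solution
In a first aspect, a distributed crawler task assignment method is provided, the method comprising the following steps:
A distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system; the distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters; the distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and the distributed device with the most votes is determined as the task manager; if the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
Optionally, the method further includes:
Determining the distributed device with the second-highest number of votes as the standby task manager, and lowering the task processing threshold of the standby task manager.
Optionally, the method further includes:
If the task manager fails, starting the standby task manager as the task manager of the distributed system.
In a second aspect, a distributed crawler task assignment system is provided, the system comprising a plurality of distributed devices;
A distributed device is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system; to broadcast its device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; and to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager;
If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
Optionally, the distributed device is further configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
Optionally, the distributed device is further configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
In a third aspect, a distributed device is provided, including a processor, a wireless transceiver, a memory, and a bus, the processor, the wireless transceiver, and the memory being connected by the bus,
The wireless transceiver is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system;
The processor is configured to broadcast the device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager; and, if the distributed device is the task manager, to assign the crawler tasks it was processing locally to other distributed devices.
Optionally, the processor is configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
Optionally, the processor is configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided by the first aspect.
Beneficial Effects
The technical solution provided by the present invention uses an election to choose the device with the optimal parameters from among multiple distributed devices as the task manager. Once a device becomes the task manager, it no longer processes crawler tasks and assigns the crawler tasks it was processing locally to the other distributed devices, which allows crawler tasks to be allocated quickly and improves efficiency.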
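Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for electing a task manager in a distributed crawler system according to a first preferred embodiment of the present invention;
FIG. 2 is a structural diagram of a system for electing a task manager in a distributed crawler system according to a second preferred embodiment of the present invention.
FIG. 3 is a hardware structural diagram of a distributed device according to a second preferred embodiment of the present invention.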
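Embodiments of the Invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, FIG. 1 shows a method for electing a task manager in a distributed crawler system according to a first preferred embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S101: The distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system.
Step S102: The distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters.
The device parameters may include hardware parameters, such as memory, CPU, and storage parameters, and may also include variable parameters, such as the number of crawler tasks, memory usage, and CPU usage.
Step S103: The distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and the distributed device with the most votes is determined as the task manager.
Step S104: If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
The technical solution provided by the present invention uses an election to choose the device with the optimal parameters from among multiple distributed devices as the task manager. Once a device becomes the task manager, it no longer processes crawler tasks and assigns the crawler tasks it was processing locally to the other distributed devices, which allows crawler tasks to be allocated quickly and improves efficiency.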
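Steps S101 to S104 can be illustrated with a minimal Python sketch. The application does not say how "optimal" device parameters are computed, how ties are broken, or how the manager's local tasks are redistributed, so the `score` weights, the tie-break on `device_id`, and the round-robin hand-off below are all assumptions made for illustration; every name in the block is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceParams:
    """Parameters a device broadcasts in step S102 (field names are illustrative)."""
    device_id: str
    memory_gb: float    # hardware parameter
    cpu_cores: int      # hardware parameter
    cpu_usage: float    # variable parameter, 0.0 to 1.0
    crawler_tasks: int  # variable parameter


def score(p: DeviceParams) -> float:
    """Assumed scoring rule: more free capacity and fewer queued tasks is better."""
    return p.cpu_cores * (1.0 - p.cpu_usage) + 0.5 * p.memory_gb - 0.1 * p.crawler_tasks


def pick_best(broadcasts: list) -> str:
    """Step S102: extract the device with the optimal parameters from the broadcasts."""
    # Ties are broken on device_id so that every device reaches the same answer.
    return max(broadcasts, key=lambda p: (score(p), p.device_id)).device_id


def tally(votes: list) -> list:
    """Step S103: sum (candidate, vote_count) messages; highest total first."""
    totals = Counter()
    for candidate, count in votes:
        totals[candidate] += count
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def reassign_local_tasks(manager_tasks: list, workers: list) -> dict:
    """Step S104: the new manager hands its own tasks to the other devices, round-robin."""
    if not workers:
        raise ValueError("no other devices to receive the manager's tasks")
    plan = {w: [] for w in workers}
    for i, task in enumerate(manager_tasks):
        plan[workers[i % len(workers)]].append(task)
    return plan
```

In this sketch each device would call `pick_best` on the broadcasts it has received, send a vote for the result, and call `tally` on the votes it receives; the first entry of the tally is the task manager.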
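Optionally, the above method may further include:
Determining the distributed device with the second-highest number of votes as the standby task manager, and lowering the task processing threshold of the standby task manager.
Optionally, the above method may further include:
If the task manager fails, starting the standby task manager as the task manager of the distributed system.
This approach avoids repeated elections disrupting processing.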
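The standby mechanism can be sketched as follows. The sketch assumes the election result is available as a list of `(device_id, votes)` pairs sorted by votes, and that "lowering the task processing threshold" means scaling a per-device task limit by a fixed factor; the application specifies neither, so `Roles`, `lower_threshold`, and the default factor of 0.5 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Roles:
    manager: str
    standby: Optional[str]


def assign_roles(ranked: list) -> Roles:
    """ranked holds (device_id, votes) pairs sorted by votes, highest first."""
    if not ranked:
        raise ValueError("no votes received")
    standby = ranked[1][0] if len(ranked) > 1 else None
    return Roles(manager=ranked[0][0], standby=standby)


def lower_threshold(thresholds: dict, standby: str, factor: float = 0.5) -> dict:
    """Return a copy of the limits with the standby's limit reduced (factor is assumed)."""
    updated = dict(thresholds)
    updated[standby] = max(1, int(updated[standby] * factor))
    return updated


def fail_over(roles: Roles, manager_alive: bool) -> Roles:
    """Promote the standby without holding a new election if the manager has failed."""
    if manager_alive or roles.standby is None:
        return roles
    return Roles(manager=roles.standby, standby=None)
```

Keeping the standby lightly loaded is what makes the promotion cheap: it has few crawler tasks of its own to hand off when it takes over.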
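Optionally, the above method may further include:
Determining the devices directly connected to the task manager as the first device group and allocating crawler tasks to the first device group using a first load balancing algorithm; determining the devices indirectly connected to the task manager as the second device group and allocating crawler tasks to the second device group using a second load balancing algorithm, where the task threshold of the second load balancing algorithm is smaller than that of the first load balancing algorithm.
The reasoning is as follows. A directly connected device is only one hop away, so the distance is short, network delay is low, and communication failures with the task manager are unlikely; such devices form the core first device group, and a load balancing algorithm with a larger task threshold is used to allocate their tasks. A distant device is more hops away, so the distance is long, network delay is high, and communication failures are more likely; it should therefore be assigned fewer tasks.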
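A minimal sketch of the two-group allocation is given below. The application names neither load balancing algorithm, so this sketch swaps in a single least-loaded rule for both groups and differentiates them only by task threshold; the function names, the default thresholds, and the tie-break that prefers nearer devices are assumptions. Hop counts are assumed to be 1 or greater.

```python
def group_by_hops(hops: dict) -> tuple:
    """Split devices into the first group (1 hop) and the second group (more than 1 hop)."""
    first = sorted(d for d, h in hops.items() if h == 1)
    second = sorted(d for d, h in hops.items() if h > 1)
    return first, second


def allocate(tasks: list, hops: dict, load: dict,
             first_threshold: int = 10, second_threshold: int = 4) -> tuple:
    """Give each task to the least-loaded device that is still under its group's threshold."""
    if second_threshold >= first_threshold:
        raise ValueError("the second group's threshold must be smaller than the first's")
    first, _ = group_by_hops(hops)
    limit = {d: (first_threshold if d in first else second_threshold) for d in hops}
    load = dict(load)  # work on a copy so the caller's view is not mutated
    plan = {d: [] for d in hops}
    unassigned = []
    for task in tasks:
        candidates = [d for d in hops if load.get(d, 0) < limit[d]]
        if not candidates:
            unassigned.append(task)
            continue
        # Prefer the lowest load; on ties prefer fewer hops, then the device name.
        target = min(candidates, key=lambda d: (load.get(d, 0), hops[d], d))
        plan[target].append(task)
        load[target] = load.get(target, 0) + 1
    return plan, unassigned
```

With `hops = {"near": 1, "far": 3}` and thresholds of 3 and 1, five tasks end up as three on `near`, one on `far`, and one left unassigned until capacity frees up.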
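Optionally, the above method may further include:
Monitoring the heartbeat messages between the devices and the task manager; if a first heartbeat message from a first device is not received within a set time, the crawler tasks not completed by the first device are determined as crawler tasks to be allocated.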
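The heartbeat check can be sketched as two small functions. The sketch assumes the task manager records the time of the last heartbeat from each device and tracks each device's unfinished tasks; the data structures and function names are hypothetical, and the caller supplies the current time (for example from `time.monotonic()`) so the logic stays easy to test.

```python
def find_timed_out(last_heartbeat: dict, now: float, timeout: float) -> list:
    """Return the devices whose last heartbeat is older than the set time."""
    return sorted(d for d, seen in last_heartbeat.items() if now - seen > timeout)


def collect_pending(unfinished: dict, timed_out: list) -> list:
    """Gather the unfinished crawler tasks of timed-out devices for reallocation."""
    pending = []
    for device in timed_out:
        pending.extend(unfinished.get(device, []))
    return pending
```

The returned list can then be fed back into whatever allocation routine the task manager uses for new crawler tasks.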
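Referring to FIG. 2, FIG. 2 shows a distributed crawler implementation system according to a second preferred embodiment of the present invention. As shown in FIG. 2, the system includes a plurality of distributed devices 201, and the task manager is connected to the devices;
A distributed device is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system; to broadcast its device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; and to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager;
If the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
Optionally, the distributed device is further configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
Optionally, the distributed device is further configured to start the standby task manager as the task manager of the distributed system if the task manager fails.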
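Referring to FIG. 3, FIG. 3 shows a distributed device 30, including a processor 301, a wireless transceiver 302, a memory 303, and a bus 304. The wireless transceiver 302 is configured to send data to and receive data from external devices. The number of processors 301 may be one or more. In some embodiments of the present application, the processor 301, the memory 303, and the wireless transceiver 302 may be connected by the bus 304 or by other means. The distributed device 30 can be used to perform the steps of FIG. 1. For the meaning and examples of the terms used in this embodiment, refer to the embodiment corresponding to FIG. 1; they are not repeated here.
The wireless transceiver 302 is configured to acquire the crawler tasks and to obtain the distance of each device connected to the task manager and its number of crawler tasks.
Program code is stored in the memory 303. The processor 301 is configured to call the program code stored in the memory 303 to perform the following operations:
The processor 301 is configured to allocate crawler tasks to a device according to the distance and the number of crawler tasks.
It should be noted that the processor 301 here may be a single processing element or a collective term for multiple processing elements. For example, the processing element may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more microprocessors (digital signal processors, DSPs) or one or more field-programmable gate arrays (FPGAs).
The memory 303 may be a single storage device or a collective term for multiple storage elements, and is used to store executable program code or the parameters, data, and the like that the application needs to run. The memory 303 may include random access memory (RAM) and may also include non-volatile memory, such as disk storage or flash memory.
The bus 304 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The device may further include an input/output apparatus connected to the bus 304, so as to connect to the processor 301 and other parts through the bus. The input/output apparatus may provide an input interface through which an operator can select control items, or it may be another interface through which other devices can be connected externally.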
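It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. A person skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
A person of ordinary skill in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
The method for electing a task manager and the related devices and systems provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.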

Claims (10)

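  1. A method for electing a task manager in a distributed crawler system, characterized in that the method comprises the following steps:
    a distributed device receives or initiates an election message, the election message being used to elect a task manager from the distributed crawler system;
    the distributed device broadcasts its device parameters to the other devices in the distributed crawler system through a broadcast message, receives the broadcast messages sent by the other devices, and extracts from them the first distributed device with the optimal device parameters;
    the distributed device receives voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and the distributed device with the most votes is determined as the task manager;
    if the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  2. The method according to claim 1, characterized in that the method further comprises:
    determining the distributed device with the second-highest number of votes as the standby task manager, and lowering the task processing threshold of the standby task manager.
  3. The method according to claim 1, characterized in that the method further comprises:
    if the task manager fails, starting the standby task manager as the task manager of the distributed system.
  4. A distributed crawler task assignment system, characterized in that the system comprises a plurality of distributed devices;
    a distributed device is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system; to broadcast its device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; and to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager;
    if the distributed device is the task manager, it assigns the crawler tasks it was processing locally to other distributed devices.
  5. The system according to claim 4, characterized in that
    the distributed device is further configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
  6. The system according to claim 4, characterized in that
    the distributed device is further configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
  7. A distributed device, including a processor, a wireless transceiver, a memory, and a bus, the processor, the wireless transceiver, and the memory being connected by the bus, characterized in that
    the wireless transceiver is configured to receive or initiate an election message, the election message being used to elect a task manager from the distributed crawler system;
    the processor is configured to broadcast the device parameters to the other devices in the distributed crawler system through a broadcast message, receive the broadcast messages sent by the other devices, and extract from them the first distributed device with the optimal device parameters; to receive voting messages sent by the other devices, each voting message including a number of votes and the distributed device voted for, and determine the distributed device with the most votes as the task manager; and, if the distributed device is the task manager, to assign the crawler tasks it was processing locally to other distributed devices.
  8. The distributed device according to claim 7, characterized in that the processor is configured to determine the distributed device with the second-highest number of votes as the standby task manager and to lower the task processing threshold of the standby task manager.
  9. The distributed device according to claim 7, characterized in that the processor is configured to start the standby task manager as the task manager of the distributed system if the task manager fails.
  10. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1 to 3.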
PCT/CN2017/104724 2017-09-30 2017-09-30 Method and system for electing a task manager in a distributed crawler system WO2019061384A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/104724 WO2019061384A1 (zh) 2017-09-30 2017-09-30 Method and system for electing a task manager in a distributed crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/104724 WO2019061384A1 (zh) 2017-09-30 2017-09-30 Method and system for electing a task manager in a distributed crawler system

Publications (1)

Publication Number Publication Date
WO2019061384A1 (zh) 2019-04-04

Family

ID=65900366

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/104724 WO2019061384A1 (zh) 2017-09-30 2017-09-30 Method and system for electing a task manager in a distributed crawler system

Country Status (1)

Country Link
WO (1) WO2019061384A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650570A (zh) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically scalable distributed crawler system, data processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080137528A1 (en) * 2006-12-06 2008-06-12 Cisco Technology, Inc. Voting to establish a new network master device after a network failover
CN104767794A (zh) * 2015-03-13 2015-07-08 青岛海信传媒网络技术有限公司 Node election method in a distributed system, and node
CN106155780A (zh) * 2015-04-02 2016-11-23 阿里巴巴集团控股有限公司 Time-based node election method and device
CN106685724A (zh) * 2017-01-10 2017-05-17 网宿科技股份有限公司 Election-based node server management method, device and system

Similar Documents

Publication Publication Date Title
CN101442513B (zh) 实现多种业务处理功能的方法和多核处理器设备
WO2021184551A1 (zh) 基于多个网络的通信方法、装置、电子设备及存储介质
WO2018176390A1 (zh) 绕线机的安全防备方法及系统
WO2019061384A1 (zh) 分布式爬虫系统中任务管理器的选举方法及系统
WO2018223354A1 (zh) 基于定位的考勤记录方法及系统
WO2015067051A1 (zh) 测试代理方法及其装置
WO2019061385A1 (zh) 分布式爬虫任务分配方法及系统
WO2021242000A1 (ko) 데이터 적재 및 처리 시스템 및 그 방법
WO2018223375A1 (zh) 终端流量的控制提醒方法及系统
WO2018165839A1 (zh) 分布式爬虫实现方法及系统
WO2018223371A1 (zh) 终端热点的接入控制方法及系统
WO2018176449A1 (zh) 绕线机的进度统计和分配方法及系统
WO2018170889A1 (zh) 即时通信的好友分组方法及系统
WO2018223373A1 (zh) 副号的终端管理方法及系统
WO2018209507A1 (zh) 终端app分身的实现方法及系统
WO2018209502A1 (zh) 终端app的分组方法及系统
WO2018218615A1 (zh) 终端内多个app优先级的确定方法及系统
WO2018209508A1 (zh) 终端app分身的实现方法及系统
WO2023058829A1 (ko) 인-네트워크 관리 장치, 네트워크 스위치, 인-네트워크 데이터 집약 시스템 및 방법
WO2018176223A1 (zh) 即时通信的分身实现方法及系统
WO2018209504A1 (zh) 基于分组的终端app管理方法及系统
WO2018157331A1 (zh) 应用于大数据的存储方法及系统
WO2018176447A1 (zh) 基于绕线机的灯光控制方法及系统
WO2018209550A1 (zh) 终端的系统更新方法及系统
WO2022051921A1 (zh) 网络供电系统

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17926924; Country of ref document: EP; Kind code of ref document: A1)