CN112416551A - A distributed crawler scheduling system - Google Patents
A distributed crawler scheduling system
- Publication number: CN112416551A (application number CN202011303271.6A)
- Authority: CN (China)
- Prior art keywords: crawler, scheduling, queue, filtering, filter
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011303271.6A CN112416551A (zh) | 2020-11-19 | 2020-11-19 | A distributed crawler scheduling system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011303271.6A CN112416551A (zh) | 2020-11-19 | 2020-11-19 | A distributed crawler scheduling system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112416551A (zh) | 2021-02-26 |
Family
ID=74773056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011303271.6A (Pending) CN112416551A (zh) | A distributed crawler scheduling system | 2020-11-19 | 2020-11-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112416551A (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982162A (zh) * | 2012-12-05 | 2013-03-20 | 北京奇虎科技有限公司 | System for acquiring web page information |
US20170004179A1 (en) * | 2015-06-30 | 2017-01-05 | Linkedin Corporation | Managing presentation of online content |
CN107704323A (zh) * | 2017-11-07 | 2018-02-16 | 广州探迹科技有限公司 | A web crawler task scheduling method and device |
CN108121743A (zh) * | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | Method and system for generating and using a general web page template |
CN110020062A (zh) * | 2019-04-12 | 2019-07-16 | 北京邮电大学 | A customizable web crawler method and system |
Non-Patent Citations (3)
Title |
---|
FAN SHAN-SHAN: "Distributed multi-topic Web crawler based on priority queue", 《COMPUTER ENGINEERING AND DESIGN》 * |
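ZHANG ZHEN: "Network traffic measurement algorithm based on the LRU-BF strategy", 《JOURNAL ON COMMUNICATIONS》 * |
FAN YUHAO: "Design and implementation of a distributed web crawler system based on Scrapy", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 * |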
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6968292B2 (en) | Industrial controller event logging | |
JP3812236B2 | Network management system with event control means | |
EP3932025B1 (en) | Computing resource scheduling method, scheduler, internet of things system, and computer readable medium | |
DE69735866T2 | Apparatus and method for generating predictable responses | |
DE3852378T2 | Mechanism and method for reverse flow control | |
EP0913774A2 (en) | Managing computer for a plurality of computers connected via a network | |
CN1901517B | Information exchange system, management server, terminal device, and method for reducing network load | |
EP1903750A1 (en) | Load distributing apparatus | |
JP2008527514A5 (zh) | ||
DE102010002327B4 (de) | Controller | |
EP3014438B1 | Method and device for timely data transfer to the cyclic tasks in a distributed real-time system | |
CN108009258A | An online-configurable data acquisition and analysis platform | |
CN101894163A | A database operation scheduling method and device for a performance data acquisition system | |
CN103561092A | Method and device for managing resources in a private cloud environment | |
CN112416551A | A distributed crawler scheduling system | |
DE3851507T2 | Flow control system for a bus | |
DE69800322T2 | Method and apparatus for improved call control scheduling in a distributed system with dissimilar call processing equipment | |
WO2009135707A1 | Subscriber node of a communication system with a functionally separate transmit event memory | |
CN116010388A | Data verification method, data acquisition server, and data verification system | |
CN113946422A | A dynamically allocated website monitoring and scheduling method | |
CN113660178A | A CDN content management system | |
CN103796182B | A message sending system and method | |
JP2881897B2 | Process management method and apparatus | |
CN110874430B | Web crawler scheduling method, apparatus, and device | |
CN112783634B | Task processing system, method, and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | | Inventors after: Pang Wenjun, Chen Ji, Huang Xing, Li Xiaochao, Yi Xiaoqiang. Inventors before: Pang Wenjun, Chen Ji, Tang Guilin, Li Xiaochao, Yi Xiaoqiang |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 2021-02-26 |