CN103678571A - Multithreaded web crawler execution method applied to a single multi-core processor host - Google Patents
- Publication number
- CN103678571A
- Authority
- CN
- China
- Prior art keywords
- url
- thread
- webpage
- buffer queue
- seed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310661466.1A CN103678571B (zh) | 2013-12-09 | 2013-12-09 | Multithreaded web crawler execution method applied to a single multi-core processor host |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310661466.1A CN103678571B (zh) | 2013-12-09 | 2013-12-09 | Multithreaded web crawler execution method applied to a single multi-core processor host |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678571A (zh) | 2014-03-26 |
CN103678571B (zh) | 2017-01-25 |
Family
ID=50316116
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201310661466.1A CN103678571B (zh) | Active | 2013-12-09 | 2013-12-09 | Multithreaded web crawler execution method applied to a single multi-core processor host |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678571B (zh) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1811757A (zh) * | 1995-12-13 | 2006-08-02 | 奥弗图尔服务公司 | System and method for locating World Wide Web pages and computer network files |
US6631369B1 (en) * | 1999-06-30 | 2003-10-07 | Microsoft Corporation | Method and system for incremental web crawling |
CN103226609A (zh) * | 2013-05-03 | 2013-07-31 | 福建师范大学 | Search method for a web focused search system |
Non-Patent Citations (3)
Title |
---|
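吴丽辉 et al.: "Comparison of hash functions in Web information collection", 《小型微型计算机系统》 * |
梁萍: "Research and implementation of web crawlers and result clustering in search engines", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
金梅: "Research and implementation of performance improvement and function extension for web crawlers", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |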
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
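CN106294393A (zh) * | 2015-05-20 | 2017-01-04 | 天脉聚源(北京)科技有限公司 | Network search method and system |
CN105677918B (zh) * | 2016-03-03 | 2019-02-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN105677918A (zh) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN106953780A (zh) * | 2017-03-15 | 2017-07-14 | 重庆邮电大学 | Deep packet inspection device and method on a many-core platform supporting network product information query |
CN106953780B (zh) * | 2017-03-15 | 2020-04-07 | 重庆邮电大学 | Deep packet inspection device and method on a many-core platform supporting network product information query |
CN108063759A (zh) * | 2017-12-05 | 2018-05-22 | 西安交大捷普网络科技有限公司 | Web vulnerability scanning method |
CN108063759B (zh) * | 2017-12-05 | 2022-08-16 | 西安交大捷普网络科技有限公司 | Web vulnerability scanning method |
CN109670099A (zh) * | 2018-12-21 | 2019-04-23 | 全通教育集团(广东)股份有限公司 | Topic collection method based on education network information |
CN112422707A (zh) * | 2020-10-22 | 2021-02-26 | 北京安博通科技股份有限公司 | Domain name data mining method and device, and Redis server |
CN114817677A (zh) * | 2021-01-21 | 2022-07-29 | 中国移动通信有限公司研究院 | Crawler scheduling method, device and system |
CN113238711A (zh) * | 2021-04-17 | 2021-08-10 | 西安电子科技大学 | Efficient hash calculation method in the field of electronic data forensics |
CN113238711B (zh) * | 2021-04-17 | 2024-02-02 | 西安电子科技大学 | Efficient hash calculation method in the field of electronic data forensics |
CN114297463A (zh) * | 2021-12-20 | 2022-04-08 | 中孚信息股份有限公司 | Data crawling method and system, computer-readable storage medium and electronic device |
CN114417216A (zh) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | Data collection method and device, electronic device and readable storage medium |
CN114417216B (zh) * | 2022-01-04 | 2022-11-29 | 马上消费金融股份有限公司 | Data collection method and device, electronic device and readable storage medium |
CN114900487A (zh) * | 2022-05-27 | 2022-08-12 | 深圳铸泰科技有限公司 | Method and system for optimizing traffic capture based on memory design |
CN114900487B (zh) * | 2022-05-27 | 2023-12-19 | 深圳铸泰科技有限公司 | Method and system for optimizing traffic capture based on memory design |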
Also Published As
Publication number | Publication date |
---|---|
CN103678571B (zh) | 2017-01-25 |
Similar Documents
Publication | Title |
---|---|
CN103678571B (zh) | Multithreaded web crawler execution method applied to a single multi-core processor host |
Nai et al. | GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks |
Shi et al. | Fast and concurrent RDF queries with RDMA-based distributed graph exploration |
He et al. | RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems |
CN104331497A (zh) | Method and device for parallel processing of file indexes using vector instructions |
Addisie et al. | Heterogeneous memory subsystem for natural graph analytics |
CN102193830B (zh) | Divide-and-conquer map/reduce parallel programming model for many-core environments |
Morari et al. | Scaling irregular applications through data aggregation and software multithreading |
US20160335322A1 (en) | Automatic generation of multi-source breadth-first search from high-level graph language |
Dees et al. | Efficient many-core query execution in main memory column-stores |
Li et al. | GraphIA: An in-situ accelerator for large-scale graph processing |
Chen et al. | Grasper: A high performance distributed system for OLAP on property graphs |
Choi et al. | Memory harvesting in Multi-GPU systems with hierarchical unified virtual memory |
Li et al. | PIM-WEAVER: A high energy-efficient, general-purpose acceleration architecture for string operations in big data processing |
Kim et al. | BionicDB: Fast and Power-Efficient OLTP on FPGA. |
You et al. | Scalable and efficient spatial data management on multi-core CPU and GPU clusters: A preliminary implementation based on Impala |
Soysal et al. | A sparse memory allocation data structure for sequential and parallel association rule mining |
Volk et al. | GPU-Based Speculative Query Processing for Database Operations. |
Asiatici et al. | Request, coalesce, serve, and forget: Miss-optimized memory systems for bandwidth-bound cache-unfriendly applications on FPGAs |
Malik et al. | Task scheduling for GPU accelerated hybrid OLAP systems with multi-core support and text-to-integer translation |
Sharma et al. | Explain plan and SQL trace the two approaches for RDBMS tuning |
Awan | Performance characterization and optimization of in-memory data analytics on a scale-up server |
Breß et al. | Exploring the design space of a GPU-aware database architecture |
Liu et al. | Straggler-aware parallel graph processing in hybrid memory systems |
Tripathy et al. | Optimizing a semantic comparator using CUDA-enabled graphics hardware |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | |
TR01 | Transfer of patent right | Effective date of registration: 20230105; Address after: 510000 room 606-609, compound office complex building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou City, Guangdong Province (not for plant use); Patentee after: China Southern Power Grid Internet Service Co.,Ltd.; Address before: Room 301, No. 235, Kexue Avenue, Huangpu District, Guangzhou, Guangdong 510000; Patentee before: OURCHEM INFORMATION CONSULTING CO.,LTD. Effective date of registration: 20230105; Address after: Room 301, No. 235, Kexue Avenue, Huangpu District, Guangzhou, Guangdong 510000; Patentee after: OURCHEM INFORMATION CONSULTING CO.,LTD.; Address before: No. 1068 Xueyuan Avenue, Xili University Town, Nanshan District, Shenzhen, Guangdong 518055; Patentee before: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY CHINESE ACADEMY OF SCIENCES |