CN106021608A - Distributed crawler system and implementing method thereof - Google Patents

Distributed crawler system and implementing method thereof

Info

Publication number
CN106021608A
Authority
CN
China
Prior art keywords
url
module
queue
crawler
target url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610466951.7A
Other languages
Chinese (zh)
Inventor
余虎
潘嘉朋
张郭强
徐少强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Eshore Technology Co Ltd
Original Assignee
Guangdong Eshore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-06-22
Filing date
2016-06-22
Publication date
2016-10-12
Application filed by Guangdong Eshore Technology Co Ltd
Priority to CN201610466951.7A
Publication of CN106021608A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a distributed crawler system comprising a page acquisition module, a target url (uniform resource locator) acquisition module, a scheduling monitoring module, and a target url queue storage module. The system adds a scheduling node that separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another node is immediately allocated to replace it as the main node. In addition, the cached target url queue is changed into a persistently stored target url queue, and a processed url queue is added, so that urls for all crawling requirements are stored in one place. The invention further provides an implementing method of the distributed crawler system. Timely regulation strengthens the robustness of the system, saves resources, and improves crawling efficiency.

Description

Distributed crawler system and implementing method thereof
Technical field
The present invention relates to the field of network information collection, and in particular to a distributed crawler system and an implementing method thereof.
Background art
With the development of social media and self-media, the amount of information on the Internet keeps growing. For a hot topic that emerged recently, a web search is usually enough to get a general picture. However, much of what is on the Internet is disorganized, useless information. If the goal is only to learn about current events, a search engine is sufficient. To understand how a hot topic arose, developed, and ended, professional means of information collection are needed.
An Internet crawler is one such means of information collection. A web crawler is a program or script that automatically fetches web information according to certain rules. Crawlers on the Internet fall into two kinds:
The first kind serves search engines such as Baidu, Google, and Sogou. Crawlers of this kind are called general-purpose crawlers. They pursue the largest possible network coverage, but often return a large number of web pages that the user does not care about. Moreover, most general-purpose crawlers are based on keyword matching and are difficult to extend to support semantic queries.
The second kind, in contrast to the general-purpose crawler, is the focused crawler: a directed crawler aimed at specific target websites. A crawler of this kind is interested only in web pages that satisfy specific rules, and the results it returns must be results the user cares about. How to increase coverage and crawling efficiency is the problem this kind of crawler has to consider. Reuse of resources across multiple directed crawlers of different types running in parallel also urgently needs to be solved. For example, two crawlers of different types may both point to the same web page, so that the page is processed twice. A more reasonable approach is to process the page only once and have every crawler that points to it obtain the information from the result of the first processing.
Fig. 1 shows a distributed web crawler framework based on scrapy, redis, and mongodb. The main crawler collects urls that match the target according to rules and pushes them into a redis queue. It communicates with the page acquisition nodes through redis: a page acquisition node obtains a target url from redis and then collects information, and the collected information is stored in a mongodb cluster. A redis database replaces the queue structure (deque) that scrapy originally uses, and the project's request and stats information is stored in redis, so the crawlers on every machine can be managed centrally. This resolves the crawler's performance bottleneck. Because redis is efficient and easy to scale, high-efficiency downloading is easy to achieve; when redis storage or access speed becomes a bottleneck, it can be improved by enlarging the redis cluster and the crawler cluster.
Although the above method adopts a distributed strategy and greatly improves crawler performance, some problems remain:
1. The main crawler has to crawl urls and also regulate the page acquisition work of every worker node. Once the main crawler becomes abnormal, the whole system collapses.
2. This distributed strategy crawls only a single target website. In scenarios where the task targets cover multiple websites, some urls are crawled repeatedly, which seriously wastes resources. In addition, this strategy must clear the redis data before each crawl and provides no scheme for incremental crawling.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a distributed crawler system and an implementing method thereof that can regulate in a timely manner, strengthen the robustness of the system, save resources, and improve crawler efficiency.
To achieve the above object, the present invention provides a distributed crawler system, the system comprising:
A page acquisition module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module.
The page acquisition module extracts information for urls taken from the target url queue. After a successful extraction, the target url is inserted into the processed url queue, and the collected information is stored in a mongodb cluster.
In the target url acquisition module, the main crawler obtains urls that satisfy the defined rules and pushes them into the target url queue.
The scheduling monitoring module runs through the whole system, communicates directly with every module, and monitors crawler status information and cluster status information.
The target url queue storage module contains two queues: the target url queue and the processed url queue.
Further, the scheduling node in the scheduling monitoring module separates crawling logic from monitoring logic and regulates the system globally. Once the main crawler node is found to be abnormal, another node is immediately allocated to replace it as the main node.
Further, the scheduling monitoring module automatically monitors system status information. Crawler status information includes started, finished, and abnormal; cluster status information includes whether a node is idle and whether it is abnormal.
In addition, the present invention provides an implementing method of the distributed crawler system. The flow of the method is:
The system receives a crawl request signal, and the scheduling monitoring module allocates machines, according to the request, to run either the url crawling logic or the page acquisition logic.
For a url crawling request, the target url acquisition module receives a start signal, the scheduling monitoring module allocates a main crawler node, and the url collection logic of the target url acquisition module is executed. Urls that satisfy the request rules are pushed into the target url queue of the target url queue storage module. Execution stops when all qualifying urls have been stored in the target url queue or a termination signal is received.
For a page acquisition request, the page acquisition module receives a start signal and the scheduling monitoring module allocates element crawler acquisition nodes. A node obtains a url to be collected from the target url queue held in the target url queue storage module and requests the page at that url. The page acquisition module performs structured cleaning on the page information, and the result is stored in the mongodb cluster. The url is then deleted from the target url queue of the storage module and pushed into the processed url queue of the storage module. This operation repeats until the target url queue is empty or a termination signal is received.
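The request dispatch in the method flow above can be sketched as follows. This is a minimal illustration in Python rather than the implementation of the invention: the names `RequestType`, `Scheduler`, and `dispatch`, the machine names, and the policy of taking the first idle machine are all assumptions made for the example.

```python
from enum import Enum


class RequestType(Enum):
    CRAWL_URLS = "crawl_urls"        # run the target url acquisition logic
    ACQUIRE_PAGES = "acquire_pages"  # run the page acquisition logic


class Scheduler:
    """Toy scheduling monitoring module: assigns an idle machine per request."""

    def __init__(self, machines):
        self.idle = list(machines)  # machines not currently assigned
        self.assigned = {}          # machine name -> role

    def dispatch(self, request_type):
        if not self.idle:
            raise RuntimeError("no idle machine available")
        if request_type is RequestType.CRAWL_URLS:
            role = "main_crawler"
        elif request_type is RequestType.ACQUIRE_PAGES:
            role = "page_acquisition_crawler"
        else:
            raise ValueError(f"unknown request type: {request_type!r}")
        machine = self.idle.pop(0)  # hypothetical policy: first idle machine
        self.assigned[machine] = role
        return machine, role


scheduler = Scheduler(["node-a", "node-b"])
first = scheduler.dispatch(RequestType.CRAWL_URLS)
second = scheduler.dispatch(RequestType.ACQUIRE_PAGES)
```

Keeping the role decision inside the scheduler, and not inside the crawlers, mirrors the separation of crawling logic from control logic that the method relies on.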
Beneficial effects of the technical solution of the present invention:
1. The present invention adds a unified scheduling node and separates crawling logic from control logic. When the main crawler becomes abnormal, the monitor on the scheduling node receives this signal and allocates one of the idle nodes to run a replacement main crawler. Likewise, a page acquisition crawler node that becomes abnormal can be regulated in a timely manner, which strengthens the robustness of the system.
2. In addition, the present invention changes the cached target url queue into a persistently stored target url queue and adds a processed url queue, so that urls for all crawling requirements are stored in one place. When a crawl has to be paused, the system retains the current target url queue and processed url queue, which enables incremental crawling after it is restarted. Furthermore, because of the processed url queue, urls that overlap between different tasks are also deduplicated, which saves resources and improves crawler efficiency.
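The effect of persisting both queues can be illustrated with a small sketch. The source does not name a storage technology for the persistent queues, so the use of Python's standard `sqlite3` module here is an assumption, as are the names `UrlStore`, `push_target`, `next_target`, and `mark_processed` and the example urls.

```python
import os
import sqlite3
import tempfile


class UrlStore:
    """Two persistent queues: target urls (pending) and processed urls (done)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS target "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)"
        )
        self.db.execute("CREATE TABLE IF NOT EXISTS processed (url TEXT PRIMARY KEY)")
        self.db.commit()

    def push_target(self, url):
        """Queue a url unless it is already processed or already pending."""
        done = self.db.execute(
            "SELECT 1 FROM processed WHERE url = ?", (url,)
        ).fetchone()
        if done:
            return False
        cur = self.db.execute("INSERT OR IGNORE INTO target (url) VALUES (?)", (url,))
        self.db.commit()
        return cur.rowcount == 1

    def next_target(self):
        """Return the oldest pending url, or None when the queue is empty."""
        row = self.db.execute("SELECT url FROM target ORDER BY id LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_processed(self, url):
        """Move a url from the target queue to the processed queue atomically."""
        with self.db:  # commits both statements together, rolls back on error
            self.db.execute("DELETE FROM target WHERE url = ?", (url,))
            self.db.execute("INSERT OR IGNORE INTO processed (url) VALUES (?)", (url,))


path = os.path.join(tempfile.mkdtemp(), "urls.db")
store = UrlStore(path)
store.push_target("http://example.com/a")
store.push_target("http://example.com/b")
store.mark_processed(store.next_target())
store.db.close()  # simulate pausing the crawl

resumed = UrlStore(path)  # simulate restarting later
pending = resumed.next_target()
duplicate_accepted = resumed.push_target("http://example.com/a")
```

After the restart only the unprocessed url is still pending, and a second task that submits an already processed url is rejected, which is the incremental crawling and deduplication behavior described above.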
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of the prior-art distributed crawler framework based on scrapy, redis, and mongodb;
Fig. 2 is a framework diagram of the distributed crawler system of the present invention;
Fig. 3 is a flow chart of the target url acquisition module of the present invention;
Fig. 4 is a flow chart of the page acquisition module of the present invention;
Fig. 5 is an architecture diagram of the scheduling monitoring module of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The present invention is concerned with tracing the development of a hot topic more accurately. Comparing general-purpose crawlers with focused crawlers, the focused crawler clearly fits this need better. However, an ordinary focused crawler can only crawl a specific single website, which conflicts with the aim of the present invention to monitor the whole Internet. The present invention therefore builds on the single-task focused crawler model and proposes a multitask crawler framework in which unified task scheduling separates collection from control, making the system more efficient and stable.
Fig. 2 shows the improved distributed crawler system of the present invention, which comprises a page acquisition module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module. The system adds a scheduling node that separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another node is immediately allocated to replace it as the main node. A target url queue storage module is also added, containing two queues: the target url queue and the processed url queue.
Fig. 3 shows the flow chart of the target url acquisition module of the present invention:
1. The scheduling node captures a start signal and allocates a machine to start the main crawler engine.
2. The main crawler obtains from the Internet a url that satisfies the defined rules.
3. The cached processed url queue is checked to determine whether this url has already been processed. If so, the url is discarded; otherwise, the url is pushed into the target url queue.
4. Steps 2 and 3 are repeated until no qualifying url remains within the defined scope or a termination signal is received.
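Steps 2 to 4 above can be sketched as a single loop. This is an illustrative Python sketch under stated assumptions: the function name `acquire_target_urls`, its parameters, and the in-memory `deque` and `set` standing in for the two queues are hypothetical, and the extra check against urls already waiting in the target queue is an addition not stated in the source.

```python
from collections import deque


def acquire_target_urls(candidates, matches_rule, target_queue, processed,
                        stop_requested=lambda: False):
    """Push rule-matching, not yet processed urls onto the target queue."""
    added = 0
    for url in candidates:        # step 2: next candidate url found by the crawler
        if stop_requested():      # step 4: stop on a termination signal
            break
        if not matches_rule(url):
            continue              # url does not satisfy the defined rules
        if url in processed:      # step 3: already processed, discard
            continue
        if url in target_queue:   # assumption: also skip urls already pending
            continue
        target_queue.append(url)  # step 3: push into the target url queue
        added += 1
    return added


target_queue = deque()
processed = {"http://news.example/1"}
candidates = [
    "http://news.example/1",   # already processed
    "http://news.example/2",   # new and matching
    "http://other.example/x",  # does not match the rule
    "http://news.example/2",   # duplicate of a pending url
]
added = acquire_target_urls(
    candidates,
    lambda u: u.startswith("http://news.example/"),
    target_queue,
    processed,
)
```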
Fig. 4 shows the flow chart of the page acquisition module of the present invention:
1. The scheduler receives a page acquisition signal and allocates machines to start page acquisition crawlers.
2. A page acquisition crawler obtains a url from the target url queue and extracts information. If extraction succeeds, the url is removed from the target url queue and inserted into the processed url queue, and the collected information is stored in the mongodb cluster. If extraction fails, an error log is sent.
3. Step 2 is repeated until the target url queue is empty or a stop signal is received.
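The page acquisition loop in the steps above can be sketched in the same style. The names `collect_pages` and `fake_extract` are hypothetical, a plain list stands in for the mongodb cluster, and an in-memory `deque` and `set` stand in for the two queues. The source does not say what happens to a url whose extraction fails beyond sending an error log; setting it aside in a `failed` list so that the loop terminates is an assumption of this sketch.

```python
import logging
from collections import deque


def collect_pages(target_queue, processed, results, extract,
                  stop_requested=lambda: False):
    """Drain the target queue; return the urls whose extraction failed."""
    failed = []
    while target_queue and not stop_requested():  # step 3: until empty or stopped
        url = target_queue.popleft()              # step 2: take a url
        try:
            record = extract(url)                 # request the page, clean the data
        except Exception as exc:
            logging.error("extraction failed for %s: %s", url, exc)
            failed.append(url)                    # assumption: set failed urls aside
            continue
        results.append(record)                    # stand-in for the mongodb cluster
        processed.add(url)                        # insert into the processed queue
    return failed


def fake_extract(url):
    """Hypothetical extractor: fails for urls ending in /bad."""
    if url.endswith("/bad"):
        raise ValueError("unparseable page")
    return {"url": url, "title": "page " + url.rsplit("/", 1)[-1]}


queue = deque(["http://news.example/1", "http://news.example/bad"])
processed, results = set(), []
failed = collect_pages(queue, processed, results, fake_extract)
```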
Fig. 5 shows the architecture diagram of the scheduling monitoring module of the present invention. The module automatically monitors two kinds of system status information: crawler status information and cluster status information. Crawler status information includes started, finished, abnormal, and so on; cluster status information includes whether a node is idle, whether it is abnormal, and so on. The scheduling monitoring module runs through the whole system and communicates directly with every module.
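The monitoring and replacement behavior can be sketched as follows. The three crawler states come from the description; everything else, including the names `CrawlerState`, `Monitor`, and `report`, the node names, and the rule of promoting the alphabetically first idle node, is an assumption made for illustration.

```python
from enum import Enum


class CrawlerState(Enum):
    STARTED = "started"
    FINISHED = "finished"
    ABNORMAL = "abnormal"


class Monitor:
    """Toy monitor: tracks crawler states and replaces an abnormal main node."""

    def __init__(self, nodes, main_node):
        self.idle = {n for n in nodes if n != main_node}  # cluster state: idle nodes
        self.main_node = main_node
        self.states = {main_node: CrawlerState.STARTED}   # crawler state per node

    def report(self, node, state):
        """Record a status report and fail over if the main crawler is abnormal."""
        self.states[node] = state
        if node == self.main_node and state is CrawlerState.ABNORMAL:
            self._replace_main()

    def _replace_main(self):
        if not self.idle:
            raise RuntimeError("no idle node available to take over")
        replacement = sorted(self.idle)[0]  # hypothetical deterministic choice
        self.idle.remove(replacement)
        self.main_node = replacement
        self.states[replacement] = CrawlerState.STARTED


monitor = Monitor(["node-a", "node-b", "node-c"], main_node="node-a")
monitor.report("node-a", CrawlerState.ABNORMAL)
```

Because the target url queue and processed url queue live outside the crawler nodes, the replacement main crawler can continue from the stored queues without relying on state held by the failed node.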
By adding a unified scheduling node and changing the cached target url queue into a persistently stored target url queue, the present invention can regulate in a timely manner, strengthens the robustness of the system, saves resources, and improves crawler efficiency.
The embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (4)

1. A distributed crawler system, characterized in that the system comprises:
a page acquisition module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module;
the page acquisition module, which extracts information for urls taken from the target url queue; after a successful extraction, the target url is inserted into the processed url queue, and the collected information is stored in a mongodb cluster;
the target url acquisition module, in which the main crawler obtains urls that satisfy the defined rules and pushes them into the target url queue;
the scheduling monitoring module, which runs through the whole system, communicates directly with every module, and monitors crawler status information and cluster status information;
the target url queue storage module, which contains two queues: the target url queue and the processed url queue.
2. The system according to claim 1, characterized in that the scheduling node in the scheduling monitoring module separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another node is immediately allocated to replace it as the main node.
3. The system according to claim 1, characterized in that the scheduling monitoring module automatically monitors system status information, wherein crawler status information includes started, finished, and abnormal, and cluster status information includes whether a node is idle and whether it is abnormal.
4. An implementing method of a distributed crawler system, characterized in that the flow of the method is:
the system receives a crawl request signal, and the scheduling monitoring module allocates machines, according to the request, to run either the url crawling logic or the page acquisition logic;
for a url crawling request, the target url acquisition module receives a start signal, the scheduling monitoring module allocates a main crawler node, and the url collection logic of the target url acquisition module is executed; urls that satisfy the request rules are pushed into the target url queue of the target url queue storage module; execution stops when all qualifying urls have been stored in the target url queue or a termination signal is received;
for a page acquisition request, the page acquisition module receives a start signal and the scheduling monitoring module allocates element crawler acquisition nodes; a node obtains a url to be collected from the target url queue held in the target url queue storage module and requests the page at that url; the page acquisition module performs structured cleaning on the page information, and the result is stored in the mongodb cluster; the url is deleted from the target url queue of the storage module and pushed into the processed url queue of the storage module; this operation repeats until the target url queue is empty or a termination signal is received.
CN201610466951.7A 2016-06-22 2016-06-22 Distributed crawler system and implementing method thereof Pending CN106021608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610466951.7A CN106021608A (en) 2016-06-22 2016-06-22 Distributed crawler system and implementing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610466951.7A CN106021608A (en) 2016-06-22 2016-06-22 Distributed crawler system and implementing method thereof

Publications (1)

Publication Number Publication Date
CN106021608A true CN106021608A (en) 2016-10-12

Family

ID=57087244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610466951.7A Pending CN106021608A (en) 2016-06-22 2016-06-22 Distributed crawler system and implementing method thereof

Country Status (1)

Country Link
CN (1) CN106021608A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355488A (en) * 2011-08-15 2012-02-15 北京星网锐捷网络技术有限公司 Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN103870329A (en) * 2014-03-03 2014-06-18 同济大学 Distributed crawler task scheduling method based on weighted round-robin algorithm
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104699757A (en) * 2015-01-15 2015-06-10 南京邮电大学 Distributed network information acquisition method in cloud environment
CN105677918A (en) * 2016-03-03 2016-06-15 浪潮软件股份有限公司 Distributed crawler architecture based on Kafka and Quartz and implementation method thereof

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN108228623A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 A kind of data processing method and client device
CN106874487A (en) * 2017-02-21 2017-06-20 国信优易数据有限公司 Distributed crawler management system and method thereof
CN106874487B (en) * 2017-02-21 2020-08-18 国信优易数据有限公司 Distributed crawler management system and method thereof
CN107145556A (en) * 2017-04-28 2017-09-08 安徽博约信息科技股份有限公司 General distributed parallel computing environment
CN107145556B (en) * 2017-04-28 2020-12-29 安徽博约信息科技股份有限公司 Universal distributed acquisition system
CN107562541B (en) * 2017-09-05 2020-08-11 广东科杰通信息科技有限公司 Load balancing distributed crawler method and crawler system
CN107562541A (en) * 2017-09-05 2018-01-09 广东科杰通信息科技有限公司 Load balancing distributed crawler method and crawler system
CN107657053A (en) * 2017-10-17 2018-02-02 山东浪潮云服务信息科技有限公司 Crawler implementation method and device
CN107908794A (en) * 2017-12-15 2018-04-13 广东工业大学 A kind of method of data mining, system, equipment and computer-readable recording medium
CN108366217A (en) * 2018-03-14 2018-08-03 成都创信特电子技术有限公司 Monitor video acquisition and storage method
CN108366217B (en) * 2018-03-14 2021-04-06 成都创信特电子技术有限公司 Monitoring video acquisition and storage method
CN109063216A (en) * 2018-10-17 2018-12-21 珠海市智图数研信息技术有限公司 A kind of distributed vertical service search crawler frame
CN111143336A (en) * 2019-11-27 2020-05-12 三盟科技股份有限公司 College scientific research data management-oriented web crawler management method and platform
CN111522654A (en) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 Scheduling processing method, device and equipment for distributed crawler
CN113282759A (en) * 2021-04-23 2021-08-20 国网辽宁省电力有限公司电力科学研究院 Network security knowledge graph generation method based on threat information
CN113282759B (en) * 2021-04-23 2024-02-20 国网辽宁省电力有限公司电力科学研究院 Threat information-based network security knowledge graph generation method
CN113254747A (en) * 2021-06-09 2021-08-13 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler
CN113254747B (en) * 2021-06-09 2021-10-15 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler

Similar Documents

Publication Publication Date Title
CN106021608A (en) Distributed crawler system and implementing method thereof
CN106487596B (en) Distributed service tracking implementation method
CN105824744B (en) A kind of real-time logs capturing analysis method based on B2B platform
CN103902386B (en) Multi-thread network crawler processing method based on connection proxy optimal management
CN101252465B (en) Warning data acquisition method and server and client end in system
CN106789377B (en) Service parameter updating method of network element cluster
CN105357296A (en) Elastic caching system based on Docker cloud platform
CN102957712A (en) Method and system for loading website resources
CN103617287A (en) Log management method and device in distributed environment
CN103902646A (en) Distributed task managing system and method
CN104754036A (en) Message processing system and processing method based on kafka
CN103853743A (en) Distributed system and log query method thereof
CN105045905B (en) A kind of log maintenance method and system based on full-text search
CN103995807B (en) Magnanimity data query and the method for after-treatment under a kind of framework based on Web
CN101662483A (en) Cache system for cloud computing system and method thereof
CN108875091A (en) A kind of distributed network crawler system of unified management
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
CN108009258A (en) It is a kind of can Configuration Online data collection and analysis platform
CN110647392A (en) Intelligent elastic expansion method based on container cluster
CN111522786A (en) Log processing system and method
CN102457578A (en) Distributed network monitoring method based on event mechanism
US9043535B1 (en) Minimizing application response time
CN109446441B (en) General credible distributed acquisition and storage system for network community
CN107395446B (en) Log real-time processing system
Hurst et al. Social streams blog crawler

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161012
