CN106021608A

CN106021608A - Distributed crawler system and implementing method thereof

Info

Publication number: CN106021608A
Application number: CN201610466951.7A
Authority: CN
Inventors: 余虎; 潘嘉朋; 张郭强; 徐少强
Original assignee: Guangdong Eshore Technology Co Ltd
Current assignee: Guangdong Eshore Technology Co Ltd
Priority date: 2016-06-22
Filing date: 2016-06-22
Publication date: 2016-10-12

Abstract

The invention discloses a distributed crawler system. The system comprises a page capture module, a target URL (uniform resource locator) acquisition module, a scheduling and monitoring module, and a storage target URL queue module. A scheduling node is added so that crawling logic and monitoring logic are separated and the system is regulated globally; once the master crawler node is found to be abnormal, a new node is immediately allocated to replace it. In addition, the cached target URL queue is changed to a persistently stored target URL queue, and a processed URL queue is added so that the URLs for all crawling demands are stored in one place. The invention further provides an implementing method of the distributed crawler system. Through timely regulation, the robustness of the system is enhanced, resources are saved, and crawling efficiency is improved.

Description

Distributed crawler system and implementing method thereof

Technical field

The present invention relates to the field of network information gathering, and in particular to a distributed crawler system and an implementing method thereof.

Background art

With the development of social media and self-publishing media, the volume of information on the Internet keeps growing. To get a general picture of a hot topic that arose recently, it is usually enough to open a web page and search. However, much of what is on the Internet is disorganized, useless information. If the goal is only to learn about current events, a search engine is indeed sufficient. Understanding how a hot topic arose, developed, and ended, however, requires professional information-gathering means.

The Internet crawler is one such information-gathering means. A web crawler is a program or script that automatically captures web information according to certain rules. Crawlers on the Internet fall into two kinds:

The first kind is used by search engines such as Baidu, Google, and Sogou. Crawlers of this kind are called general crawlers. They pursue the widest possible network coverage but often return a large number of web pages that the user does not care about. Moreover, most general crawlers are based on keyword matching and are difficult to extend to support semantic queries.

In contrast to the general crawler is the focused crawler, a directed crawler aimed at specific target websites. Crawlers of this kind are interested in web pages that satisfy specific rules, and the results returned must be the ones the user cares about. How to increase coverage and crawling efficiency is the problem this kind of crawler must consider. The reuse of resources among multiple different kinds of directed crawlers running in parallel also urgently needs to be addressed. For example, if two crawlers of different kinds happen to collect the same web page, the page is processed twice; a more reasonable approach is to process the page only once and let every crawler that points to it obtain the information from the result of the first processing.

Fig. 1 shows a distributed web crawler framework based on scrapy, redis, and mongodb. The master crawler collects URLs that meet the target rules, puts them into a redis queue, and communicates with the page capture nodes through redis. A page capture node obtains a target URL from redis and then gathers information, and the collected information is stored in a mongodb cluster. A redis database replaces the queue structure (deque) that scrapy originally uses, and the project's request and stats information is stored in redis, so that the crawlers on all machines can be managed centrally. This relieves the crawler's performance bottleneck: because redis is efficient and easy to scale, efficient downloading is easy to achieve, and when redis storage or access speed becomes a bottleneck, it can be improved by increasing the size of the redis cluster and the number of crawler clusters.

Although the above method adopts a distributed strategy and greatly improves crawler performance, some problems remain:

1. The master crawler must both crawl URLs and regulate the page capture work of every worker node; once the master crawler becomes abnormal, the whole system collapses.

2. This distributed strategy targets only a single website. When a task covers multiple websites, some URLs are crawled repeatedly, which seriously wastes resources. In addition, this strategy must clear the redis data before every crawl, so it does not support incremental crawling.

Summary of the invention

The object of the invention is to overcome the defects of the prior art and to provide a distributed crawler system and an implementing method thereof that allow timely regulation, enhance the robustness of the system, save resources, and improve crawler efficiency.

To achieve the above object, the invention provides a distributed crawler system, the system comprising:

A page capture module, a target URL acquisition module, a scheduling and monitoring module, and a storage target URL queue module.

The page capture module extracts information using URLs taken from the target URL queue; after a successful extraction, the target URL is inserted into the processed URL queue, and the collected information is stored in a mongodb cluster.

The target URL acquisition module uses the master crawler to obtain URLs that meet the defined rules and pushes each such URL into the target URL queue.

The scheduling and monitoring module runs through the whole system, is in direct contact with every module, and monitors crawler status information and cluster status information.

The storage target URL queue module comprises two queues: the target URL queue and the processed URL queue.

Further, the scheduling node in the scheduling and monitoring module separates crawling logic from monitoring logic and regulates the system globally; once the master crawler node is found to be abnormal, a new node is immediately allocated to replace the master node.

Further, the scheduling and monitoring module automatically monitors system status information, wherein crawler status information includes started, finished, and abnormal, and cluster status information includes whether a node is idle and whether it is abnormal.
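The failover behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `CrawlerState`, `Scheduler`, `assign_master`, and `report` are hypothetical, nodes are represented by plain strings, and "first idle node" is an assumed selection policy that the source does not specify.

```python
from enum import Enum


class CrawlerState(Enum):
    """Crawler status values named in the description."""
    STARTED = "started"
    FINISHED = "finished"
    ABNORMAL = "abnormal"


class Scheduler:
    """Hypothetical scheduling node that tracks idle nodes and the master."""

    def __init__(self, nodes):
        self.idle = list(nodes)    # cluster status: nodes that are idle
        self.abnormal = set()      # cluster status: nodes flagged abnormal
        self.master = None         # node currently running the master crawler

    def assign_master(self):
        """Promote the first idle node to master crawler (assumed policy)."""
        if not self.idle:
            raise RuntimeError("no idle node available")
        self.master = self.idle.pop(0)
        return self.master

    def report(self, node, state):
        """Handle a crawler status report and return the current master."""
        if state is CrawlerState.ABNORMAL:
            self.abnormal.add(node)
            if node == self.master:
                self.assign_master()   # replace the failed master at once
        elif state is CrawlerState.FINISHED:
            if node == self.master:
                self.master = None
            self.idle.append(node)     # a finished node becomes idle again
        return self.master
```

Because the monitoring state lives in the scheduler rather than in the master crawler, a failure of the master does not take the bookkeeping down with it, which is the separation the description argues for.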

In addition, the invention provides an implementing method of the distributed crawler system, the flow of which is as follows:

The system receives a crawl request signal, and the scheduling and monitoring module allocates machines according to the request to run either the URL crawling logic or the page collection logic.

If the request is a URL crawling request, the target URL acquisition module receives a start signal and the scheduling and monitoring module allocates a master crawler node, which executes the URL collection logic of the target URL acquisition module. URLs that meet the request rules are pushed into the target URL queue in the storage target URL queue module. Execution stops when all qualifying URLs have been stored in the target URL queue or a termination signal is received.

If the request is a page collection request, the page capture module receives a start signal and the scheduling and monitoring module allocates a crawler node for page collection. The node obtains the URL to be collected from the target URL queue held in the storage target URL queue module and requests the page at that URL. The page capture module performs structured cleaning on the page information, and the collection result is stored in the mongodb cluster. The URL is then deleted from the target URL queue of the storage module and pushed into its processed URL queue. This operation repeats until the target URL queue is empty or a termination signal is received.
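The request routing in this flow could look something like the sketch below. It is illustrative only: `RequestType`, `dispatch`, and the handler callables are hypothetical names, and the source does not say how the scheduler chooses a machine, so the first idle node is assumed.

```python
from enum import Enum


class RequestType(Enum):
    """The two kinds of crawl request described in the method."""
    CRAWL_URL = "crawl_url"
    COLLECT_PAGE = "collect_page"


def dispatch(request, idle_nodes, handlers):
    """Allocate an idle node and run the logic that matches the request.

    `handlers` maps each RequestType to a callable taking the node name;
    both the mapping and the node representation are assumptions.
    """
    if not idle_nodes:
        raise RuntimeError("no idle node available")
    node = idle_nodes.pop(0)               # assumed policy: first idle node
    return node, handlers[request](node)
```

A caller would register one handler for the target URL acquisition module and one for the page capture module, then call `dispatch` once per incoming request signal.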

Beneficial effects of the technical solution of the invention:

1. By adding a unified scheduling node, the invention separates crawling logic from control logic. When the master crawler becomes abnormal, the monitor on the scheduling node receives the signal and allocates an idle node to take over the master crawler's role. Likewise, an abnormal page-collection crawler node can be dealt with in a timely manner, which enhances the robustness of the system.

2. In addition, the invention changes the cached target URL queue into a persistently stored target URL queue and adds a processed URL queue, so that the URLs for all crawling demands are stored in one place. When crawling needs to be paused, the system keeps a record of the current target URL queue and processed URL queue, which makes incremental crawling possible after a restart. Moreover, because of the processed URL queue, URLs that overlap between different tasks are deduplicated, which saves resources and improves crawler efficiency.
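To make the pause-and-resume idea concrete, the sketch below keeps both queues in a file-backed store. The source does not name the persistent storage technology, so SQLite (from the Python standard library) is used purely as a stand-in; `PersistentUrlStore` and its method names are hypothetical, and skipping URLs that are already pending is an added assumption.

```python
import sqlite3


class PersistentUrlStore:
    """File-backed target and processed URL queues (SQLite as a stand-in)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute("CREATE TABLE IF NOT EXISTS target (url TEXT PRIMARY KEY)")
            self.db.execute("CREATE TABLE IF NOT EXISTS processed (url TEXT PRIMARY KEY)")

    def push_target(self, url):
        """Queue a URL unless it was already processed or is already pending."""
        for table in ("processed", "target"):
            row = self.db.execute(
                f"SELECT 1 FROM {table} WHERE url = ?", (url,)
            ).fetchone()
            if row is not None:
                return False
        with self.db:
            self.db.execute("INSERT INTO target (url) VALUES (?)", (url,))
        return True

    def complete(self, url):
        """Move a URL from the target queue to the processed queue."""
        with self.db:
            self.db.execute("DELETE FROM target WHERE url = ?", (url,))
            self.db.execute("INSERT OR IGNORE INTO processed (url) VALUES (?)", (url,))

    def pending(self):
        """Return the URLs still waiting, in insertion order."""
        rows = self.db.execute("SELECT url FROM target ORDER BY rowid")
        return [row[0] for row in rows]

    def close(self):
        self.db.close()
```

Reopening the same file after a stop yields the URLs that were still pending, and a URL finished by one task is rejected when another task tries to queue it again, which is the deduplication effect described above.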

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a diagram of a prior-art distributed crawler framework based on scrapy, redis, and mongodb;

Fig. 2 is a framework diagram of the distributed crawler system of the invention;

Fig. 3 is a flow chart of the target URL acquisition module of the invention;

Fig. 4 is a flow chart of the page capture module of the invention;

Fig. 5 is an architecture diagram of the scheduling and monitoring module of the invention.

Detailed description of the invention

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The invention is concerned with tracing the development of a given hot topic quickly and accurately. Comparing general crawlers with focused crawlers, the focused crawler clearly fits this need better. However, an ordinary focused crawler can capture only a specific single website, which conflicts with the invention's aim of monitoring the whole Internet. The invention therefore builds on the single-task focused crawler model and proposes a multi-task crawler framework in which unified task scheduling separates collection from control, making the system more efficient and stable.

Fig. 2 shows the improved distributed crawler system of the invention. The system comprises a page capture module, a target URL acquisition module, a scheduling and monitoring module, and a storage target URL queue module. By adding a scheduling node, the system separates crawling logic from monitoring logic and regulates the system globally; once the master crawler node is found to be abnormal, a new node is immediately allocated to replace the master node. A storage target URL queue module is also added, comprising two queues: the target URL queue and the processed URL queue.

Fig. 3 shows the flow chart of the target URL acquisition module of the invention:

1. The scheduling node captures the start signal and allocates a machine to start the master crawler engine.

2. The master crawler obtains from the Internet a URL that meets the defined rules.

3. Using the stored processed URL queue, it determines whether this URL has already been processed. If so, the URL is discarded; otherwise, the URL is pushed into the target URL queue.

4. Steps 2 and 3 are repeated until no applicable URL remains within the defined scope or a termination signal is received.
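Steps 2 to 4 of this flow can be sketched as a single loop. The function below is a minimal illustration rather than the patented code: `acquire_target_urls`, `matches_rule`, and the use of a `threading.Event` as the termination signal are assumptions, and the candidate URLs are passed in as an iterable instead of being fetched from the Internet.

```python
import threading


def acquire_target_urls(candidates, matches_rule, target_queue, processed, stop=None):
    """Push rule-matching, not-yet-processed URLs into the target queue.

    candidates   -- iterable of URLs discovered by the master crawler
    matches_rule -- callable returning True for URLs that meet the rules
    target_queue -- list acting as the target URL queue
    processed    -- set acting as the processed URL queue
    stop         -- optional threading.Event used as the termination signal
    Returns the number of URLs added.
    """
    stop = stop or threading.Event()
    added = 0
    for url in candidates:          # step 2: take the next candidate URL
        if stop.is_set():           # step 4: termination signal received
            break
        if not matches_rule(url):
            continue
        if url in processed:        # step 3: already processed, discard
            continue
        target_queue.append(url)    # step 3: otherwise push to the queue
        added += 1
    return added
```

As in the described flow, the check is made only against the processed queue; a production system would probably also guard against the same URL being pushed to the target queue twice.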

Fig. 4 shows the flow chart of the page capture module of the invention:

1. The scheduler receives the page capture signal and allocates a machine to start the page capture crawler.

2. The page capture crawler obtains a URL from the target URL queue and then extracts information. If the extraction succeeds, the URL is removed from the target URL queue and inserted into the processed URL queue, and the collected information is stored in the mongodb cluster; if the extraction fails, an error log is sent.

3. Step 2 is repeated until the target URL queue is empty or a stop signal is received.
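The page capture loop can be sketched in the same spirit. Everything named here is hypothetical: `capture_pages`, the `extract` callable that stands in for requesting and cleaning a page, and the `results` list that stands in for the mongodb cluster. The source says only that a failed extraction is logged; setting the failed URL aside so that the loop can make progress is an assumption of this sketch.

```python
import logging
from collections import deque

log = logging.getLogger("page_capture")


def capture_pages(target_queue, processed, extract, results, should_stop=lambda: False):
    """Drain the target queue, recording successes and logging failures.

    target_queue -- deque acting as the target URL queue
    processed    -- set acting as the processed URL queue
    extract      -- callable(url) returning a record, raising on failure
    results      -- list standing in for the mongodb cluster
    should_stop  -- callable returning True once a stop signal arrives
    Returns the URLs whose extraction failed.
    """
    failed = []
    while target_queue and not should_stop():    # step 3: loop conditions
        url = target_queue[0]                    # step 2: next target URL
        try:
            record = extract(url)
        except Exception as exc:
            log.error("extraction failed for %s: %s", url, exc)
            failed.append(target_queue.popleft())   # assumption: set aside
            continue
        target_queue.popleft()                   # success: leave target queue
        processed.add(url)                       # and enter processed queue
        results.append(record)                   # store the collected record
    return failed
```

A URL leaves the target queue only after its record has been produced, so stopping the loop between iterations never loses a URL that was not yet handled.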

Fig. 5 shows the architecture diagram of the scheduling and monitoring module of the invention. The module automatically monitors system status information, chiefly two kinds: crawler status information and cluster status information. Crawler status information includes started, finished, abnormal, and so on; cluster status information includes whether a node is idle, whether it is abnormal, and so on. The scheduling and monitoring module runs through the whole system and is in direct contact with every module.

By adding a unified scheduling node and changing the cached target URL queue into a persistently stored target URL queue, the invention allows timely regulation, enhances the robustness of the system, saves resources, and improves crawler efficiency.

The embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. For a person of ordinary skill in the art, both the specific implementation and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A distributed crawler system, characterized in that the system comprises:

a page capture module, a target URL acquisition module, a scheduling and monitoring module, and a storage target URL queue module;

the page capture module, which extracts information using URLs taken from the target URL queue, wherein after a successful extraction the target URL is inserted into the processed URL queue and the collected information is stored in a mongodb cluster;

the target URL acquisition module, which uses the master crawler to obtain URLs that meet the defined rules and pushes each such URL into the target URL queue;

the scheduling and monitoring module, which runs through the whole system, is in direct contact with every module, and monitors crawler status information and cluster status information;

2. The system according to claim 1, characterized in that the scheduling node in the scheduling and monitoring module separates crawling logic from monitoring logic and regulates the system globally, and once the master crawler node is found to be abnormal, a new node is immediately allocated to replace the master node.

3. The system according to claim 1, characterized in that the scheduling and monitoring module automatically monitors system status information, wherein crawler status information includes started, finished, and abnormal, and cluster status information includes whether a node is idle and whether it is abnormal.

4. An implementing method of a distributed crawler system, characterized in that the flow of the method is:

the system receives a crawl request signal, and the scheduling and monitoring module allocates machines according to the request to run either the URL crawling logic or the page collection logic;

if the request is a URL crawling request, the target URL acquisition module receives a start signal and the scheduling and monitoring module allocates a master crawler node, which executes the URL collection logic of the target URL acquisition module; URLs that meet the request rules are pushed into the target URL queue in the storage target URL queue module, and execution stops when all qualifying URLs have been stored in the target URL queue or a termination signal is received;