CN106021608A - Distributed crawler system and implementing method thereof - Google Patents
Distributed crawler system and implementing method thereof
- Publication number
- CN106021608A (application number CN201610466951.7A)
- Authority
- CN
- China
- Prior art keywords
- url
- module
- queue
- crawler
- target url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a distributed crawler system. The system comprises a page capture module, a target url (uniform resource locator) acquisition module, a scheduling monitoring module, and a target url queue storage module. The system adds a scheduling node that separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, a new node is immediately allocated to replace it. In addition, the cached target url queue is changed to a persistently stored target url queue, and a processed url queue is added so that urls are stored uniformly for all crawling demands. The invention further provides an implementing method of the distributed crawler system. Through timely regulation, the robustness of the system is enhanced, resources are saved, and crawling efficiency is improved.
Description
Technical field
The present invention relates to the field of network information gathering, and in particular to a distributed crawler system and an implementing method thereof.
Background
With the development of self-media (self-published social media), the amount of information on the Internet keeps growing. For a hot topic that occurred recently, simply opening a web page and searching is enough to gain a general understanding. However, much of what is on the Internet is disorganized, useless information. If one only wants to learn about current events, a search engine is indeed sufficient. But to understand how a hot topic arose, developed, and concluded, professional information gathering means are needed.
Web crawling is one such information gathering means. A web crawler is a program or script that automatically captures web information according to certain rules. Crawlers on the Internet fall into two kinds:
The first kind is used by search engines such as Baidu, Google, and Sogou. Crawlers of this kind are called general crawlers. They pursue the largest possible network coverage, but often return a large number of web pages that the user does not care about. Moreover, most general crawlers are based on keyword matching and are difficult to extend to support semantic queries.
The second kind, in contrast to the general crawler, is the focused crawler, a directed crawler aimed at specific target websites. Crawlers of this kind are interested only in web pages that satisfy specific rules, and the results returned must be results the user cares about. How to increase coverage and crawling efficiency is the problem this kind of crawler must consider. The duplicated acquisition of resources caused by several different types of directed crawlers running in parallel also urgently needs to be solved. For example, when two crawlers of different types both point to the same web page, that page is processed twice; a more reasonable approach is to process the page only once and let every crawler that points to it obtain the information from the result of the first processing.
Fig. 1 illustrates a distributed web crawler framework based on scrapy, redis, and mongodb. The main crawler is responsible for collecting the urls that meet the target according to rules and pushing them into a redis queue, and it communicates with the page capture nodes through redis. A page capture node obtains a target url from redis and then gathers information, and the collected information is stored in a mongodb cluster. A redis database replaces the queue structure (deque) that scrapy originally uses, and the project's request and stats information is stored in redis, so the crawlers on every machine can be managed centrally. This resolves the crawler's performance bottleneck, and because redis is efficient and easy to scale, efficient downloading is easy to achieve. When redis storage or access speed becomes a bottleneck, it can be improved by increasing the number of redis clusters and crawler clusters.
Although the above method adopts a distributed strategy and greatly improves crawler performance, some problems remain:
1. The main crawler must both crawl urls and regulate the page capture work of every worker node; once the main crawler becomes abnormal, the whole system collapses.
2. This distributed strategy crawls only a single target website. In scenarios where the task target covers multiple websites, some urls are crawled repeatedly, causing a serious waste of resources. In addition, this strategy must clear the redis data before every crawl and provides no scheme for incremental crawling.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art by providing a distributed crawler system and an implementing method thereof that can regulate the system in a timely manner, enhance the robustness of the system, save resources, and improve crawler efficiency.
To achieve the above object, the invention provides a distributed crawler system, the system comprising: a page capture module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module.
The page capture module extracts information for urls taken from the target url queue; after a successful extraction the target url is inserted into the processed url queue, and the collected information is stored in a mongodb cluster.
The target url acquisition module: the main crawler obtains urls that meet the defined rules and pushes them into the target url queue.
The scheduling monitoring module runs through the whole system, communicates directly with each module, and monitors crawler status information and cluster status information.
The target url queue storage module includes two queues: the target url queue and the processed url queue.
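As a rough illustration of the two-queue storage module described above, the following minimal Python sketch keeps both queues in memory. The class name `UrlQueueStore` and its method names are hypothetical, and holding the processed urls in a set is an assumption made for fast lookup; the patent does not specify a data structure or a storage backend.

```python
from collections import deque


class UrlQueueStore:
    """In-memory stand-in for the target url queue storage module (hypothetical)."""

    def __init__(self):
        self.target = deque()   # urls waiting to be captured
        self.processed = set()  # urls already captured successfully

    def push_target(self, url):
        """Queue a url unless it is already queued or already processed."""
        if url in self.processed or url in self.target:
            return False
        self.target.append(url)
        return True

    def next_target(self):
        """Return the oldest queued url without removing it, or None."""
        return self.target[0] if self.target else None

    def mark_processed(self, url):
        """Move a url from the target queue to the processed queue."""
        self.target.remove(url)
        self.processed.add(url)
```

In this sketch a url leaves the target queue only through `mark_processed`, which mirrors the rule that a url enters the processed url queue after a successful extraction.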
Further, the scheduling node in the scheduling monitoring module separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another new node is immediately allocated to replace it.
Further, the scheduling monitoring module can automatically monitor system status information, wherein crawler status information includes start, end, and exception, and cluster status information includes whether a node is idle and whether it is abnormal.
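The status categories and the replacement of an abnormal main crawler node can be sketched as follows. This is only an illustration under assumptions: the enum and function names are hypothetical, the `BUSY` state is added for the sketch and is not named in the patent, and the choice of the first idle node is arbitrary.

```python
from enum import Enum


class CrawlerState(Enum):
    START = "start"
    END = "end"
    EXCEPTION = "exception"


class NodeState(Enum):
    IDLE = "idle"
    BUSY = "busy"          # assumption: not listed in the patent
    ABNORMAL = "abnormal"


def replace_master(nodes, master, crawler_state):
    """Return the node that should act as main crawler after a status report.

    nodes maps node name -> NodeState and is updated in place.
    """
    if crawler_state is not CrawlerState.EXCEPTION:
        return master                      # main crawler is healthy, keep it
    nodes[master] = NodeState.ABNORMAL     # record the failure
    for name, state in nodes.items():
        if state is NodeState.IDLE:        # first idle node takes over
            nodes[name] = NodeState.BUSY
            return name
    raise RuntimeError("no idle node available to replace the main crawler")
```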
In addition, the invention also provides an implementing method of the distributed crawler system, the flow of which is:
The system receives a crawl request signal, and the scheduling monitoring module allocates machines according to the request to process either the url crawling logic or the page capture logic.
If the request is a url crawling request, the target url acquisition module receives a start signal, and the scheduling monitoring module allocates a main crawler node to execute the url collection logic of the target url acquisition module. Urls that meet the request rules are pushed into the target url queue in the target url queue storage module, and execution stops when all qualifying urls have been stored in the target url queue or a termination signal is received.
If the request is a page capture request, the page capture module receives a start signal, and the scheduling monitoring module allocates element crawler capture nodes. A node obtains a url to be captured from the target url queue held in the target url queue storage module and requests the page at that url; the page capture module performs structured cleaning on the page information, and the capture result is stored in the mongodb cluster. The url is deleted from the target url queue of the storage module and pushed into its processed url queue. This operation is repeated until the target url queue is empty or a termination signal is received.
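The first step of this flow, in which the scheduling monitoring module assigns a machine according to the request type, might look like the sketch below. The request labels `"crawl_url"` and `"capture_page"`, the function name `dispatch`, and the first-idle-node policy are all assumptions for illustration; the patent does not define a request format or an allocation policy.

```python
def dispatch(request_type, idle_nodes, handlers):
    """Assign an idle node to a request and run the matching module logic.

    request_type: "crawl_url" or "capture_page" (hypothetical labels).
    idle_nodes: list of node names; the chosen node is removed from it.
    handlers: maps request type -> callable taking the node name.
    """
    if request_type not in handlers:
        raise ValueError(f"unknown request type: {request_type}")
    if not idle_nodes:
        raise RuntimeError("no idle node available")
    node = idle_nodes.pop(0)               # allocate the first idle machine
    return handlers[request_type](node)    # start url crawling or page capture
```

A caller would register one handler per module, for example a function that starts the main crawler for `"crawl_url"` and one that starts a page capture crawler for `"capture_page"`.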
Beneficial effects of the technical solution of the invention:
1. By adding a unified scheduling node, the invention separates crawling logic from control logic. When the main crawler becomes abnormal, the monitor of the scheduling node obtains this signal and allocates a program space on an idle node to replace the main crawler. Likewise, page capture crawler nodes that become abnormal can be regulated in a timely manner, which strengthens system robustness.
2. In addition, the invention changes the cached target url queue into a persistently stored target url queue and adds a processed url queue, so that urls are stored uniformly for all crawling demands. Thus, when crawling needs to be paused temporarily, the system records the current target url queue and processed url queue, which enables incremental crawling after a restart. Furthermore, because of the processed url queue, urls shared by overlapping tasks can be deduplicated at the same time, which saves resources and improves crawler efficiency.
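To make the pause-and-resume idea concrete, here is a minimal sketch that records both queues in a JSON file and reloads them on restart. The file format and the function names `save_queues` and `load_queues` are assumptions chosen to keep the example self-contained; the patent states only that the queues are persistently stored and does not name a mechanism.

```python
import json
import os
import tempfile


def save_queues(path, target, processed):
    """Write both queues to disk atomically so a paused crawl can resume."""
    state = {"target": list(target), "processed": sorted(processed)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_queues(path):
    """Return (target, processed); both are empty on a first run."""
    if not os.path.exists(path):
        return [], set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["target"], set(state["processed"])
```

After a restart, urls still in the reloaded target list are captured first, and any newly discovered url found in the reloaded processed set is skipped, which is what makes the crawl incremental.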
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a framework diagram of a prior-art distributed crawler based on scrapy, redis, and mongodb;
Fig. 2 is a framework diagram of the distributed crawler system of the invention;
Fig. 3 is a flow chart of the target url acquisition module of the invention;
Fig. 4 is a flow chart of the page capture module of the invention;
Fig. 5 is an architecture diagram of the scheduling monitoring module of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
The invention is concerned with how to trace the development of a given hot topic more accurately. Comparing general crawlers with focused crawlers, the focused crawler clearly fits the needs of the invention better. However, a common focused crawler captures only a specific single website, which conflicts with the invention's aim of monitoring the whole Internet. Therefore, building on the single-task focused crawler model, the invention proposes a multi-task crawler framework in which unified task scheduling separates capture from control, making the system more efficient and stable.
Fig. 2 shows the improved distributed crawler system of the invention. The system includes a page capture module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module. By adding a scheduling node, the system separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another new node is immediately allocated to replace it. The system also adds the target url queue storage module, which includes two queues: the target url queue and the processed url queue.
Fig. 3 shows the flow chart of the target url acquisition module of the invention:
1. The scheduling node captures the start signal and allocates a machine to start the main crawler engine.
2. The main crawler obtains from the Internet a url that meets the defined rules.
3. Against the stored processed url queue, it is determined whether this url has already been processed. If so, the url is discarded; otherwise, the url is pushed into the target url queue.
4. Steps 2 and 3 are repeated until no suitable url remains within the defined range or a termination signal is received.
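Steps 2 to 4 above can be sketched as a single loop. The function name `collect_target_urls`, the `matches_rule` and `should_stop` callables, and the candidate list standing in for urls discovered on the Internet are all hypothetical. The extra check that skips a url already waiting in the target queue is an assumption added here and is not one of the listed steps.

```python
from collections import deque


def collect_target_urls(candidates, matches_rule, processed, target,
                        should_stop=lambda: False):
    """Push rule-matching, unprocessed urls from candidates into target."""
    for url in candidates:          # step 2: next candidate url
        if should_stop():           # step 4: termination signal received
            break
        if not matches_rule(url):   # step 2: url must meet the defined rules
            continue
        if url in processed:        # step 3: already processed, discard
            continue
        if url not in target:       # assumption: do not queue a url twice
            target.append(url)
    return target
```

For example, with a rule that accepts only urls containing `/news/`, a page already present in the processed set is discarded even though it matches the rule.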
Fig. 4 shows the flow chart of the page capture module of the invention:
1. The scheduler receives the page capture signal and allocates a machine to start the page capture crawler.
2. The page capture crawler obtains a url from the target url queue and then extracts information. If extraction succeeds, the url is removed from the target url queue and inserted into the processed url queue, and the collected information is stored in the mongodb cluster; if extraction fails, an error log is sent.
3. Step 2 is repeated until the target url queue holds no data or a stop signal is received.
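The capture loop in steps 2 and 3 can be sketched as follows. The names `capture_pages`, `extract`, and `should_stop` are hypothetical, a plain list stands in for the mongodb cluster, and page fetching is injected as a callable so the example runs without network access. The patent says only that an error log is sent on failure; setting the failed url aside so that the loop can make progress is an assumption of this sketch.

```python
import logging
from collections import deque

log = logging.getLogger("page_capture")


def capture_pages(target, processed, results, extract,
                  should_stop=lambda: False):
    """Drain the target queue, storing extracted records in results."""
    failed = []
    while target and not should_stop():      # step 3: until empty or stopped
        url = target[0]                      # read the url without removing it
        try:
            record = extract(url)            # step 2: request and clean the page
        except Exception as exc:
            log.error("extraction failed for %s: %s", url, exc)
            target.popleft()                 # assumption: set failures aside
            failed.append(url)
            continue
        target.popleft()                     # success: leave the target queue
        processed.add(url)                   # and enter the processed queue
        results.append(record)               # stand-in for the mongodb insert
    return failed
```

Returning the failed urls lets the caller decide whether to retry them later or drop them.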
Fig. 5 shows the architecture diagram of the scheduling monitoring module of the invention. The scheduling monitoring module can automatically monitor system status information, mainly two categories: crawler status information and cluster status information. Crawler status information includes start, end, exception, and so on; cluster status information includes whether a node is idle, whether it is abnormal, and so on. The scheduling monitoring module runs through the whole system and communicates directly with each module.
By adding a unified scheduling node and changing the cached target url queue into a persistently stored target url queue, the invention can regulate the system in a timely manner, enhance its robustness, save resources, and improve crawler efficiency.
The embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (4)
1. A distributed crawler system, characterized in that the system comprises:
a page capture module, a target url acquisition module, a scheduling monitoring module, and a target url queue storage module;
the page capture module extracts information for urls taken from the target url queue, inserts the target url into the processed url queue after a successful extraction, and stores the collected information in a mongodb cluster;
the target url acquisition module, in which the main crawler obtains urls that meet the defined rules and pushes them into the target url queue;
the scheduling monitoring module runs through the whole system, communicates directly with each module, and monitors crawler status information and cluster status information;
the target url queue storage module includes two queues: the target url queue and the processed url queue.
2. The system according to claim 1, characterized in that the scheduling node in the scheduling monitoring module separates crawling logic from monitoring logic and regulates the system globally; once the main crawler node is found to be abnormal, another new node is immediately allocated to replace it.
3. The system according to claim 1, characterized in that the scheduling monitoring module can automatically monitor system status information, wherein crawler status information includes start, end, and exception, and cluster status information includes whether a node is idle and whether it is abnormal.
4. An implementing method of a distributed crawler system, characterized in that the flow of the method is:
the system receives a crawl request signal, and the scheduling monitoring module allocates machines according to the request to process either the url crawling logic or the page capture logic;
if the request is a url crawling request, the target url acquisition module receives a start signal, and the scheduling monitoring module allocates a main crawler node to execute the url collection logic of the target url acquisition module; urls that meet the request rules are pushed into the target url queue in the target url queue storage module, and execution stops when all qualifying urls have been stored in the target url queue or a termination signal is received;
if the request is a page capture request, the page capture module receives a start signal, and the scheduling monitoring module allocates element crawler capture nodes; a node obtains a url to be captured from the target url queue held in the target url queue storage module and requests the page at that url; the page capture module performs structured cleaning on the page information, and the capture result is stored in the mongodb cluster; the url is deleted from the target url queue of the storage module and pushed into its processed url queue; this operation is repeated until the target url queue is empty or a termination signal is received.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610466951.7A CN106021608A (en) | 2016-06-22 | 2016-06-22 | Distributed crawler system and implementing method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610466951.7A CN106021608A (en) | 2016-06-22 | 2016-06-22 | Distributed crawler system and implementing method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021608A true CN106021608A (en) | 2016-10-12 |
Family
ID=57087244
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610466951.7A (CN106021608A) | Pending | 2016-06-22 | 2016-06-22 | Distributed crawler system and implementing method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021608A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102355488A (en) * | 2011-08-15 | 2012-02-15 | 北京星网锐捷网络技术有限公司 | Crawler seed obtaining method and equipment and crawler crawling method and equipment |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN103870329A (en) * | 2014-03-03 | 2014-06-18 | 同济大学 | Distributed crawler task scheduling method based on weighted round-robin algorithm |
CN104699757A (en) * | 2015-01-15 | 2015-06-10 | 南京邮电大学 | Distributed network information acquisition method in cloud environment |
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106484886A (en) * | 2016-10-17 | 2017-03-08 | 金蝶软件(中国)有限公司 | A kind of method of data acquisition and its relevant device |
CN108228623A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of data processing method and client device |
CN106874487A (en) * | 2017-02-21 | 2017-06-20 | 国信优易数据有限公司 | A kind of distributed reptile management system and its method |
CN106874487B (en) * | 2017-02-21 | 2020-08-18 | 国信优易数据有限公司 | Distributed crawler management system and method thereof |
CN107145556A (en) * | 2017-04-28 | 2017-09-08 | 安徽博约信息科技股份有限公司 | General distributed parallel computing environment |
CN107145556B (en) * | 2017-04-28 | 2020-12-29 | 安徽博约信息科技股份有限公司 | Universal distributed acquisition system |
CN107562541B (en) * | 2017-09-05 | 2020-08-11 | 广东科杰通信息科技有限公司 | Load balancing distributed crawler method and crawler system |
CN107562541A (en) * | 2017-09-05 | 2018-01-09 | 广东科杰通信息科技有限公司 | A kind of distributed reptile method of load balancing, crawler system |
CN107657053A (en) * | 2017-10-17 | 2018-02-02 | 山东浪潮云服务信息科技有限公司 | A kind of reptile implementation method and device |
CN107908794A (en) * | 2017-12-15 | 2018-04-13 | 广东工业大学 | A kind of method of data mining, system, equipment and computer-readable recording medium |
CN108366217A (en) * | 2018-03-14 | 2018-08-03 | 成都创信特电子技术有限公司 | Monitor video acquisition and storage method |
CN108366217B (en) * | 2018-03-14 | 2021-04-06 | 成都创信特电子技术有限公司 | Monitoring video acquisition and storage method |
CN109063216A (en) * | 2018-10-17 | 2018-12-21 | 珠海市智图数研信息技术有限公司 | A kind of distributed vertical service search crawler frame |
CN111143336A (en) * | 2019-11-27 | 2020-05-12 | 三盟科技股份有限公司 | College scientific research data management-oriented web crawler management method and platform |
CN111522654A (en) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Scheduling processing method, device and equipment for distributed crawler |
CN113282759A (en) * | 2021-04-23 | 2021-08-20 | 国网辽宁省电力有限公司电力科学研究院 | Network security knowledge graph generation method based on threat information |
CN113282759B (en) * | 2021-04-23 | 2024-02-20 | 国网辽宁省电力有限公司电力科学研究院 | Threat information-based network security knowledge graph generation method |
CN113254747A (en) * | 2021-06-09 | 2021-08-13 | 南京北斗创新应用科技研究院有限公司 | Geographic space data acquisition system and method based on distributed web crawler |
CN113254747B (en) * | 2021-06-09 | 2021-10-15 | 南京北斗创新应用科技研究院有限公司 | Geographic space data acquisition system and method based on distributed web crawler |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106021608A (en) | Distributed crawler system and implementing method thereof | |
CN106487596B (en) | Distributed service tracking implementation method | |
CN105824744B (en) | A kind of real-time logs capturing analysis method based on B2B platform | |
CN103902386B (en) | Multi-thread network crawler processing method based on connection proxy optimal management | |
CN101252465B (en) | Warning data acquisition method and server and client end in system | |
CN106789377B (en) | Service parameter updating method of network element cluster | |
CN105357296A (en) | Elastic caching system based on Docker cloud platform | |
CN102957712A (en) | Method and system for loading website resources | |
CN103617287A (en) | Log management method and device in distributed environment | |
CN103902646A (en) | Distributed task managing system and method | |
CN104754036A (en) | Message processing system and processing method based on kafka | |
CN103853743A (en) | Distributed system and log query method thereof | |
CN105045905B (en) | A kind of log maintenance method and system based on full-text search | |
CN103995807B (en) | Magnanimity data query and the method for after-treatment under a kind of framework based on Web | |
CN101662483A (en) | Cache system for cloud computing system and method thereof | |
CN108875091A (en) | A kind of distributed network crawler system of unified management | |
CN108228322B (en) | Distributed link tracking and analyzing method, server and global scheduler | |
CN108009258A (en) | It is a kind of can Configuration Online data collection and analysis platform | |
CN110647392A (en) | Intelligent elastic expansion method based on container cluster | |
CN111522786A (en) | Log processing system and method | |
CN102457578A (en) | Distributed network monitoring method based on event mechanism | |
US9043535B1 (en) | Minimizing application response time | |
CN109446441B (en) | General credible distributed acquisition and storage system for network community | |
CN107395446B (en) | Log real-time processing system | |
Hurst et al. | Social streams blog crawler |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20161012 |