CN110457556A

CN110457556A - Distributed crawler system architecture, method for crawling data, and computer device

Info

Publication number: CN110457556A
Application number: CN201910601110.6A
Authority: CN
Inventors: 车驰; 李钢; 权佳成; 谭瑞; 张瑜
Original assignee: Chongqing Financial Assets Exchange LLC
Current assignee: Guangzhou Zhongtian Technology Consulting Co ltd
Priority date: 2019-07-04
Filing date: 2019-07-04
Publication date: 2019-11-15
Anticipated expiration: 2039-07-04
Also published as: CN110457556B

Abstract

This application discloses a distributed crawler system architecture, a method for crawling data, and a computer device. The method includes: obtaining a crawler task using a task release module and sending the crawler task to a crawler module; after the crawler module receives the crawler task, calling from a crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website; and storing the crawled original crawler data in a preset first storage module. In the distributed crawler system architecture, the method for crawling data, and the computer device of this application, a crawler service module is provided that encapsulates the low-level requirements of the entire crawler system as modular services. This reduces developers' workload, does not restrict developers' programming language, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system.

Description

Distributed crawler system architecture, method for crawling data, and computer device

Technical field

This application relates to the field of data collection, and in particular to a distributed crawler system architecture, a method for crawling data, and a computer device.

Background art

Current crawler platforms are mainly custom-developed for a single business scenario, and different crawlers always require their own separately written modules. As a result, most crawler systems do not take the stability and generality of the whole system into account, and developers' development and maintenance efficiency is low.

Summary of the invention

The main purpose of this application is to provide a distributed crawler system architecture, a method for crawling data, and a computer device, intended to solve the problems that distributed crawler systems built in the prior art have poor stability and generality and that developers' development and maintenance efficiency is low.

To achieve the above object, this application proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. The architecture includes:

a task release module, for publishing crawler tasks;

a crawler service module, for storing different crawler services that exist in the form of services, the different crawler services completing different crawler tasks;

a crawler module, for receiving the crawler task published by the task release module and, according to the crawler task, calling from the crawler service module the crawler service corresponding to the crawler task, and using the crawler service to crawl the target website to obtain the corresponding original crawler data;

a first data storage module, for storing the original crawler data;

a data cleansing module, for cleaning the original crawler data in the first data storage module to obtain filtered first crawler data;

a second data storage module, for storing the first crawler data;

a back-end management module, for providing a visual interface on which human-computer interaction takes place;

a log and error handling module, for obtaining the logs generated by the other modules in the system architecture, then obtaining the error logs among those logs, and handling the event corresponding to each error log according to preset rules.

This application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture described above, comprising:

obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement;

after the crawler module receives the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module holds at least one crawler service encapsulated in the form of a service;

storing the crawled original crawler data in a preset first storage module.

Further, the step of sending the crawler task to the crawler module comprises:

the task release module sending the crawler task to the crawler module in the form of a message queue.

Further, after the step of storing the crawled original crawler data in the preset first storage module, the method comprises:

cleaning the original crawler data in the first storage module using a data cleansing module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.

Further, the method also includes:

obtaining the log data of the other modules in the distributed crawler system architecture using the log and error handling module, and obtaining the error logs in the log data;

handling the event corresponding to the error log according to preset rules.

Further, after the step of handling the event corresponding to the error log according to preset rules, the method comprises:

generating an error report for the event using the log and error handling module, and sending the error report to a preset mailbox.

Further, the step of handling the event corresponding to the error log according to preset rules comprises:

judging, using the log and error handling module, whether the event is a crawl failure;

if the event is a crawl failure, publishing the crawler task corresponding to the event again.

Further, the method also includes:

judging whether a management command sent by the back-end management module has been received;

if so, processing the management command with priority.

This application also provides a computer device, including a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of any of the above methods when executing the computer program.

This application also provides a computer-readable storage medium on which a computer program is stored, wherein the steps of any of the above methods are implemented when the computer program is executed by a processor.

In the distributed crawler system architecture, the method for crawling data with a distributed crawler, the computer device, and the computer-readable storage medium of this application, the architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. This design reduces the coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, making it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the entire crawler system as modular services, which reduces developers' workload, does not restrict developers' programming language, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.

Brief description of the drawings

Fig. 1 is a schematic structural block diagram of the distributed crawler system architecture according to an embodiment of this application;

Fig. 2 is a flow diagram of the method for crawling data with a distributed crawler according to an embodiment of this application;

Fig. 3 is a schematic structural block diagram of the computer device according to an embodiment of this application.

The realization of the purpose, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain this application and are not intended to limit it.

Referring to Fig. 1, this application first proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. The architecture includes:

a task release module 10, for publishing crawler tasks;

a crawler service module 20, for storing different crawler services that exist in the form of services, the different crawler services completing different crawler tasks;

a crawler module 30, for receiving the crawler task published by the task release module and, according to the crawler task, calling from the crawler service module the crawler service corresponding to the crawler task, and using the crawler service to crawl the target website to obtain the corresponding original crawler data;

a first data storage module 40, for storing the original crawler data;

a data cleansing module 50, for cleaning the original crawler data in the first data storage module to obtain filtered first crawler data;

a second data storage module 60, for storing the first crawler data;

a back-end management module 70, for providing a visual interface on which human-computer interaction takes place;

a log and error handling module 80, for obtaining the logs generated by the other modules in the system architecture, then obtaining the error logs among those logs, and handling the event corresponding to each error log according to preset rules.

In this embodiment, the architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. This design reduces the coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, making it easy to scale the system horizontally when more processing capacity is needed. In the above architecture, Docker containerization is used to package and integrate the system environment, module services, and storage systems, and a script can be used to deploy and start the system with one click. When the system needs to be deployed to a new environment, migrating the container files completes the migration, and running the startup script completes the deployment. In the above architecture, crawler programs are not restricted by the development language of the underlying system to a single language; the system can provide basic software support for using the modules from different development languages. A crawler developer therefore only needs to write the code that implements the business logic as a corresponding crawler service and upload it to the crawler service module to complete development, maintenance, and so on of the entire crawler program.
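The isolation that service registration provides can be illustrated with a small sketch. The patent does not specify a registry implementation, so the `ServiceRegistry` class, the service names, and the URLs below are all hypothetical, and an in-memory dictionary stands in for a real HTTP registry. The point is only that a caller looks a service up by logical name and never hard-codes another module's address.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """In-memory stand-in for an HTTP service registry (hypothetical)."""

    endpoints: dict = field(default_factory=dict)

    def register(self, name: str, url: str) -> None:
        # A module announces the address at which it can be reached.
        self.endpoints[name] = url

    def lookup(self, name: str) -> str:
        # Callers depend only on the logical name, not on a fixed address.
        try:
            return self.endpoints[name]
        except KeyError:
            raise LookupError(f"service not registered: {name}") from None


registry = ServiceRegistry()
# Names and URLs are illustrative assumptions, not values from the patent.
registry.register("crawler-service/login", "http://10.0.0.5:8001/login")
registry.register("crawler-service/captcha", "http://10.0.0.6:8002/captcha")
login_url = registry.lookup("crawler-service/login")
```

Because a service written in any language can register an HTTP endpoint, a registry of this kind is one plausible way to obtain the language independence the embodiment describes.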

Referring to Fig. 2, an embodiment of this application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, comprising the steps of:

S1, obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement;

S2, after the crawler module receives the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module holds at least one crawler service encapsulated in the form of a service;

S3, storing the crawled original crawler data in a preset first storage module.

As described in step S1 above, the crawler task is a task that includes a target website and a crawl requirement. The target website is the data source for this crawl. The crawl requirement specifies the type of data to be crawled, for example specified data or the data of a specified function of the target website. A crawler task can be obtained in several ways, for example by receiving a crawler task entered directly by a user or by receiving a crawler task generated by the system. A single crawler task may include multiple crawl requirements, for example a requirement to crawl login data together with a requirement to crawl image recognition data for a CAPTCHA.

As described in step S2 above, a crawler service is a service that can complete a corresponding crawl task. One or more preset crawler services are provided in the crawler service module. The services in the crawler service module are usually common services that correspond to crawl requirements, such as a simulated login service, a CAPTCHA image recognition service, and an IP proxy pool maintenance service. In a specific embodiment, the crawler service module holds an invocation list that stores crawl requirements and crawler services in a one-to-one mapping. When the crawl requirement of a crawler task is obtained, the identical crawl requirement is first looked up in the invocation list, the target crawler service is then obtained through the mapping, and finally the target crawler service is called. When the crawler task includes multiple crawl requirements, the corresponding services are called at the same time. The target crawler service is then used to crawl data from the target website.
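The invocation list described above can be sketched as a dictionary from crawl requirement to crawler service. The requirement names, the two stub services, and the task layout are assumptions made for illustration; real services would be remote calls rather than local functions. For simplicity the sketch calls the services one after another, whereas the embodiment calls them at the same time.

```python
from typing import Callable, Dict, List


def simulate_login(target_site: str) -> dict:
    # Stub for a simulated login service (hypothetical).
    return {"site": target_site, "service": "login"}


def recognize_captcha(target_site: str) -> dict:
    # Stub for a CAPTCHA image recognition service (hypothetical).
    return {"site": target_site, "service": "captcha"}


# One-to-one mapping from crawl requirement to crawler service.
INVOCATION_LIST: Dict[str, Callable[[str], dict]] = {
    "login_data": simulate_login,
    "captcha_image": recognize_captcha,
}


def run_task(task: dict) -> List[dict]:
    """Look up and call the service for every requirement in the task."""
    results = []
    for requirement in task["requirements"]:
        service = INVOCATION_LIST.get(requirement)
        if service is None:
            raise KeyError(f"no crawler service for requirement: {requirement}")
        results.append(service(task["target_site"]))
    return results


task = {
    "target_site": "https://example.com",
    "requirements": ["login_data", "captcha_image"],
}
raw_data = run_task(task)
```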

As described in step S3 above, the crawled data is stored in the first data storage module. The first storage module is generally a file storage system, which is relatively inexpensive and can reduce storage costs.
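As a minimal sketch of file-based storage for step S3, the function below appends raw records to a JSON Lines file named after the task. The file layout, the `store_raw` name, and the use of a local directory are assumptions; the patent only says that the first storage module is generally a file storage system.

```python
import json
from pathlib import Path
from typing import Iterable


def store_raw(records: Iterable[dict], root: Path, task_id: str) -> Path:
    """Append raw crawler records to <root>/<task_id>.jsonl (assumed layout)."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{task_id}.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            # One JSON document per line keeps appends cheap and order stable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

A call such as `store_raw(raw_data, Path("raw"), "task-42")` would write the records under `raw/task-42.jsonl`; both names are illustrative.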

In one embodiment, the above step of sending the crawler task to the crawler module comprises:

S101, the task release module sending the crawler task to the crawler module in the form of a message queue.

As described in step S101 above, a message queue is a container. Sending crawler tasks through a message queue allows rapid horizontal and distributed scaling for large-scale crawler tasks, improving the capacity to process crawler tasks.
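The decoupling that a message queue gives step S101 can be sketched with Python's standard `queue` and `threading` modules. This is an in-process analogy only: a deployed system would use a message broker shared between machines, and the `run_workers` function, the task fields, and the worker count are assumptions. Adding workers increases throughput without any change to the publisher, which is the scaling property the embodiment relies on.

```python
import queue
import threading
from typing import List


def run_workers(tasks: List[dict], worker_count: int = 3) -> List[str]:
    """Publish tasks to a queue and let several crawler workers consume them."""
    task_queue: queue.Queue = queue.Queue()
    results: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                break
            with lock:
                # Placeholder for the real crawl of the target website.
                results.append(f"crawled {task['target_site']}")
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:  # the publisher only ever talks to the queue
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)
    task_queue.join()
    for thread in threads:
        thread.join()
    return results
```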

In one embodiment, after the above step S3 of storing the crawled original crawler data in the preset first storage module, the method comprises:

S4, cleaning the original crawler data in the first storage module using the data cleansing module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.

As described in step S4 above, the data cleansing module may apply several cleaning rules, for example removing duplicate data or removing incomplete data; it may also first filter out the data that is needed and then remove duplicates and so on. The second storage module may be a sub-database set up within the first storage module, for example a folder in the first storage module. In a specific embodiment, the second storage module is a database independent of the first storage module; its cost is higher than that of the first storage module, but it makes the data more convenient to manage. Because the volume of original crawler data is large, the low-cost first storage module is used for it; the cleaned first crawler data is comparatively small in volume, so the second storage module, which is easier to manage but more expensive, is used for it.
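The two cleaning rules named above, removing incomplete data and removing duplicates, might look like the following sketch. Which fields count as required is an assumption (`url` and `title` here), as is treating two records as duplicates only when all their fields match.

```python
import json
from typing import Iterable, List

# Assumed set of fields a record must carry to count as complete.
REQUIRED_FIELDS = ("url", "title")


def clean(records: Iterable[dict]) -> List[dict]:
    """Drop incomplete records, then drop exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        # Rule 1: a missing or empty required field makes the record incomplete.
        if any(not record.get(name) for name in REQUIRED_FIELDS):
            continue
        # Rule 2: a canonical JSON string serves as the duplicate key.
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"url": "https://example.com/1", "title": "One"},
    {"url": "https://example.com/1", "title": "One"},
    {"url": "https://example.com/2", "title": ""},
    {"title": "No URL"},
]
first_crawler_data = clean(raw)
```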

In one embodiment, the above method for crawling data with a distributed crawler further includes:

S5, obtaining the log data of the other modules in the distributed crawler system architecture using the log and error handling module, and obtaining the error logs in the log data;

S6, handling the event corresponding to the error log according to preset rules.

In this embodiment, the method for crawling data is carried out on the distributed crawler system architecture described above. Executing steps such as cleaning the original crawler data and crawling data generates corresponding log data. This application collects the log data and then uses an existing log analysis method to filter out the error logs in it; the event corresponding to each error log is then located and handled automatically, for example by automatically re-executing the step that produced the error log.
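Filtering error logs out of collected log data (step S5) can be sketched with a regular expression. The log line format, with a timestamp, a level, and a bracketed module name, is an assumption; the patent leaves the log analysis method open and refers only to existing methods.

```python
import re
from typing import Iterable, List

# Assumed line format: "<date> <time> <LEVEL> [<module>] <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\S+ \S+) (?P<level>[A-Z]+) \[(?P<module>[^\]]+)\] (?P<message>.*)$"
)


def error_logs(lines: Iterable[str]) -> List[dict]:
    """Return the parsed fields of every line logged at ERROR level."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("level") == "ERROR":
            errors.append(match.groupdict())
    return errors


collected = [
    "2019-07-04 10:00:00 INFO [crawler] fetched page",
    "2019-07-04 10:00:01 ERROR [crawler] task=42 crawl failed: timeout",
    "not a structured line",
]
errors = error_logs(collected)
```

Lines that do not match the assumed format are skipped rather than treated as errors; a production collector would probably want to count them.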

In one embodiment, after the above step S6 of handling the event corresponding to the error log according to preset rules, the method comprises:

S7, generating an error report for the event using the log and error handling module, and sending the error report to a preset mailbox.

As described in step S7 above, mail content is generated according to preset requirements from the error log, the result of handling the event, and so on, and the mail is then sent to the preset mailbox. The mailbox may be the mailbox of a designated developer. There may also be several different mailboxes, each corresponding to one developer, so that developers learn of errors promptly. Further, a receipt is received when a mailbox opens the mail; as soon as one receipt is received, the mail sent to the other mailboxes, which do not correspond to that receipt, is recalled, preventing several developers from handling the same problem at the same time after reading the mail.
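Generating the mail content of step S7 can be sketched with the standard library's `email.message.EmailMessage`. The subject format, sender address, and body layout are assumptions, and the sketch stops at building the message: actually sending it, for example through `smtplib`, depends on a mail server the patent does not describe, as does the read-receipt and recall behavior.

```python
from email.message import EmailMessage
from typing import List


def build_error_report(
    error: dict,
    outcome: str,
    recipients: List[str],
    sender: str = "crawler-alerts@example.com",  # assumed sender address
) -> EmailMessage:
    """Build a mail reporting one error log and how its event was handled."""
    message = EmailMessage()
    message["Subject"] = f"[crawler] error in {error['module']}"
    message["From"] = sender
    # One preset mailbox per developer, all addressed in a single mail.
    message["To"] = ", ".join(recipients)
    message.set_content(
        f"Time: {error['time']}\n"
        f"Module: {error['module']}\n"
        f"Error: {error['message']}\n"
        f"Handling result: {outcome}\n"
    )
    return message


report = build_error_report(
    {"time": "2019-07-04 10:00:01", "module": "crawler", "message": "timeout"},
    outcome="task republished",
    recipients=["dev-a@example.com", "dev-b@example.com"],
)
```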

In one embodiment, the above step S6 of handling the event corresponding to the error log according to preset rules comprises:

S71, judging, using the log and error handling module, whether the event is a crawl failure;

S72, if the event is a crawl failure, publishing the crawler task corresponding to the event again.

As described in steps S71 and S72 above, when a crawl fails, the developer can be notified by mail promptly, the cause of the error is recorded, and the error handling logic puts the crawler task back into the message queue so that it is crawled again. This improves the stability of the process and enables automated operation and maintenance.
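Steps S71 and S72 amount to a check followed by a re-enqueue, as in the sketch below. The event layout and the `crawl_failed` type name are assumptions. The retry cap is also an addition that is not in the patent; it is included because republishing without a limit would loop forever on a task that can never succeed.

```python
import queue

MAX_RETRIES = 3  # assumption: the patent does not state a retry limit


def handle_event(event: dict, task_queue: queue.Queue) -> bool:
    """Republish the task behind a crawl-failure event; report whether it was requeued."""
    # S71: judge whether the event is a crawl failure.
    if event.get("type") != "crawl_failed":
        return False
    task = dict(event["task"])  # copy so the logged event stays unchanged
    retries = task.get("retries", 0)
    if retries >= MAX_RETRIES:
        return False  # give up; a report to the developers would follow
    # S72: put the crawler task back into the message queue.
    task["retries"] = retries + 1
    task_queue.put(task)
    return True


pending: queue.Queue = queue.Queue()
requeued = handle_event(
    {"type": "crawl_failed", "task": {"target_site": "https://example.com"}},
    pending,
)
```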

In one embodiment, the above method further includes:

S8, judging whether a management command sent by the back-end management module has been received;

S9, if so, processing the management command with priority.

In this embodiment, the back-end management module monitors and manages the entire crawler system through a web-based management interface. Through the back-end management module, a crawler process can be started by uploading a script and configuration; crawler tasks with performance bottlenecks can be observed and the crawler module scaled in real time; and all crawler tasks in the system can be monitored and their data analyzed.
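One way to give management commands priority (steps S8 and S9) is a priority queue in which commands carry a smaller priority number than crawler tasks. The two priority levels, the `submit` and `drain` helpers, and the payload shapes are assumptions; the patent states only that a received management command is processed first.

```python
import itertools
import queue
from typing import List

MANAGEMENT, CRAWL = 0, 1  # smaller number is served first
_sequence = itertools.count()  # tie-breaker that keeps FIFO order within a level


def submit(work: queue.PriorityQueue, priority: int, payload: dict) -> None:
    # The sequence number also prevents payload dicts from being compared.
    work.put((priority, next(_sequence), payload))


def drain(work: queue.PriorityQueue) -> List[dict]:
    """Take items off the queue in the order a worker would process them."""
    order = []
    while not work.empty():
        _, _, payload = work.get()
        order.append(payload)
    return order


work: queue.PriorityQueue = queue.PriorityQueue()
submit(work, CRAWL, {"task": "A"})
submit(work, CRAWL, {"task": "B"})
submit(work, MANAGEMENT, {"command": "pause"})
processing_order = drain(work)
```

Although the `pause` command is submitted last, it is taken off the queue before both crawler tasks.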

The method for crawling data with a distributed crawler according to the embodiments of this application is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. This design reduces the coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, making it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the entire crawler system as modular services, which reduces developers' workload, does not restrict developers' programming language, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.

Referring to Fig. 3, an embodiment of this application also provides a computer device, which may be a management server or a server corresponding to a management node, and whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data such as that of the modules of the distributed crawler system architecture. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a method for crawling data with a distributed crawler.

The processor executes the above method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, comprising: obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement; after the crawler module receives the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled original crawler data in a preset first storage module.

In one embodiment, the above step of sending the crawler task to the crawler module comprises: the task release module sending the crawler task to the crawler module in the form of a message queue.

In one embodiment, after the above step of storing the crawled original crawler data in the preset first storage module, the method comprises: cleaning the original crawler data in the first storage module using the data cleansing module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.

In one embodiment, the above method for crawling data with a distributed crawler further includes: obtaining the log data of the other modules in the distributed crawler system architecture using the log and error handling module, and obtaining the error logs in the log data; and handling the event corresponding to the error log according to preset rules.

In one embodiment, after the above step of handling the event corresponding to the error log according to preset rules, the method comprises: generating an error report for the event using the log and error handling module, and sending the error report to a preset mailbox.

In one embodiment, the above step of handling the event corresponding to the error log according to preset rules comprises: judging, using the log and error handling module, whether the event is a crawl failure; and, if the event is a crawl failure, publishing the crawler task corresponding to the event again.

In one embodiment, the above method for crawling data with a distributed crawler further includes: judging whether a management command sent by the back-end management module has been received; and, if so, processing the management command with priority.

Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied.

The computer device of the embodiments of this application is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. This design reduces the coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, making it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the entire crawler system as modular services, which reduces developers' workload, does not restrict developers' programming language, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.

An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the above method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, comprising: obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement; after the crawler module receives the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled original crawler data in a preset first storage module.

The above method for crawling data with a distributed crawler is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another through message queues. This design reduces the coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, making it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the entire crawler system as modular services, which reduces developers' workload, does not restrict developers' programming language, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in this application and its embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

1. A distributed crawler system architecture, characterized in that the architecture is designed around HTTP service registration, the different modules are isolated from one another, and the different modules access one another through message queues, the architecture comprising:

a task release module, for publishing crawler tasks;

a crawler service module, for storing different crawler services that exist in the form of services, the different crawler services completing different crawler tasks;

a crawler module, for receiving the crawler task published by the task release module and, according to the crawler task, calling from the crawler service module the crawler service corresponding to the crawler task, and using the crawler service to crawl the target website to obtain the corresponding original crawler data;

a first data storage module, for storing the original crawler data.

2. A method for crawling data with a distributed crawler, based on the distributed crawler system architecture according to claim 1, characterized by comprising:

obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement;

after the crawler module receives the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module holds at least one crawler service encapsulated in the form of a service;

storing the crawled original crawler data in a preset first storage module.

3. The method for crawling data with a distributed crawler according to claim 2, characterized in that the step of sending the crawler task to the crawler module comprises:

the task release module sending the crawler task to the crawler module in the form of a message queue.

4. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a data cleansing module and a second storage module, and after the step of storing the crawled original crawler data in the preset first storage module, the method comprises:

cleaning the original crawler data in the first storage module using the data cleansing module to obtain cleaned first crawler data, and storing the first crawler data in the preset second storage module.

5. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a log and error handling module, and the method further comprises:

obtaining the log data of the other modules in the distributed crawler system architecture using the log and error handling module, and obtaining the error logs in the log data;

handling the event corresponding to the error log according to preset rules.

6. The method for crawling data with a distributed crawler according to claim 5, characterized in that after the step of handling the event corresponding to the error log according to preset rules, the method comprises:

generating an error report for the event using the log and error handling module, and sending the error report to a preset mailbox.

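7. The method for crawling data with a distributed crawler according to claim 5, characterized in that the step of handling the event corresponding to the error log according to preset rules comprises:

judging, using the log and error handling module, whether the event is a crawl failure;

if the event is a crawl failure, publishing the crawler task corresponding to the event again.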

8. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a back-end management module, and the method further comprises:

judging whether a management command sent by the back-end management module has been received;

if so, processing the management command with priority.

9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 2 to 8 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the steps of the method according to any one of claims 2 to 8 are implemented when the computer program is executed by a processor.