CN110457556A - Distributed crawler system architecture, method for crawling data, and computer device - Google Patents

Info

Publication number: CN110457556A
Authority: CN (China)
Prior art keywords: crawler, module, data, task, service
Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Application number: CN201910601110.6A
Other languages: Chinese (zh)
Other versions: CN110457556B (en)
Inventors: 车驰, 李钢, 权佳成, 谭瑞, 张瑜
Current Assignee: Guangzhou Zhongtian Technology Consulting Co ltd (the listed assignees may be inaccurate)
Original Assignee: Chongqing Financial Assets Exchange LLC

Legal events

  • Application filed by Chongqing Financial Assets Exchange LLC
  • Priority to CN201910601110.6A
  • Publication of CN110457556A
  • Application granted
  • Publication of CN110457556B
  • Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

This application discloses a distributed crawler system architecture, a method for crawling data, and a computer device. The method includes: obtaining a crawler task through a task release module and sending the crawler task to a crawler module; after the crawler module receives the crawler task, calling from a crawler service module the target crawler service that corresponds to the crawling requirement, and using the target crawler service to crawl raw crawler data from the target website; and storing the crawled raw crawler data in a preset first storage module. In the distributed crawler system architecture, the method for crawling data, and the computer device of this application, the crawler service module encapsulates the low-level needs of the entire crawler system as modular services, which reduces developers' workload, places no restriction on the development language, and lowers the skill required of developers. The architecture design also improves the stability and scalability of the crawler system.

Description

Distributed crawler system architecture, method for crawling data, and computer device
Technical field
This application relates to the field of data collection, and in particular to a distributed crawler system architecture, a method for crawling data, and a computer device.
Background
Current crawler platforms are mostly custom-developed for a single business scenario, and each crawler needs its own separately written modules. As a result, most crawler systems do not consider the stability and generality of the system as a whole, and developers' development and maintenance efficiency is low.
Summary of the invention
The main purpose of this application is to provide a distributed crawler system architecture, a method for crawling data, and a computer device, aiming to solve the problems in the prior art that distributed crawler systems have poor stability and generality and that developers' development and maintenance efficiency is low.
To achieve the above purpose, this application proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. The architecture includes:
A task release module, for publishing crawler tasks;
A crawler service module, for storing different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
A crawler module, for receiving the crawler task published by the task release module, calling from the crawler service module the crawler service that corresponds to the crawler task, and using that crawler service to crawl the target website to obtain the corresponding raw crawler data;
A first data storage module, for storing the raw crawler data;
A data cleaning module, for cleaning the raw crawler data in the first data storage module to obtain the filtered first crawler data;
A second data storage module, for storing the first crawler data;
A back-end management module, for providing a visual interface on which human-computer interaction takes place;
A log and error handling module, for collecting the logs generated by the other modules in the system architecture, extracting the error logs from those logs, and handling the events corresponding to the error logs according to preset rules.
This application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture above, including:
Obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawling requirement;
After the crawler module receives the crawler task, calling from the crawler service module the target crawler service that corresponds to the crawling requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service;
Storing the crawled raw crawler data in a preset first storage module.
Further, the step of sending the crawler task to the crawler module includes:
The task release module sends the crawler task to the crawler module in the form of a message queue.
Further, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes:
Cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
Further, the method also includes:
Collecting the log data of the other modules in the distributed crawler system architecture with the log and error handling module, and extracting the error logs from the log data;
Handling the events corresponding to the error logs according to preset rules.
Further, after the step of handling the events corresponding to the error logs according to the preset rules, the method includes:
Generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
Further, the step of handling the events corresponding to the error logs according to the preset rules includes:
Determining, with the log and error handling module, whether the event is a crawler failure;
If the event is a crawler failure, republishing the crawler task that corresponds to the event.
Further, the method also includes:
Determining whether a management command from the back-end management module has been received;
If so, processing the management command with priority.
This application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor implements the steps of any of the methods above when executing the computer program.
This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the methods above.
In the distributed crawler system architecture, the method for crawling data, the computer device, and the computer-readable storage medium of this application, the architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level needs of the entire crawler system as modular services, which reduces developers' workload, places no restriction on the development language, and lowers the skill required of developers. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
Description of the drawings
Fig. 1 is a schematic structural block diagram of the distributed crawler system architecture of an embodiment of this application;
Fig. 2 is a schematic flowchart of the method for crawling data with a distributed crawler of an embodiment of this application;
Fig. 3 is a schematic structural block diagram of the computer device of an embodiment of this application.
The realization of the purpose, the functional features, and the advantages of this application will be further described with reference to the accompanying drawings and the embodiments.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.
Referring to Fig. 1, this application first proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. The architecture includes:
A task release module 10, for publishing crawler tasks;
A crawler service module 20, for storing different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
A crawler module 30, for receiving the crawler task published by the task release module, calling from the crawler service module the crawler service that corresponds to the crawler task, and using that crawler service to crawl the target website to obtain the corresponding raw crawler data;
A first data storage module 40, for storing the raw crawler data;
A data cleaning module 50, for cleaning the raw crawler data in the first data storage module to obtain the filtered first crawler data;
A second data storage module 60, for storing the first crawler data;
A back-end management module 70, for providing a visual interface on which human-computer interaction takes place;
A log and error handling module 80, for collecting the logs generated by the other modules in the system architecture, extracting the error logs from those logs, and handling the events corresponding to the error logs according to preset rules.
In this embodiment, the architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. In this architecture, Docker containers package and integrate the system environment, the module services, and the storage system, and scripts can deploy and start the system in one step. To deploy to a new environment, only the container files need to be migrated to complete the migration of the system, and running the startup script completes the deployment. In this architecture, crawler programs are not restricted by the development language of the underlying system to a single unified language; the system can provide the basic software support for using the modules to different development languages. A crawler developer therefore only needs to write the code that implements the business logic, form it into a corresponding crawler service, and pass it into the crawler service module to complete the development and maintenance of the whole crawler program.
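The text does not spell out how HTTP service registration works, so the following is only a minimal sketch under one common reading: each module registers the HTTP endpoint where it can be reached, and other modules resolve it by name. The `ServiceRegistry` class, the module names, and the URLs are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServiceRegistry:
    """In-memory stand-in for an HTTP service registry (hypothetical)."""

    _endpoints: Dict[str, str] = field(default_factory=dict)

    def register(self, name: str, url: str) -> None:
        # A module announces the HTTP endpoint where it can be reached.
        self._endpoints[name] = url

    def lookup(self, name: str) -> str:
        # Callers resolve a module by name instead of hard-coding its address.
        try:
            return self._endpoints[name]
        except KeyError:
            raise LookupError(f"service not registered: {name}") from None


# Assumed module names and addresses, for illustration only.
registry = ServiceRegistry()
registry.register("crawler-service", "http://crawler-service.internal:8000")
registry.register("data-cleaning", "http://data-cleaning.internal:8001")
```

Because callers depend only on a registered name, a module could in principle be moved or replaced without changes to the modules that use it, which is one way to read the isolation the text describes.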
Referring to Fig. 2, an embodiment of this application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the embodiment above, including the steps:
S1, obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawling requirement;
S2, after the crawler module receives the crawler task, calling from the crawler service module the target crawler service that corresponds to the crawling requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service;
S3, storing the crawled raw crawler data in a preset first storage module.
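Steps S1 to S3 can be sketched as a small in-process pipeline. This is an illustration under stated assumptions, not the patented implementation: `queue.Queue` stands in for the message queue, a list stands in for the first storage module, and `fetch_listing` is a hypothetical crawler service that returns a placeholder string instead of fetching a page.

```python
import queue
from typing import Callable, Dict, List


def fetch_listing(target_website: str) -> str:
    # Hypothetical crawler service; a real one would fetch pages from the site.
    return f"<raw listing data from {target_website}>"


# Crawling requirement -> crawler service (assumed names).
CRAWLER_SERVICES: Dict[str, Callable[[str], str]] = {"listing": fetch_listing}

task_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for the message queue
first_storage: List[dict] = []  # stand-in for the first storage module


def publish_task(target_website: str, requirement: str) -> None:
    # S1: the task release module sends the task to the crawler module.
    task_queue.put({"target_website": target_website, "requirement": requirement})


def run_crawler_once() -> None:
    # S2: take one task and call the service matching its crawling requirement.
    task = task_queue.get()
    service = CRAWLER_SERVICES[task["requirement"]]
    raw = service(task["target_website"])
    # S3: store the raw crawler data in the first storage module.
    first_storage.append({"task": task, "raw": raw})


publish_task("https://example.com", "listing")
run_crawler_once()
```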
As described in step S1, the crawler task includes the target website and the crawling requirement of the task. The target website is the data source for this crawl. The crawling requirement specifies the kind of data to crawl, for example specified data or the data of a specified function on the target website. A crawler task can be obtained in several ways, for example by receiving a crawler task entered directly by a user or by receiving a crawler task generated by the system. One crawler task may contain several crawling requirements, for example a requirement to crawl login data together with a requirement to crawl the image recognition data of a verification code.
As described in step S2, a crawler service is a service that can complete the corresponding crawling task. The crawler service module provides one or more preset crawler services. These are usually common services that correspond to crawling requirements, such as a simulated login service, a verification-code image recognition service, or an IP proxy pool maintenance service. In a specific embodiment, the crawler service module holds a call list that stores crawling requirements and crawler services in a one-to-one mapping. After the crawling requirement of a crawler task is obtained, the call list is first searched for the identical crawling requirement, the target crawler service is then obtained through the mapping, and finally the target crawler service is called. When the crawler task contains several crawling requirements, the corresponding services are called at the same time. The target crawler service is then used to crawl data from the target website.
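The call list described above can be sketched as a dictionary from crawling requirement to crawler service. This is a minimal sketch: the two services are hypothetical placeholders that return strings, and running them on a thread pool is only one possible reading of calling several services "at the same time".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def simulated_login(site: str) -> str:
    # Placeholder for a simulated login service.
    return f"login data from {site}"


def captcha_recognition(site: str) -> str:
    # Placeholder for a verification-code image recognition service.
    return f"captcha data from {site}"


# Call list: one crawling requirement maps to exactly one crawler service.
CALL_LIST: Dict[str, Callable[[str], str]] = {
    "login": simulated_login,
    "captcha": captcha_recognition,
}


def call_services(site: str, requirements: List[str]) -> Dict[str, str]:
    # Look up every requirement first so an unknown one fails before any call.
    services = {req: CALL_LIST[req] for req in requirements}
    # Run the matched services concurrently, one worker per requirement.
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        futures = {req: pool.submit(svc, site) for req, svc in services.items()}
        return {req: fut.result() for req, fut in futures.items()}
```

A task with two requirements would be handled as `call_services("example.com", ["login", "captcha"])`; a requirement that is missing from the call list raises `KeyError` before any service runs.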
As described in step S3, the crawled data is stored in the first data storage module. The first storage module is generally a document storage system, which is relatively cheap and keeps storage costs down.
In one embodiment, the step of sending the crawler task to the crawler module includes:
S101, the task release module sends the crawler task to the crawler module in the form of a message queue.
As described in step S101, a message queue is a container. Sending crawler tasks through a message queue allows fast horizontal and distributed scaling for large-scale crawler tasks and improves the capacity to process crawler tasks.
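The scaling argument in step S101 can be illustrated with several consumers draining one queue. The sketch below uses threads and the standard-library `queue.Queue` as a stand-in for a real message broker; `run_workers` and the `crawled:` result strings are hypothetical. Raising `worker_count` corresponds to adding crawler instances.

```python
import queue
import threading
from typing import List


def run_workers(tasks: List[str], worker_count: int) -> List[str]:
    task_queue: "queue.Queue" = queue.Queue()
    done: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                return
            with lock:
                done.append(f"crawled:{task}")  # placeholder for real crawling
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return done
```

The publisher only puts tasks on the queue and never addresses a specific worker, so the number of consumers can change without touching the publishing side.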
In one embodiment, after step S3 of storing the crawled raw crawler data in the preset first storage module, the method includes:
S4, cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
As described in step S4, the data cleaning module has several cleaning rules, for example removing duplicate data or removing incomplete data; it can also first filter out the required data and then remove duplicates. The second storage module can be a sub-database set up inside the first storage module, for example a folder in the first storage module. In a specific embodiment, the second storage module is a database independent of the first storage module; it costs more than the first storage module but makes data management more convenient. Because the volume of raw crawler data is large, it is kept in the low-cost first storage module; because the volume of cleaned first crawler data is relatively small, it is kept in the second storage module, which is easier to manage but more expensive.
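Two of the cleaning rules named in step S4, removing incomplete data and removing duplicates, can be sketched as follows. The record layout and the `REQUIRED_FIELDS` schema are assumptions made for the example; the patent does not define a record format.

```python
from typing import Dict, Iterable, List, Tuple

REQUIRED_FIELDS: Tuple[str, ...] = ("url", "title")  # hypothetical schema


def clean(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    seen = set()
    cleaned: List[Dict[str, str]] = []
    for record in records:
        # Rule 1: drop incomplete records (a required field is missing or empty).
        if any(not record.get(name) for name in REQUIRED_FIELDS):
            continue
        # Rule 2: drop duplicates, keeping the first record for each key.
        key = tuple(record[name] for name in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

The output of `clean` corresponds to the first crawler data that would be written to the second storage module, while the unfiltered input stays in the first.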
In one embodiment, the method for crawling data with a distributed crawler also includes:
S5, collecting the log data of the other modules in the distributed crawler system architecture with the log and error handling module, and extracting the error logs from the log data;
S6, handling the events corresponding to the error logs according to preset rules.
In this embodiment, the method for crawling data with a distributed crawler relies on the distributed crawler system architecture above. Executing steps such as cleaning the raw crawler data or crawling data generates corresponding log data. This application collects the log data, filters out the error logs with an existing log analysis method, finds the corresponding event from each error log, and handles it automatically, for example by automatically re-executing the step that generated the error log.
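Steps S5 and S6 start from collecting logs and isolating the error entries. The sketch below assumes a simple `LEVEL message` line format and hypothetical module names; the patent only refers to "an existing log analysis method" and does not fix a format.

```python
from typing import Dict, List, NamedTuple


class LogEntry(NamedTuple):
    module: str
    level: str
    message: str


def collect_logs(logs_by_module: Dict[str, List[str]]) -> List[LogEntry]:
    # S5 (first half): gather log lines from every module into one list.
    entries: List[LogEntry] = []
    for module, lines in logs_by_module.items():
        for line in lines:
            level, _, message = line.partition(" ")  # assumed "LEVEL message"
            entries.append(LogEntry(module, level, message))
    return entries


def error_logs(entries: List[LogEntry]) -> List[LogEntry]:
    # S5 (second half): keep only the entries logged at ERROR level.
    return [entry for entry in entries if entry.level == "ERROR"]
```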
In one embodiment, after step S6 of handling the events corresponding to the error logs according to preset rules, the method includes:
S7, generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
As described in step S7, the mail content is generated according to preset requirements from the error log, the result of handling the event, and so on, and the mail is then sent to the preset mailbox. The mailbox can be the mailbox of a designated developer. There can also be several different mailboxes, each corresponding to one developer, so that developers learn of errors promptly. Further, read receipts from the mailboxes are received; as soon as one receipt arrives, the mails sent to the other mailboxes, which do not correspond to that receipt, are withdrawn, so that several developers do not handle the same problem at the same time after seeing the mail.
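Building the mail for step S7 can be sketched with Python's standard `email.message.EmailMessage`. The sender address, the subject format, and the body layout are assumptions; actually sending the message (for example over SMTP) and the receipt-based withdrawal described above are left out.

```python
from email.message import EmailMessage
from typing import List


def build_error_report(event: str, error_log: str, outcome: str,
                       recipients: List[str]) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = f"[crawler] error report: {event}"
    msg["From"] = "crawler-alerts@example.com"  # hypothetical sender address
    msg["To"] = ", ".join(recipients)  # one mailbox per developer
    # Body: the error log plus the result of handling the event.
    msg.set_content(
        f"Event: {event}\nError log: {error_log}\nHandling result: {outcome}\n"
    )
    return msg
```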
In one embodiment, the above step of handling the events corresponding to the error logs according to preset rules includes:
S71, determining, with the log and error handling module, whether the event is a crawler failure;
S72, if the event is a crawler failure, republishing the crawler task that corresponds to the event.
As described in steps S71 and S72, when a crawler fails, developers can be notified by mail promptly, the cause of the error is recorded, and the error handling logic puts the crawler task back into the message queue so that it is crawled again. This improves the stability of the process and provides automated operations and maintenance.
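Steps S71 and S72 amount to a check followed by a re-enqueue. In this minimal sketch the event is a plain dictionary with hypothetical `type` and `task` keys, and `queue.Queue` again stands in for the message queue.

```python
import queue
from typing import Any, Dict


def handle_event(event: Dict[str, Any], task_queue: "queue.Queue") -> bool:
    # S71: decide whether the event is a crawler failure.
    if event.get("type") != "crawler_failure":
        return False
    # S72: put the failed task back on the queue so it is crawled again.
    task_queue.put(event["task"])
    return True
```

The text does not mention a retry limit. A deployment would probably want one so that a permanently failing task is not republished forever, but that is an addition, not something the source describes.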
In one embodiment, the method for crawling data with a distributed crawler also includes:
S8, determining whether a management command from the back-end management module has been received;
S9, if so, processing the management command with priority.
In this embodiment, the back-end management module monitors and manages the entire crawler system through web-based management. Through the back-end management module, a crawler process can be started by uploading scripts and configuration; crawler tasks with performance bottlenecks can be observed and the crawler module scaled in real time; and all crawler tasks in the system can be monitored and their data analyzed.
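One way to process management commands with priority, as in steps S8 and S9, is a priority queue in which management commands outrank crawler tasks. The sketch below uses the standard `queue.PriorityQueue`; the two priority levels and the payload strings are assumptions, and the patent does not say how priority is implemented.

```python
import itertools
import queue
from typing import Any, Tuple

MANAGEMENT, CRAWL = 0, 1  # lower number is served first (assumed levels)

_sequence = itertools.count()  # tie-breaker keeps arrival order within a level
work: "queue.PriorityQueue[Tuple[int, int, Any]]" = queue.PriorityQueue()


def submit(kind: int, payload: Any) -> None:
    # Entries are ordered by (priority, arrival number); payloads are never compared.
    work.put((kind, next(_sequence), payload))


def next_item() -> Any:
    # Returns a pending management command before any crawler task.
    return work.get()[2]
```

If a crawl task, a management command, and another crawl task are submitted in that order, `next_item()` hands back the management command first and then the two crawl tasks in their arrival order.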
The method for crawling data with a distributed crawler of the embodiments of this application is based on the distributed crawler system architecture above. The architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level needs of the entire crawler system as modular services, which reduces developers' workload, places no restriction on the development language, and lowers the skill required of developers. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
Referring to Fig. 3, an embodiment of this application also provides a computer device. The computer device can be the management server above or the server corresponding to a management node, and its internal structure can be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as the modules of the distributed crawler system architecture. The network interface of the computer device communicates with external terminals over a network connection. The computer program, when executed by the processor, implements a method for crawling data with a distributed crawler.
The processor executes the method for crawling data with a distributed crawler, based on the distributed crawler system architecture in the embodiments above, including: obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawling requirement; after the crawler module receives the crawler task, calling from the crawler service module the target crawler service that corresponds to the crawling requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled raw crawler data in a preset first storage module.
In one embodiment, the step of sending the crawler task to the crawler module includes: the task release module sends the crawler task to the crawler module in the form of a message queue.
In one embodiment, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes: cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
In one embodiment, the method for crawling data with a distributed crawler also includes: collecting the log data of the other modules in the distributed crawler system architecture with the log and error handling module, and extracting the error logs from the log data; and handling the events corresponding to the error logs according to preset rules.
In one embodiment, after the step of handling the events corresponding to the error logs according to preset rules, the method includes: generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
In one embodiment, the step of handling the events corresponding to the error logs according to preset rules includes: determining, with the log and error handling module, whether the event is a crawler failure; and if the event is a crawler failure, republishing the crawler task that corresponds to the event.
In one embodiment, the method for crawling data with a distributed crawler also includes: determining whether a management command from the back-end management module has been received; and if so, processing the management command with priority.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied.
The computer device of the embodiments of this application is based on the distributed crawler system architecture above. The architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level needs of the entire crawler system as modular services, which reduces developers' workload, places no restriction on the development language, and lowers the skill required of developers. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
An embodiment of this application also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method for crawling data with a distributed crawler, based on the distributed crawler system architecture in the embodiments above, including: obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawling requirement; after the crawler module receives the crawler task, calling from the crawler service module the target crawler service that corresponds to the crawling requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled raw crawler data in a preset first storage module.
The method for crawling data with a distributed crawler is based on the distributed crawler system architecture above. The architecture is designed around HTTP service registration, which isolates the modules from one another, and the modules access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level needs of the entire crawler system as modular services, which reduces developers' workload, places no restriction on the development language, and lowers the skill required of developers. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
In one embodiment, the step of sending the crawler task to the crawler module comprises: the task release module sending the crawler task to the crawler module in the form of a message queue.
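One possible shape for such a message is a JSON document carrying the two fields the crawler task is said to contain. The following sketch assumes JSON as the wire format and an in-process `queue.Queue` as the queue; neither is specified by the application, and `publish_task` and `receive_task` are hypothetical names.

```python
import json
import queue

message_queue = queue.Queue()


def publish_task(target_website: str, crawl_requirement: str) -> None:
    """Task release module: serialize the crawler task and enqueue it."""
    message = json.dumps(
        {"target_website": target_website, "crawl_requirement": crawl_requirement}
    )
    message_queue.put(message)


def receive_task() -> dict:
    """Crawler module: take the next message off the queue and decode it."""
    return json.loads(message_queue.get_nowait())


publish_task("https://example.com", "page_title")
task = receive_task()
print(task["target_website"], task["crawl_requirement"])
```

A text-based message format is one way to keep the modules language-independent, since any module that can parse JSON can consume the task.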
In one embodiment, after the step of storing the crawled original crawler data in the preset first storage module, the method comprises: cleaning the original crawler data in the first storage module by means of a data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
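The application does not say which cleaning rules are applied. As a hedged illustration only, the sketch below assumes three common ones (stripping HTML tags, normalizing whitespace, and dropping empty or duplicate records) and uses two in-memory lists in place of the first and second storage modules; `clean_record` and `run_cleaning` are hypothetical names.

```python
import html
import re
from typing import List, Optional

# First storage module: raw crawler data (sample records for illustration).
first_storage = [
    {"url": "https://example.com/a", "raw": "<p>Price:&nbsp;42 </p>"},
    {"url": "https://example.com/b", "raw": "   "},
    {"url": "https://example.com/a", "raw": "<p>Price:&nbsp;42 </p>"},
]
# Second storage module: cleaned crawler data.
second_storage: List[dict] = []

TAG_RE = re.compile(r"<[^>]+>")


def clean_record(record: dict) -> Optional[dict]:
    """Strip tags, decode entities, collapse whitespace; None if empty."""
    text = TAG_RE.sub(" ", record["raw"])
    text = html.unescape(text)
    text = " ".join(text.split())  # split() also treats U+00A0 as whitespace
    if not text:
        return None
    return {"url": record["url"], "text": text}


def run_cleaning() -> None:
    """Clean every raw record and store each distinct result once."""
    seen = set()
    for record in first_storage:
        cleaned = clean_record(record)
        if cleaned is None:
            continue
        key = (cleaned["url"], cleaned["text"])
        if key not in seen:
            seen.add(key)
            second_storage.append(cleaned)


run_cleaning()
print(second_storage)
```

Keeping the raw data in the first store untouched means the cleaning rules can be changed and re-run later without crawling the site again.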
In one embodiment, the method further comprises: obtaining, by means of the log and error handling module, log data of the other modules in the distributed crawler system architecture, and obtaining the error logs in the log data; and handling the event corresponding to each error log according to preset rules.
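A minimal sketch of this step filters the collected log entries by level and then looks up an action for each error event in a rule table. The log entry fields, the event names and the action names in `PRESET_RULES` are all assumptions made for illustration; the application does not define them.

```python
from typing import Dict, List, Tuple

# Log data gathered from the other modules (sample entries, hypothetical fields).
log_data = [
    {"module": "crawler", "level": "INFO", "event": "task_started", "task_id": "t1"},
    {"module": "crawler", "level": "ERROR", "event": "crawl_failed", "task_id": "t1"},
    {"module": "data_cleaning", "level": "ERROR", "event": "parse_error", "task_id": "t2"},
    {"module": "storage", "level": "ERROR", "event": "disk_full", "task_id": "t3"},
]

# Preset rules: which action to take for which error event (assumed names).
PRESET_RULES: Dict[str, str] = {
    "crawl_failed": "reissue_task",
    "parse_error": "skip_record",
}
DEFAULT_ACTION = "report_only"


def extract_error_logs(logs: List[dict]) -> List[dict]:
    """Keep only the entries logged at ERROR level."""
    return [entry for entry in logs if entry["level"] == "ERROR"]


def plan_actions(logs: List[dict]) -> List[Tuple[str, str]]:
    """Pair each error event's task id with the action its rule prescribes."""
    return [
        (entry["task_id"], PRESET_RULES.get(entry["event"], DEFAULT_ACTION))
        for entry in extract_error_logs(logs)
    ]


actions = plan_actions(log_data)
print(actions)
```

Holding the rules in a table rather than in branching code lets new error types be handled by adding an entry, with unknown events falling back to a default action.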
In one embodiment, after the step of handling the event corresponding to the error log according to the preset rules, the method comprises: generating an error report for the event by means of the log and error handling module, and sending the error report to a preset mailbox.
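The report itself could be assembled as an e-mail message. The sketch below builds the message with Python's standard `email.message.EmailMessage` and deliberately stops short of sending it; the addresses, the report layout and the function name `build_error_report` are hypothetical, and actual delivery (for example through `smtplib.SMTP(...).send_message(msg)`) would need a mail server that the application does not describe.

```python
from email.message import EmailMessage

PRESET_MAILBOX = "ops@example.com"        # hypothetical preset mailbox
REPORT_SENDER = "crawler-logs@example.com"  # hypothetical sender address


def build_error_report(event: dict) -> EmailMessage:
    """Turn one error event into an e-mail message addressed to the mailbox."""
    msg = EmailMessage()
    msg["To"] = PRESET_MAILBOX
    msg["From"] = REPORT_SENDER
    msg["Subject"] = f"[crawler] {event['event']} in module {event['module']}"
    msg.set_content(
        "Error report\n"
        f"module: {event['module']}\n"
        f"event: {event['event']}\n"
        f"task_id: {event['task_id']}\n"
    )
    return msg


report = build_error_report(
    {"module": "crawler", "event": "crawl_failed", "task_id": "t1"}
)
print(report["Subject"])
```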
In one embodiment, the step of handling the event corresponding to the error log according to the preset rules comprises: judging, by means of the log and error handling module, whether the event is a crawler failure; and if the event is a crawler failure, re-issuing the crawler task corresponding to the event.
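Re-issuing a failed task amounts to putting the original task back on the queue. The sketch below shows one way this might look; the event fields, the `published_tasks` lookup table and the function names are hypothetical, and the cap on re-issues (`MAX_ATTEMPTS`) is an added assumption to avoid endless retries, not something the application states.

```python
import queue

task_queue = queue.Queue()

# Tasks that were issued earlier, looked up by id when a failure is reported.
published_tasks = {
    "t1": {"target_website": "https://example.com", "crawl_requirement": "page_title"},
}

MAX_ATTEMPTS = 3  # assumption: the application does not limit re-issues
attempts = {}


def is_crawl_failure(event: dict) -> bool:
    """Judge whether the event reports a failed crawl."""
    return event.get("event") == "crawl_failed"


def handle_event(event: dict) -> bool:
    """Re-issue the task behind a crawl failure; return True if re-issued."""
    if not is_crawl_failure(event):
        return False
    task_id = event["task_id"]
    attempts[task_id] = attempts.get(task_id, 0) + 1
    if attempts[task_id] > MAX_ATTEMPTS:
        return False
    task_queue.put(published_tasks[task_id])
    return True


reissued = handle_event({"event": "crawl_failed", "task_id": "t1"})
print(reissued, task_queue.qsize())
```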
In one embodiment, the method further comprises: judging whether a management command from the back-end management module has been received; and if so, processing the management command with priority.
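Priority handling of management commands can be modeled with a priority queue in which commands outrank ordinary crawler tasks. This is a sketch under assumptions: the two priority levels, the sample payloads and the names `submit` and `drain` are hypothetical, and Python's `queue.PriorityQueue` stands in for whatever mechanism a real deployment would use.

```python
import itertools
import queue

MANAGEMENT, CRAWL = 0, 1    # lower number is served first
_order = itertools.count()  # tie-breaker keeps FIFO order within a priority
inbox = queue.PriorityQueue()


def submit(priority: int, payload: str) -> None:
    """Queue a management command or a crawler task with its priority."""
    inbox.put((priority, next(_order), payload))


def drain() -> list:
    """Process everything queued, management commands first."""
    processed = []
    while not inbox.empty():
        _, _, payload = inbox.get()
        processed.append(payload)
    return processed


submit(CRAWL, "crawl task A")
submit(CRAWL, "crawl task B")
submit(MANAGEMENT, "pause crawler service X")
order = drain()
print(order)
```

Although the management command is submitted last, it is processed first, while the two crawler tasks keep their original relative order.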
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The above are merely preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (10)

1. A distributed crawler system architecture, characterized in that the architecture is designed around HTTP service registration, different modules are isolated from one another, and the different modules access one another through message queues, the architecture comprising:
a task release module, configured to issue crawler tasks;
a crawler service module, configured to store different crawler services that exist in service form, the different crawler services completing different crawler tasks;
a crawler module, configured to receive the crawler task issued by the task release module, to call from the crawler service module, according to the crawler task, the crawler service corresponding to the crawler task, and to use the crawler service to crawl a target website and obtain the corresponding original crawler data;
a first data storage module, configured to store the original crawler data.
2. A method for crawling data with a distributed crawler, based on the distributed crawler system architecture according to claim 1, characterized by comprising:
obtaining a crawler task by means of the task release module, and sending the crawler task to the crawler module, the crawler task comprising a target website and a crawling requirement;
after obtaining the crawler task, the crawler module calling, from the crawler service module, a target crawler service corresponding to the crawling requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module encapsulates at least one crawler service packaged in service form;
storing the crawled original crawler data in a preset first storage module.
3. The method for crawling data with a distributed crawler according to claim 2, characterized in that the step of sending the crawler task to the crawler module comprises:
the task release module sending the crawler task to the crawler module in the form of a message queue.
4. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further comprises a data cleaning module and a second storage module, and after the step of storing the crawled original crawler data in the preset first storage module, the method comprises:
cleaning the original crawler data in the first storage module by means of the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in the preset second storage module.
5. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further comprises a log and error handling module, and the method further comprises:
obtaining, by means of the log and error handling module, log data of the other modules in the distributed crawler system architecture, and obtaining the error logs in the log data;
handling the event corresponding to each error log according to preset rules.
6. The method for crawling data with a distributed crawler according to claim 5, characterized in that after the step of handling the event corresponding to the error log according to the preset rules, the method comprises:
generating an error report for the event by means of the log and error handling module, and sending the error report to a preset mailbox.
7. The method for crawling data with a distributed crawler according to claim 5, characterized in that the step of handling the event corresponding to the error log according to the preset rules comprises:
judging, by means of the log and error handling module, whether the event is a crawler failure;
if the event is a crawler failure, re-issuing the crawler task corresponding to the event.
8. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further comprises a back-end management module, and the method further comprises:
judging whether a management command from the back-end management module has been received;
if so, processing the management command with priority.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 2 to 8.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 2 to 8.
CN201910601110.6A 2019-07-04 2019-07-04 Distributed crawler system architecture, method for crawling data and computer equipment Active CN110457556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910601110.6A CN110457556B (en) 2019-07-04 2019-07-04 Distributed crawler system architecture, method for crawling data and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601110.6A CN110457556B (en) 2019-07-04 2019-07-04 Distributed crawler system architecture, method for crawling data and computer equipment

Publications (2)

Publication Number Publication Date
CN110457556A true CN110457556A (en) 2019-11-15
CN110457556B CN110457556B (en) 2023-11-14

Family

ID=68482277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601110.6A Active CN110457556B (en) 2019-07-04 2019-07-04 Distributed crawler system architecture, method for crawling data and computer equipment

Country Status (1)

Country Link
CN (1) CN110457556B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929127A (en) * 2019-12-05 2020-03-27 广州市原象信息科技有限公司 Method for analyzing Taobao live broadcast putting effect and computer equipment
CN110929128A (en) * 2019-12-11 2020-03-27 北京启迪区块链科技发展有限公司 Data crawling method, device, equipment and medium
CN111143336A (en) * 2019-11-27 2020-05-12 三盟科技股份有限公司 College scientific research data management-oriented web crawler management method and platform
CN111192155A (en) * 2019-12-25 2020-05-22 杭州龙席网络科技股份有限公司 Social media inquiry plate identification and recommendation method based on SAAS
CN111241366A (en) * 2019-12-25 2020-06-05 杭州龙席网络科技股份有限公司 Client social media monitoring method based on SAAS
CN111241373A (en) * 2020-02-20 2020-06-05 山东爱城市网信息技术有限公司 Webpage crawler system based on micro-service and implementation method
CN111708931A (en) * 2020-06-06 2020-09-25 谢国柱 Big data acquisition method based on mobile internet and artificial intelligence cloud service platform
CN112597367A (en) * 2020-11-30 2021-04-02 国网北京市电力公司 Data information fusion system and target decision generation method
CN112650908A (en) * 2020-12-25 2021-04-13 百果园技术(新加坡)有限公司 Data processing method, system and device based on network theme crawler

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100269168A1 (en) * 2009-04-21 2010-10-21 Brightcloud Inc. System And Method For Developing A Risk Profile For An Internet Service
US20110208848A1 (en) * 2008-08-05 2011-08-25 Zhiyong Feng Network system of web services based on semantics and relationships
CN102932448A (en) * 2012-10-30 2013-02-13 工业和信息化部电信传输研究所 Distributed network crawler URL (uniform resource locator) duplicate removal system and method
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN105677918A (en) * 2016-03-03 2016-06-15 浪潮软件股份有限公司 Distributed crawler architecture based on Kafka and Quartz and implementation method thereof
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 Data acquisition method and related device
CN106874487A (en) * 2017-02-21 2017-06-20 国信优易数据有限公司 Distributed crawler management system and method
CN107135092A (en) * 2017-03-15 2017-09-05 浙江工业大学 Web service clustering method for a global social service network
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 Distributed crawler framework and implementation method based on an in-memory database
CN108170551A (en) * 2018-01-03 2018-06-15 深圳壹账通智能科技有限公司 Front and back end error handling method, server and storage medium based on crawler system
CN109492149A (en) * 2018-11-29 2019-03-19 深圳墨世科技有限公司 Crawler task processing method and device
CN109508422A (en) * 2018-12-05 2019-03-22 南京邮电大学 Highly concealed crawler system with multithreaded intelligent scheduling
CN109815384A (en) * 2019-01-29 2019-05-28 携程旅游信息技术(上海)有限公司 Method, system, equipment and the storage medium that crawler is realized

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
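董禹龙 (Dong Yulong) et al.: "主动获取式的分布式网络爬虫集群方法研究" (Research on an active-acquisition distributed web crawler cluster method), 《计算机科学》 (Computer Science) *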

Also Published As

Publication number Publication date
CN110457556B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN110457556A (en) Distributed crawler system architecture, method for crawling data and computer equipment
CN105243159B (en) Distributed web crawler system based on a visual script editor
Jain et al. Cloud to edge: distributed deployment of process-aware IoT applications
US10956013B2 (en) User interface for automated flows within a cloud based developmental platform
US10147066B2 (en) Business process framework
CN111404759A (en) Service detection method, rule configuration method, related device and medium
CN108810025A (en) A kind of security assessment method of darknet, server and computer-readable medium
CN102710793A (en) Network printing system based on cloud computing and data storage method thereof
Gupta et al. A QoS-supported approach using fault detection and tolerance for achieving reliability in dynamic orchestration of web services
CN111143167A (en) Alarm merging method, device, equipment and storage medium for multiple platforms
CN116048467A (en) Micro-service development platform and business system development method
CN102508773A (en) Method and device for monitoring WEB service system simulation based on Internet explorer (IE) kernel
CN112738138A (en) Cloud security hosting method, device, equipment and storage medium
CN106202399A (en) Method for implementing data management system of big data
CN107995062B (en) RPC-based traffic management integrated platform remote service real-time processing method and system
CN103118248B (en) Monitoring method, monitoring agent, monitoring server and system
Prist et al. Cyber-physical manufacturing systems: An architecture for sensor integration, production line simulation and cloud services
CN105262845B (en) A kind of document transmission processing method and system
CN102694676B (en) Management system, management equipment and management method
CN116136801B (en) Cloud platform data processing method and device, electronic equipment and storage medium
Platenius-Mohr et al. An analysis of use cases for the asset administration shell in the context of edge computing
CN106886453A (en) Information processing method, device and system for asynchronous multiple tracks
CN111447273A (en) Cloud processing system and data processing method based on cloud processing system
Ritter et al. Modeling exception flows in integration systems
CN110333930A (en) Digital Platform system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240326

Address after: Room 101-1, Building 2, No. 95, Daguan Middle Road, Tianhe District, Guangzhou, Guangdong 510000 (office only)

Patentee after: Guangzhou Zhongtian Technology Consulting Co.,Ltd.

Country or region after: China

Address before: 400010 38 / F, 39 / F, unit 1, 99 Wuyi Road, Yuzhong District, Chongqing

Patentee before: CHONGQING FINANCIAL ASSETS EXCHANGE Co.,Ltd.

Country or region before: China
