CN110457556A - Distributed crawler system architecture, method for crawling data, and computer device - Google Patents
- Publication number
- CN110457556A (application CN201910601110.6A)
- Authority
- CN
- China
- Prior art keywords
- crawler
- module
- data
- task
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Debugging And Monitoring (AREA)
Abstract
This application discloses a distributed crawler system architecture, a method for crawling data, and a computer device. The method includes: obtaining a crawler task through a task release module and sending the crawler task to a crawler module; after the crawler module obtains the crawler task, calling, from a crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website; and storing the crawled raw crawler data in a preset first storage module. In the distributed crawler system architecture, the method for crawling data, and the computer device of this application, a crawler service module is provided that encapsulates the low-level requirements of the whole crawler system as modular services. This reduces developers' workload, places no restriction on the development language they use, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system.
Description
Technical field
This application relates to the field of data collection, and in particular to a distributed crawler system architecture, a method for crawling data, and a computer device.
Background
Current crawler platforms are mostly custom-developed for a single business scenario, and different crawlers always need separately written modules for each requirement. As a result, most crawler systems do not take the stability and generality of the whole system into account, and developers' development and maintenance efficiency is low.
Summary of the invention
The main purpose of this application is to provide a distributed crawler system architecture, a method for crawling data, and a computer device, aiming to solve the problems that distributed crawler systems in the prior art have poor stability and generality and that developers' development and maintenance efficiency is low.
To achieve the above purpose, this application proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. The architecture includes:
A task release module, for publishing crawler tasks;
A crawler service module, for storing different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
A crawler module, for receiving the crawler task published by the task release module, calling from the crawler service module the crawler service corresponding to that crawler task, and using the crawler service to crawl the target website to obtain the corresponding raw crawler data;
A first data storage module, for storing the raw crawler data;
A data cleaning module, for cleaning the raw crawler data in the first data storage module to obtain the screened first crawler data;
A second data storage module, for storing the first crawler data;
A back-end management module, for providing a visual interface on which human-computer interaction takes place;
A log and error handling module, for obtaining the logs generated by the other modules in the system architecture, obtaining the error logs among those logs, and handling the events corresponding to the error logs according to preset rules.
This application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture above, including:
Obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement;
After the crawler module obtains the crawler task, calling, from the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module contains at least one crawler service encapsulated in the form of a service;
Storing the crawled raw crawler data in a preset first storage module.
Further, the step of sending the crawler task to the crawler module includes:
The task release module sends the crawler task to the crawler module through a message queue.
Further, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes:
Cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
Further, the method also includes:
Obtaining, with the log and error handling module, the log data of the other modules in the distributed crawler system architecture, and obtaining the error logs in that log data;
Handling the events corresponding to the error logs according to preset rules.
Further, after the step of handling the events corresponding to the error logs according to preset rules, the method includes:
Generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
Further, the step of handling the events corresponding to the error logs according to preset rules includes:
Determining, with the log and error handling module, whether the event is a crawl failure;
If the event is a crawl failure, publishing the crawler task corresponding to the event again.
Further, the method also includes:
Determining whether a management command sent by the back-end management module has been received;
If so, processing the management command with priority.
This application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor implements the steps of any of the methods above when executing the computer program.
This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the methods above.
In the distributed crawler system architecture, the method for crawling data with a distributed crawler, the computer device, and the computer-readable storage medium of this application, the architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing and makes it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
Brief description of the drawings
Fig. 1 is a schematic structural block diagram of the distributed crawler system architecture of an embodiment of this application;
Fig. 2 is a schematic flowchart of the method for crawling data with a distributed crawler of an embodiment of this application;
Fig. 3 is a schematic structural block diagram of the computer device of an embodiment of this application.
The realization of the purpose, the functional features, and the advantages of this application are further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.
Referring to Fig. 1, this application first proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. The architecture includes:
A task release module 10, for publishing crawler tasks;
A crawler service module 20, for storing different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
A crawler module 30, for receiving the crawler task published by the task release module, calling from the crawler service module the crawler service corresponding to that crawler task, and using the crawler service to crawl the target website to obtain the corresponding raw crawler data;
A first data storage module 40, for storing the raw crawler data;
A data cleaning module 50, for cleaning the raw crawler data in the first data storage module to obtain the screened first crawler data;
A second data storage module 60, for storing the first crawler data;
A back-end management module 70, for providing a visual interface on which human-computer interaction takes place;
A log and error handling module 80, for obtaining the logs generated by the other modules in the system architecture, obtaining the error logs among those logs, and handling the events corresponding to the error logs according to preset rules.
In this embodiment, the architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing and makes it easy to scale the system horizontally when more processing capacity is needed. In this architecture, Docker containerization is used to package and integrate the system environment, the module services, and the storage system, so that the system can be deployed and started with one click through a script. When the system has to be deployed to a new environment, migrating the container files completes the migration of the system, and running the startup script completes its deployment. In this architecture, crawler programs are not restricted by the development language of the underlying system and are not forced to use one unified language; the system can provide the basic software support that different development languages need to use the modules. A crawler developer therefore only has to write the code that implements the business logic, form the corresponding crawler service, and pass it into the crawler service module to complete the development and maintenance of the whole crawler program.
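The registration-based isolation described above can be pictured as a small lookup table that maps module names to HTTP endpoints, so that a caller depends only on a name and never on another module's code. The following is a minimal in-memory sketch, not the patent's implementation: `ServiceRegistry`, the module name, and the URL are hypothetical, and the source does not specify how registration is actually carried out.

```python
class ServiceRegistry:
    """Hypothetical in-memory registry mapping module names to HTTP endpoints."""

    def __init__(self):
        self._endpoints = {}

    def register(self, name, url):
        # A module announces the endpoint under which it can be reached.
        self._endpoints[name] = url

    def lookup(self, name):
        # Callers resolve a module by name instead of importing its code.
        if name not in self._endpoints:
            raise LookupError(f"no module registered under {name!r}")
        return self._endpoints[name]


registry = ServiceRegistry()
registry.register("crawler-service", "http://crawler-service.internal:8000")
```

Because callers only hold a name, a module could in principle be rewritten in another language or moved to another host without changing the modules that use it.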
Referring to Fig. 2, an embodiment of this application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the embodiment above, including the steps:
S1, obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement;
S2, after the crawler module obtains the crawler task, calling, from the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module contains at least one crawler service encapsulated in the form of a service;
S3, storing the crawled raw crawler data in a preset first storage module.
As described in step S1 above, the crawler task includes the target website and the crawl requirement of the task. The target website is the data source for this crawl. The crawl requirement specifies the type of data to crawl, for example specified data or the data of a specified function on the target website. Crawler tasks can be obtained in several ways, for example by receiving a crawler task entered directly by a user or by receiving one generated by the system. A crawler task may contain several crawl requirements, for example one to crawl login data and another to crawl captcha image recognition data.
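A crawler task as described in step S1 can be modeled as a small record holding a target website and a list of crawl requirements. The sketch below is illustrative only: the class name `CrawlerTask`, its field names, and the JSON message format are assumptions, since the source does not define a concrete task format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlerTask:
    """Hypothetical task record: one target website, one or more crawl requirements."""

    target_site: str
    requirements: List[str] = field(default_factory=list)

    def to_message(self) -> str:
        # Serialize the task so that it can travel between modules as text.
        return json.dumps(
            {"target_site": self.target_site, "requirements": self.requirements}
        )

    @classmethod
    def from_message(cls, message: str) -> "CrawlerTask":
        # Rebuild the task on the receiving side.
        data = json.loads(message)
        return cls(data["target_site"], list(data["requirements"]))


task = CrawlerTask("https://example.com", ["login", "captcha"])
restored = CrawlerTask.from_message(task.to_message())
```

A text serialization such as JSON keeps the task independent of the language in which a particular crawler is written.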
As described in step S2 above, a crawler service is a service that can complete the corresponding crawl task. One or more preset crawler services are provided in the crawler service module. The services in the crawler service module are usually common services corresponding to crawl requirements, such as a simulated login service, a captcha image recognition service, and an IP proxy pool maintenance service. In a specific embodiment, the crawler service module holds a call list that stores crawl requirements and crawler services in a one-to-one mapping. After the crawl requirement of a crawler task is obtained, the call list is first searched for the identical crawl requirement, the target crawler service is then obtained through the mapping, and finally the target crawler service is called. When a crawler task contains several crawl requirements, the corresponding services are called at the same time. The target crawler service is then used to crawl data from the target website.
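The call list in step S2 amounts to a dictionary from crawl requirement to crawler service. The following sketch assumes that services are plain Python callables; the requirement keys and the two stub services are hypothetical placeholders rather than services named in the source, and the stubs only return descriptive strings.

```python
from typing import Callable, Dict, List


def simulated_login(site: str) -> str:
    # Stub standing in for a simulated login service.
    return f"login session for {site}"


def captcha_recognition(site: str) -> str:
    # Stub standing in for a captcha image recognition service.
    return f"captcha solver for {site}"


# One-to-one mapping from crawl requirement to crawler service.
CALL_LIST: Dict[str, Callable[[str], str]] = {
    "login": simulated_login,
    "captcha": captcha_recognition,
}


def resolve_services(requirements: List[str]) -> List[Callable[[str], str]]:
    """Look up the target crawler service for every crawl requirement."""
    services = []
    for requirement in requirements:
        if requirement not in CALL_LIST:
            raise KeyError(f"no crawler service registered for {requirement!r}")
        services.append(CALL_LIST[requirement])
    return services


outputs = [service("example.com") for service in resolve_services(["login", "captcha"])]
```

Raising on an unknown requirement is a design choice of this sketch; the source does not say how an unmatched requirement is handled.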
As described in step S3 above, the crawled data is stored in the first data storage module. The first storage module is generally a file storage system, which is relatively cheap and saves on storage costs.
In one embodiment, the step of sending the crawler task to the crawler module includes:
S101, the task release module sends the crawler task to the crawler module through a message queue.
As described in step S101 above, a message queue is a container. Sending crawler tasks through a message queue allows fast horizontal and distributed scaling for large-scale crawler tasks and improves the capacity for processing crawler tasks.
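The scaling benefit of step S101 can be illustrated with a producer that puts tasks on a queue and several workers that consume them. This is a single-process sketch built on Python's standard `queue` and `threading` modules; a real deployment would presumably use a message broker shared between machines, which the source does not name. The function name and the `"crawled ..."` placeholder result are hypothetical.

```python
import queue
import threading


def run_crawl_workers(tasks, worker_count=2):
    """Publish tasks to a queue and let several workers consume them."""
    task_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # Sentinel: no more work for this worker.
                task_queue.task_done()
                break
            with results_lock:
                results.append(f"crawled {task}")  # Placeholder for a real crawl.
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:  # The publisher only knows the queue, not the workers.
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


crawled = run_crawl_workers(["site-a", "site-b", "site-c"], worker_count=3)
```

Raising `worker_count` adds consumers without touching the publishing side, which is the property that makes horizontal scaling straightforward.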
In one embodiment, after step S3 of storing the crawled raw crawler data in the preset first storage module, the method includes:
S4, cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
As described in step S4 above, the data cleaning module can apply several cleaning rules, for example removing duplicate data or removing incomplete data; the required data may also be filtered out first and duplicates then removed. The second storage module may be a sub-database set up inside the first storage module, for example a folder in the first storage module. In a specific embodiment, the second storage module is a database independent of the first storage module; it costs more than the first storage module but makes the data easier to manage. Because the volume of raw crawler data is large, the low-cost first storage module is used for it; the volume of first crawler data after cleaning is comparatively small, so it goes into the second storage module, which is easier to manage but more expensive.
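The two cleaning rules named in step S4, dropping incomplete records and dropping duplicates, can be sketched as follows. The record layout (dictionaries with `url` and `title` fields) and the choice of the required fields as the duplicate key are assumptions made for illustration.

```python
def clean_records(raw_records, required_fields):
    """Drop incomplete records, then drop duplicates, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Rule 1: a record missing any required field (or holding an empty value) is incomplete.
        if any(not record.get(name) for name in required_fields):
            continue
        # Rule 2: records with identical required-field values count as duplicates.
        key = tuple(record[name] for name in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/a", "title": "A"},  # duplicate
    {"url": "https://example.com/b", "title": ""},   # incomplete
    {"url": "https://example.com/c", "title": "C"},
]
first_crawler_data = clean_records(raw, ["url", "title"])
```

The cleaned list is what would be written to the second storage module, while the untouched `raw` list corresponds to what stays in the first.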
In one embodiment, the method for crawling data with a distributed crawler also includes:
S5, obtaining, with the log and error handling module, the log data of the other modules in the distributed crawler system architecture, and obtaining the error logs in that log data;
S6, handling the events corresponding to the error logs according to preset rules.
In this embodiment, the method for crawling data with a distributed crawler relies on the distributed crawler system architecture. Executing steps such as cleaning the raw crawler data or crawling data generates corresponding log data. This application collects that log data, uses an existing log analysis method to filter out the error logs in each piece of log data, then finds the corresponding event from each error log and handles it automatically, for example by automatically repeating the step that generated the error log.
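Filtering error logs out of the collected log data, as in step S5, can be as simple as parsing each line and keeping the entries at error level. The `module|LEVEL|message` line format used here is purely hypothetical; the source only refers to an existing log analysis method without describing one.

```python
def extract_error_logs(log_lines):
    """Return the error entries from lines shaped like 'module|LEVEL|message'."""
    errors = []
    for line in log_lines:
        parts = line.split("|", 2)
        # Skip malformed lines and anything that is not an error entry.
        if len(parts) == 3 and parts[1].strip() == "ERROR":
            errors.append({"module": parts[0].strip(), "message": parts[2].strip()})
    return errors


collected = [
    "crawler|INFO|task 17 started",
    "crawler|ERROR|task 17 failed: timeout",
    "cleaning|INFO|42 records cleaned",
    "not a log line",
]
error_logs = extract_error_logs(collected)
```

Each returned entry keeps the originating module, which is what lets a later step map the error log back to the event that caused it.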
In one embodiment, after step S6 of handling the events corresponding to the error logs according to preset rules, the method includes:
S7, generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
As described in step S7 above, the content of an email is generated from the error log, the result of handling the event, and so on according to preset requirements, and the email is then sent to a preset mailbox. The mailbox may be that of a designated developer. There may also be several different mailboxes, each belonging to one developer, so that developers learn of errors in time. Further, read receipts from the mailboxes are received; as soon as one receipt arrives, the emails sent to the other mailboxes, which do not correspond to that receipt, are recalled, so that several developers do not handle the same problem at once after seeing the email.
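Building the report email of step S7 can be sketched with Python's standard `email.message.EmailMessage`. The sender address, recipient address, and wording are hypothetical, and the sketch only constructs the message; actually sending it (for example with `smtplib`) and handling read receipts are left out.

```python
from email.message import EmailMessage


def build_error_report(event, outcome, recipients):
    """Compose an error report email from an event and the result of handling it."""
    message = EmailMessage()
    message["Subject"] = f"Crawler error report: {event}"
    message["From"] = "crawler-alerts@example.com"  # Hypothetical sender address.
    message["To"] = ", ".join(recipients)
    message.set_content(f"Event: {event}\nHandling result: {outcome}\n")
    return message


report = build_error_report(
    "crawl failed", "task republished", ["dev@example.com"]
)
```

Passing several addresses in `recipients` corresponds to the case of multiple preset mailboxes, one per developer.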
In one embodiment, step S6 of handling the events corresponding to the error logs according to preset rules includes:
S71, determining, with the log and error handling module, whether the event is a crawl failure;
S72, if the event is a crawl failure, publishing the crawler task corresponding to the event again.
As described in steps S71 and S72 above, when a crawl fails, the developer can be notified by email in time, the cause of the error is recorded, and the error handling logic puts the crawler task back into the message queue so that it is crawled again. This improves the stability of the process and provides automated operations and maintenance.
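The republish logic of steps S71 and S72 can be sketched as a handler that checks the event type and puts the task back on the queue. The event and task dictionaries are hypothetical, and the `max_retries` cap is an addition of this sketch to keep a permanently failing task from looping forever; the source does not mention a retry limit.

```python
import queue


def handle_event(event, task_queue, max_retries=3):
    """Republish the task of a failed crawl; return True if it was requeued."""
    if event["type"] != "crawl_failed":  # S71: is the event a crawl failure?
        return False
    task = event["task"]
    attempts = task.get("attempts", 0) + 1
    if attempts > max_retries:  # Assumed safeguard, not part of the source.
        return False
    # S72: put the task back on the message queue so it is crawled again.
    task_queue.put({**task, "attempts": attempts})
    return True


retry_queue = queue.Queue()
requeued = handle_event(
    {"type": "crawl_failed", "task": {"site": "https://example.com"}}, retry_queue
)
```

Tracking the attempt count on the task itself keeps the handler stateless, so any worker can apply the same rule.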
In one embodiment, the method for crawling data with a distributed crawler also includes:
S8, determining whether a management command sent by the back-end management module has been received;
S9, if so, processing the management command with priority.
In this embodiment, the back-end management module monitors and manages the whole crawler system through a web-based management interface. Through the back-end management module, a crawler process can be started by uploading scripts and configuration; crawler tasks that hit a performance bottleneck can be observed and the scale of the crawler module expanded in real time; and all crawler tasks in the system can be monitored and their data analyzed.
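One way to give management commands priority over ordinary crawler tasks, as in steps S8 and S9, is a priority queue in which commands carry a lower priority number. This is a sketch under that assumption; the class `WorkQueue`, its method names, and the sample command are hypothetical, and the source does not say how priority is implemented.

```python
import itertools
import queue

ADMIN_PRIORITY = 0  # Lower number is served first.
TASK_PRIORITY = 1


class WorkQueue:
    """Queue that serves management commands before ordinary crawler tasks."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        # The counter preserves arrival order among items of equal priority.
        self._counter = itertools.count()

    def put_task(self, task):
        self._queue.put((TASK_PRIORITY, next(self._counter), ("task", task)))

    def put_admin_command(self, command):
        self._queue.put((ADMIN_PRIORITY, next(self._counter), ("admin", command)))

    def get(self):
        _, _, item = self._queue.get_nowait()
        return item


work = WorkQueue()
work.put_task("crawl site-a")
work.put_admin_command("pause")
work.put_task("crawl site-b")
served = [work.get(), work.get(), work.get()]
```

The command is served first even though it arrived second, while the two crawler tasks keep their original order.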
The method for crawling data with a distributed crawler of the embodiments of this application is based on the distributed crawler system architecture above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing and makes it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
Referring to Fig. 3, an embodiment of this application also provides a computer device, which may be a management server or a server corresponding to a management node, and whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program held in the non-volatile storage medium. The database of the computer device stores data such as that of each module of the distributed crawler system architecture. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements a method for crawling data with a distributed crawler.
The processor executes the method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the embodiment above, including: obtaining a crawler task through the task release module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement; after the crawler module obtains the crawler task, calling, from the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module contains at least one crawler service encapsulated in the form of a service; and storing the crawled raw crawler data in a preset first storage module.
In one embodiment, the step of sending the crawler task to the crawler module includes: the task release module sends the crawler task to the crawler module through a message queue.
In one embodiment, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes: cleaning the raw crawler data in the first storage module with the data cleaning module to obtain the cleaned first crawler data, and storing the first crawler data in a preset second storage module.
In one embodiment, the method for crawling data with a distributed crawler also includes: obtaining, with the log and error handling module, the log data of the other modules in the distributed crawler system architecture, and obtaining the error logs in that log data; and handling the events corresponding to the error logs according to preset rules.
In one embodiment, after the step of handling the events corresponding to the error logs according to preset rules, the method includes: generating an error report for the corresponding event with the log and error handling module, and sending the error report to a preset mailbox.
In one embodiment, the step of handling the events corresponding to the error logs according to preset rules includes: determining, with the log and error handling module, whether the event is a crawl failure; and, if the event is a crawl failure, publishing the crawler task corresponding to the event again.
In one embodiment, the method for crawling data with a distributed crawler also includes: determining whether a management command sent by the back-end management module has been received; and, if so, processing the management command with priority.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied.
The computer device of the embodiments of this application is based on the distributed crawler system architecture above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access each other through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing and makes it easy to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill requirements. The architecture design improves the stability and scalability of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes operation and management of the whole system more reliable and efficient.
An embodiment of the application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the above method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiments. The method includes: obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement; after the crawler module obtains the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module contains at least one crawler service encapsulated in service form; and storing the crawled original crawler data in a preset first storage module.
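By way of illustration only, the flow described above (a task carrying a target website and a crawl requirement, a lookup of the matching crawler service, and storage of the original data) can be sketched in Python as follows. All class and function names here (`CrawlerTask`, `CrawlerServiceModule`, `CrawlerModule`, `fake_list_service`) are hypothetical and do not appear in the application; the first storage module is stood in for by an in-memory list, and the crawler service performs no real network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CrawlerTask:
    target_site: str   # the target website
    requirement: str   # the crawl requirement, used to pick a crawler service


class CrawlerServiceModule:
    """Stores crawler services, each encapsulated behind a plain callable."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[str], str]] = {}

    def register(self, requirement: str, service: Callable[[str], str]) -> None:
        self._services[requirement] = service

    def get(self, requirement: str) -> Callable[[str], str]:
        return self._services[requirement]


class CrawlerModule:
    def __init__(self, services: CrawlerServiceModule, first_storage: List[dict]) -> None:
        self._services = services
        self._first_storage = first_storage

    def handle(self, task: CrawlerTask) -> None:
        # Call the target crawler service that matches the crawl requirement.
        service = self._services.get(task.requirement)
        raw = service(task.target_site)
        # Store the original crawler data in the first storage module.
        self._first_storage.append({"site": task.target_site, "raw": raw})


def fake_list_service(site: str) -> str:
    # Hypothetical stand-in for a real crawler service; no network access.
    return f"<html>listing from {site}</html>"


services = CrawlerServiceModule()
services.register("list_pages", fake_list_service)
first_storage: List[dict] = []
crawler = CrawlerModule(services, first_storage)
crawler.handle(CrawlerTask("https://example.com", "list_pages"))
```

Because the crawler module only depends on the lookup interface, a service written in another language could sit behind the same callable, for example as a thin wrapper around an HTTP call; that wrapper is likewise an assumption of this sketch.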
The above method for crawling data with a distributed crawler is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: the different modules are isolated from one another and access one another by means of message queues. This design reduces the coupling between system modules, and the asynchronous message processing of the message queue improves the system's capacity for parallel data processing and makes it easier to scale the system horizontally when more processing capacity is needed. A crawler service module is provided to store crawler services; it encapsulates the low-level requirements of the whole crawler system as modular services, which reduces the workload of developers, places no restriction on the development language, and lowers the skill required. The architecture design improves the stability and extensibility of the crawler system, making it suitable for developing large-scale, multi-task crawler systems. The visual back-end management module makes the operation and maintenance of the whole system more reliable and efficient.
In one embodiment, the step of sending the crawler task to the crawler module includes: the task release module sending the crawler task to the crawler module in the form of a message queue.
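As a minimal sketch of this embodiment, the following Python example uses the standard-library `queue.Queue` as a stand-in for the message queue; the application does not name a particular message broker, so this choice, and the names `publish_task` and `crawler_module_worker`, are assumptions made only for illustration.

```python
import queue
import threading

task_queue: "queue.Queue" = queue.Queue()
received = []


def publish_task(target_site: str, requirement: str) -> None:
    # Task release module: put the crawler task on the message queue.
    task_queue.put({"target_site": target_site, "requirement": requirement})


def crawler_module_worker() -> None:
    # Crawler module: consume tasks asynchronously until a sentinel arrives.
    while True:
        task = task_queue.get()
        if task is None:  # sentinel used by this sketch to stop the worker
            break
        received.append(task)


worker = threading.Thread(target=crawler_module_worker)
worker.start()
publish_task("https://example.com", "list_pages")
publish_task("https://example.org", "detail_pages")
task_queue.put(None)
worker.join()
```

The publisher returns as soon as the task is queued, which is the asynchronous decoupling the description attributes to the message queue; adding more worker threads or processes on the consuming side is one way to scale horizontally.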
In one embodiment, after the step of storing the crawled original crawler data in the preset first storage module, the method includes: cleaning the original crawler data in the first storage module using a data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
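The application does not specify the cleaning rules. Purely as an assumed example, the sketch below normalizes whitespace, discards empty records, and removes duplicates before writing to the second storage module; the function name `clean_records` and the record fields are hypothetical, and both storage modules are represented by in-memory lists.

```python
from typing import Dict, List


def clean_records(raw_records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return cleaned records: whitespace normalized, empties and duplicates dropped."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(record.get("raw", "").split())
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append({"site": record["site"], "text": text})
    return cleaned


# First storage module: original crawler data (sample values for illustration).
first_storage = [
    {"site": "https://example.com", "raw": "  Hello   world \n"},
    {"site": "https://example.com", "raw": "Hello world"},
    {"site": "https://example.com", "raw": "   "},
]

# Second storage module: cleaned first crawler data.
second_storage = clean_records(first_storage)
```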
In one embodiment, the method for crawling data with a distributed crawler further includes: using the log and error handling module to obtain the log data of the other modules in the distributed crawler system architecture, and to obtain the error logs in the log data; and handling the event corresponding to each error log according to preset rules.
In one embodiment, after the step of handling the event corresponding to the error log according to preset rules, the method includes: generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
In one embodiment, the step of handling the event corresponding to the error log according to preset rules includes: using the log and error handling module to determine whether the event is a crawler failure; if the event is a crawler failure, republishing the crawler task corresponding to the event.
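The log and error handling embodiments above can be sketched together as follows. This is an illustrative assumption rather than the application's implementation: the `LogEntry` fields, the `"crawl_failed"` event name, and the `handle_logs` function are hypothetical, and the error report is only assembled as text here instead of being sent to a mailbox.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LogEntry:
    module: str                    # module that produced the log line
    level: str                     # e.g. "INFO" or "ERROR"
    event: str                     # short event name
    task_id: Optional[str] = None  # crawler task the event belongs to, if any


def handle_logs(entries: List[LogEntry], republish: Callable[[str], None]) -> str:
    """Pick out error logs, republish failed crawler tasks, and build a report."""
    report_lines = []
    for entry in entries:
        if entry.level != "ERROR":
            continue  # only error logs are handled
        if entry.event == "crawl_failed" and entry.task_id is not None:
            # Preset rule assumed here: a failed crawl is published again.
            republish(entry.task_id)
            action = "task republished"
        else:
            action = "recorded only"
        report_lines.append(f"{entry.module}: {entry.event} ({action})")
    return "\n".join(report_lines)


republished: List[str] = []
logs = [
    LogEntry("crawler", "INFO", "crawl_started", "t1"),
    LogEntry("crawler", "ERROR", "crawl_failed", "t1"),
    LogEntry("cleaning", "ERROR", "parse_error"),
]
report = handle_logs(logs, republished.append)
```

In a deployment, `republish` would hand the task back to the task release module, and `report` would be passed to whatever mail mechanism delivers it to the preset mailbox; both of those integrations are outside this sketch.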
In one embodiment, the method for crawling data with a distributed crawler further includes: determining whether a management command sent by the back-end management module has been received; if so, processing the management command with priority.
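One possible way to give management commands priority, shown here only as an assumed sketch, is to place commands and crawler tasks in a single priority queue in which commands carry the lower (more urgent) priority value. The constants `MANAGEMENT` and `TASK`, the `submit` helper, and the sample payloads are hypothetical.

```python
import itertools
import queue

MANAGEMENT, TASK = 0, 1       # lower value is served first
_sequence = itertools.count()  # tie-breaker that preserves arrival order
inbox: "queue.PriorityQueue" = queue.PriorityQueue()


def submit(kind: int, payload: str) -> None:
    # The sequence number keeps ordering stable and avoids comparing payloads.
    inbox.put((kind, next(_sequence), payload))


submit(TASK, "crawl https://example.com")
submit(TASK, "crawl https://example.org")
submit(MANAGEMENT, "pause crawler")

processed = []
while not inbox.empty():
    kind, _, payload = inbox.get()
    processed.append(payload)
```

Although the management command arrives last, it is taken from the queue first, while the two crawler tasks keep their original relative order.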
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The foregoing describes only preferred embodiments of the application and does not limit the patent scope of the application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the application.
Claims (10)
1. A distributed crawler system architecture, characterized in that the architecture is designed using HTTP service registration, the different modules are isolated from one another, and the different modules access one another by means of message queues, the architecture comprising:
a task release module, for publishing crawler tasks;
a crawler service module, for storing different crawler services that exist in service form, the different crawler services completing different crawler tasks;
a crawler module, for receiving the crawler task published by the task release module and, according to the crawler task, calling the crawler service corresponding to the crawler task from the crawler service module, and using the crawler service to crawl a target website to obtain corresponding original crawler data;
a first data storage module, for storing the original crawler data.
2. A method for crawling data with a distributed crawler, based on the distributed crawler system architecture according to claim 1, characterized by comprising:
obtaining a crawler task using the task release module, and sending the crawler task to the crawler module, the crawler task including a target website and a crawl requirement;
after the crawler module obtains the crawler task, calling from the crawler service module the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl original crawler data from the target website, wherein the crawler service module contains at least one crawler service encapsulated in service form;
storing the crawled original crawler data in a preset first storage module.
3. The method for crawling data with a distributed crawler according to claim 2, characterized in that the step of sending the crawler task to the crawler module comprises:
the task release module sending the crawler task to the crawler module in the form of a message queue.
4. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a data cleaning module and a second storage module, and after the step of storing the crawled original crawler data in the preset first storage module, the method comprises:
cleaning the original crawler data in the first storage module using the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in the preset second storage module.
5. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a log and error handling module, and the method further comprises:
using the log and error handling module to obtain the log data of the other modules in the distributed crawler system architecture, and to obtain the error logs in the log data;
handling the event corresponding to each error log according to preset rules.
6. The method for crawling data with a distributed crawler according to claim 5, characterized in that, after the step of handling the event corresponding to the error log according to preset rules, the method comprises:
generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
7. The method for crawling data with a distributed crawler according to claim 5, characterized in that the step of handling the event corresponding to the error log according to preset rules comprises:
using the log and error handling module to determine whether the event is a crawler failure;
if the event is a crawler failure, republishing the crawler task corresponding to the event.
8. The method for crawling data with a distributed crawler according to claim 2, characterized in that the distributed crawler system architecture further includes a back-end management module, and the method further comprises:
determining whether a management command sent by the back-end management module has been received;
if so, processing the management command with priority.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 2 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601110.6A CN110457556B (en) | 2019-07-04 | 2019-07-04 | Distributed crawler system architecture, method for crawling data and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110457556A | 2019-11-15 |
CN110457556B | 2023-11-14 |
Family
ID=68482277
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201910601110.6A | Active | CN110457556B (en) | 2019-07-04 | 2019-07-04 | Distributed crawler system architecture, method for crawling data and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110457556B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929127A (en) * | 2019-12-05 | 2020-03-27 | 广州市原象信息科技有限公司 | Method for analyzing Taobao live broadcast putting effect and computer equipment |
CN110929128A (en) * | 2019-12-11 | 2020-03-27 | 北京启迪区块链科技发展有限公司 | Data crawling method, device, equipment and medium |
CN111143336A (en) * | 2019-11-27 | 2020-05-12 | 三盟科技股份有限公司 | College scientific research data management-oriented web crawler management method and platform |
CN111192155A (en) * | 2019-12-25 | 2020-05-22 | 杭州龙席网络科技股份有限公司 | Social media inquiry plate identification and recommendation method based on SAAS |
CN111241366A (en) * | 2019-12-25 | 2020-06-05 | 杭州龙席网络科技股份有限公司 | Client social media monitoring method based on SAAS |
CN111241373A (en) * | 2020-02-20 | 2020-06-05 | 山东爱城市网信息技术有限公司 | Webpage crawler system based on micro-service and implementation method |
CN111708931A (en) * | 2020-06-06 | 2020-09-25 | 谢国柱 | Big data acquisition method based on mobile internet and artificial intelligence cloud service platform |
CN112597367A (en) * | 2020-11-30 | 2021-04-02 | 国网北京市电力公司 | Data information fusion system and target decision generation method |
CN112650908A (en) * | 2020-12-25 | 2021-04-13 | 百果园技术(新加坡)有限公司 | Data processing method, system and device based on network theme crawler |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100269168A1 (en) * | 2009-04-21 | 2010-10-21 | Brightcloud Inc. | System And Method For Developing A Risk Profile For An Internet Service |
US20110208848A1 (en) * | 2008-08-05 | 2011-08-25 | Zhiyong Feng | Network system of web services based on semantics and relationships |
CN102932448A (en) * | 2012-10-30 | 2013-02-13 | 工业和信息化部电信传输研究所 | Distributed network crawler URL (uniform resource locator) duplicate removal system and method |
CN105243159A (en) * | 2015-10-28 | 2016-01-13 | 福建亿榕信息技术有限公司 | Visual script editor-based distributed web crawler system |
CN105447088A (en) * | 2015-11-06 | 2016-03-30 | 杭州掘数科技有限公司 | Volunteer computing based multi-tenant professional cloud crawler |
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN106484886A (en) * | 2016-10-17 | 2017-03-08 | 金蝶软件(中国)有限公司 | Data acquisition method and related device |
CN106874487A (en) * | 2017-02-21 | 2017-06-20 | 国信优易数据有限公司 | Distributed crawler management system and method |
CN107135092A (en) * | 2017-03-15 | 2017-09-05 | 浙江工业大学 | Web service clustering method for a global social service network |
CN107943991A (en) * | 2017-12-01 | 2018-04-20 | 成都嗨翻屋文化传播有限公司 | Distributed crawler framework and implementation method based on an in-memory database |
CN108170551A (en) * | 2018-01-03 | 2018-06-15 | 深圳壹账通智能科技有限公司 | Front and back end error handling method, server and storage medium based on crawler system |
CN109492149A (en) * | 2018-11-29 | 2019-03-19 | 深圳墨世科技有限公司 | Crawler task processing method and device |
CN109508422A (en) * | 2018-12-05 | 2019-03-22 | 南京邮电大学 | Highly concealed crawler system with multithreaded intelligent scheduling |
CN109815384A (en) * | 2019-01-29 | 2019-05-28 | 携程旅游信息技术(上海)有限公司 | Method, system, equipment and the storage medium that crawler is realized |
Non-Patent Citations (1)
Title |
---|
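Dong Yulong et al.: "Research on a Distributed Web Crawler Cluster Method Based on Active Acquisition" (主动获取式的分布式网络爬虫集群方法研究), Computer Science (《计算机科学》) * |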
Also Published As
Publication number | Publication date |
---|---|
CN110457556B (en) | 2023-11-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20240326. Patentee after: Guangzhou Zhongtian Technology Consulting Co.,Ltd. Address after: Room 101-1, Building 2, No. 95, Daguan Middle Road, Tianhe District, Guangzhou, Guangdong 510000 (office only). Country or region after: China. Patentee before: CHONGQING FINANCIAL ASSETS EXCHANGE Co.,Ltd. Address before: 400010 38 / F, 39 / F, unit 1, 99 Wuyi Road, Yuzhong District, Chongqing. Country or region before: China |