CN105243159B

CN105243159B - Distributed web crawler system based on a visual script editor

Info

Publication number: CN105243159B
Application number: CN201510713985.7A
Authority: CN
Inventors: 倪时龙; 苏江文; 王秋琳; 陈予言
Original assignee: Fujian Yirong Information Technology Co Ltd
Current assignee: Fujian Yirong Information Technology Co Ltd
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2019-06-25
Anticipated expiration: 2035-10-28
Also published as: CN105243159A

Abstract

The present invention provides a distributed web crawler system based on a visual script editor, comprising: a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module, and a result storage module. Based on user input through a visual interface, the system automatically generates metadata extraction scripts that can identify the structure of a target site and efficiently capture specific data. The task scheduling module creates and dispatches tasks, the web page capture module fetches pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and the result storage module finally processes the results uniformly and stores them. The invention can greatly improve crawling efficiency for data on specific sites, reduce the user's workload, and save system resources; it also offers good extensibility and scalability and is suitable for all types of Internet sites.

Description

A distributed web crawler system based on a visual script editor

Technical field

The present invention relates to the technical field of network communication, and in particular to a distributed web crawler system based on a visual script editor.

Background technique

Since the birth of the Internet at the end of the 20th century, Internet information has grown explosively and has become a huge, widely distributed, highly heterogeneous, semi-structured, and highly dynamic information repository. Web crawlers were born to collect and extract the data people are interested in from this information. Crawler technology has developed rapidly ever since, and on this foundation search engine giants at home and abroad, such as Baidu and Google, emerged and opened a window onto information for the general public.

Today, Internet information is provided mainly by websites and web services. A website consists of a wide variety of web pages, and the data it provides is mostly presented as unstructured, static Hypertext Markup Language (HTML). Because information analysis systems cannot use HTML directly, the HTML generally has to undergo secondary processing before useful information can be extracted. A web service, by contrast, is a relatively standardized data interface from which data can be obtained by calling it with specific parameters; a web service can exist on its own or be combined with a website. How to obtain specific information efficiently and accurately from a large number of specific websites or web services is attracting more and more attention, and it poses a huge challenge to web crawler technology, which is responsible for collecting network information.

Web crawlers have gone through several generations of development, and a number of system models have largely taken shape. Mature crawler designs exist at home and abroad and are already in use, but most of these solutions only offer a general-purpose service to the public. They cannot be tailored to specific data on specific sites, nor can they take each user's varied needs into account.

In the Internet field, the mainstream crawler designs at present are as follows:

1. Traditional crawler systems

A traditional crawler system requires professional software programmers to analyze the page organization of the target site, its data interfaces, and the Javascript logic on its pages, and then to write corresponding program code or scripts that filter out specific data according to certain rules. The advantage of this approach is obvious: it can accurately extract the required data from the target site.

However, this approach has serious drawbacks and is generally adopted only when the number of target sites is very limited. The reason is that there is no fixed standard for how Internet sites write their HTML, so a corresponding script has to be written for every target site; in addition, more and more sites now load content dynamically, which makes the scripts much harder to write. When a monitored site is redesigned, the script must be adjusted promptly and the crawler redeployed. This greatly increases the labor cost of development and maintenance. Moreover, the complexity of this approach results in poor extensibility and scalability, which makes it ill-suited to large-scale distributed deployment.

2. General-purpose distributed crawler systems

A general-purpose distributed crawler system mainly consists of three basic parts: scheduling (control), crawling, and content processing. Most current Internet search engines follow this model. For example, the prior art discloses "a topic-related distributed web crawler system"; see the Chinese patent with publication number CN102646129A, published on 2012-08-22. That system includes: a topic link store, which stores hyperlinks the system has not yet finished crawling; a control node, which takes hyperlinks from the topic link store, removes those the system has already crawled, distributes the hyperlinks not yet crawled to the crawling nodes, and controls whether to terminate the system; crawling nodes, which receive the hyperlinks distributed by the control node, download the web pages the hyperlinks identify, and store the pages in a web page database; the web page database, which stores the pages fetched by the crawling nodes; and a page analyzer, which periodically reads the pages most recently downloaded by the crawling nodes from the web page database, analyzes their content, computes the topic relevance of each page and of the hyperlinks it contains, stores the relevant hyperlinks in the topic link store according to that relevance, and stores the topic relevance of each page in the web page database. That invention uses exactly this model. Such crawler systems focus mainly on URL filtering and on analyzing page topics, and their content processing part essentially relies on a text analysis and extraction module.

Text analysis modules can be roughly divided into: (1) text extraction algorithms based on tag usage; (2) text extraction based on tag-density judgment; (3) web page text extraction methods based on machine learning; and (4) text extraction based on visual web page block analysis. Whichever algorithm is used, however, it can only extract backbone data such as the main text of a page, and it cannot guarantee the accuracy of the extracted data. These methods work reasonably well in distributed crawler systems, but they are limited by the algorithms they rely on: they suit only broad, horizontal crawling of loosely specified data and are inherently deficient when crawling specific data. To achieve maximum generality they sacrifice customizability; they can only extract text from a page and cannot isolate particular types of metadata from that text, for example the commodity price on an e-commerce page or the drug specification on an online pharmacy page. In addition, most text analysis algorithms are relatively complex; used at scale, they consume more system resources than customized scripts and degrade the performance of the crawler system.

Summary of the invention

The technical problem to be solved by the present invention is to provide a distributed web crawler system based on a visual script editor that remains compatible with general-purpose site crawling while enabling efficient, customized crawling of a large number of specific sites, thereby overcoming the defects of the prior art, reducing the user's workload, and saving system resources.

The present invention is implemented as follows: a distributed web crawler system based on a visual script editor, the system comprising: a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module, and a result storage module;

The visual script editor is used to view the target website and to select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database; the script is the script corresponding to the target website;

The distributed message queue is used to decouple the task scheduling module, the web page capture module, the content processing module, and the result storage module; the distributed message queue includes a scheduling queue, a crawl queue, a processing queue, and a result queue;

The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL of the target website and the user's input information, packages them into a task, and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue;

The web page capture module is used to obtain a URL from the crawl queue, automatically detect the website's encoding, convert the captured content to UTF-8, package the UTF-8 content together with website-related information, and forward it to the processing queue;

The content processing module is used to obtain the web page content of a website from the processing queue and to match the URL of the page against the URL matching rules generated by the visual script editor; if a match is found, it invokes the script corresponding to that URL matching rule to parse the page content, and passes the parsed result to the result queue;

The result storage module is used to take result data out of the result queue, process and filter it uniformly according to the system's predefined configuration, and then store it in the database.

Further, the system also includes a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, crawl queue, processing queue, and result queue) has failed; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether the script needs to be entered again.

Further, the system also includes a text extraction module; when the domain name of a web page matches no script in the database, the text extraction module is invoked to extract the script corresponding to the web page, and it does so using a text extraction method based on visual web page block analysis.

Further, when the script is invoked for parsing: if the parsing produces new URLs, the new URLs are passed to the scheduling queue and the task scheduling module runs again; if the parsing produces result data, the parsed result data is passed to the result queue.

Further, the task scheduling module includes a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URLs and prevent the same URL from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which keeps the system stable.
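The URL deduplication described above can be illustrated with a minimal Bloom filter sketch. It is not taken from the patent: the class name, the default sizes, and the use of salted SHA-256 digests in place of "random mapping functions" are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit vector plus k hash functions (assumed design)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # the "binary vector"

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        # Assumption: any family of well-mixed hash functions would do here.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen" (false positives are possible);
        # False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A scheduler built this way would call `add(url)` when a URL is dispatched and test `url in seen` before dispatching another. An occasional false positive only means a URL is skipped, and deletions are rarely needed, which may be why the structure fits crawling well.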

Further, the web page capture module includes a proxy access module and a browser simulation module. The proxy access module accesses the specified URL through a preset IP proxy according to the user configuration information, which prevents the IP of the server hosting the web page capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the Javascript code on a page and generate the complete page of the target website.

Further, the execution chain includes several sub-parameters, and each sub-parameter can be one of several kinds: a lower-level URL selection rule, a metadata selection identifier, or script code that the system can execute.

Further, the visual script editor is implemented by the following flow:

Step 1: enter the URL of the target website in the visual script editor interface;

Step 2: the visual script editor displays the content of the page at the target website URL;

Step 3: if there is no need to enter a lower-level URL of this page, go to step 5; if a lower-level URL needs to be entered, go to step 4;

Step 4: select the blocks containing the lower-level URLs; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;

Step 5: select the blocks whose content needs to be captured and store them in an execution chain;

Step 6: the user confirms that editing is complete;

Step 7: the visual script editor passes the recorded execution chain to the script generator, which produces the capture script for the target website; for advanced users, an additional interface is provided through which the user can write system-compatible code and embed it directly in the capture script;

Step 8: store the script in the database.
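The flow above can be sketched as a small "script generator" that turns a recorded execution chain into a callable. The patent does not specify the chain's data format, so the dictionary keys (`kind`, `xpath`, `field`) and the function name are hypothetical. The sketch also assumes well-formed XHTML input so that the standard library's limited XPath support is enough; a production system would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET


def generate_script(chain):
    """Turn an execution chain (a list of recorded steps) into a page-parsing callable."""

    def script(page_source):
        root = ET.fromstring(page_source)  # assumption: the page is well-formed XHTML
        output = {"urls": [], "data": {}}
        for step in chain:
            nodes = root.findall(step["xpath"])
            if step["kind"] == "follow_links":
                # Step 4 of the flow: blocks that contain lower-level URLs
                output["urls"] += [n.get("href") for n in nodes if n.get("href")]
            elif step["kind"] == "extract":
                # Step 5 of the flow: blocks whose content should be captured
                output["data"][step["field"]] = [(n.text or "").strip() for n in nodes]
        return output

    return script
```

For example, a chain with one `follow_links` step selecting `.//div[@class='list']/a` and one `extract` step selecting `.//span[@class='price']` yields both the lower-level URLs to schedule and a `price` field for the result queue.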

The present invention has the following advantages: the system's visual script editor lets non-professional users intuitively select the capture regions for the relevant data on a target site, and automatically converts the user's operations into concrete processing scripts. The distributed processing units running in the crawler system dynamically give priority to executing these processing scripts, which greatly reduces the labor cost of customizing a crawler and improves the operating efficiency of the crawler system. The system captures data with high accuracy and has high extensibility and scalability.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the system of the present invention.

Fig. 2 is a workflow diagram of the system of the present invention.

Fig. 3 is a schematic diagram of the execution structure of the visual script editor of the present invention.

Fig. 4 is a schematic workflow diagram of the visual script editor of the present invention.

Fig. 5 is a flow chart of how the execution chain of the present invention operates.

Fig. 6 is a workflow diagram of the content processing module and scripts of the present invention.

Fig. 7 is a structural schematic diagram of one embodiment of the system of the present invention.

Specific embodiments

Referring to Fig. 1 to Fig. 7, a distributed web crawler system based on a visual script editor according to the present invention comprises: a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module, and a result storage module;

The visual script editor is used to view the content of the target website visually and to select the data capture regions of the target website. It converts the user's input (the input produced by all user operations, from entering the target site URL until editing is finally completed) and other optional parameters (for example, whether to use text extraction or whether to simulate a browser) into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database. The visual script editor lets users view the target website as if browsing the web normally, with no programming skills required. The script is the script corresponding to the target website;

The configuration management module provides a web interface in which the user can configure the websites to be crawled and, for one website or a series of websites, configure the scheduling policy (for example: priority, periodic crawling, re-crawl interval), the crawl policy (retry on error, enable proxy, enable browser simulation, and so on), and other configuration parameters, which together form the user configuration information.

The distributed message queue is used to decouple the task scheduling module, the web page capture module, the content processing module, and the result storage module, enabling highly distributed deployment. The distributed message queue includes a scheduling queue, a crawl queue, a processing queue, and a result queue;

The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL of the target website (the target website being the website to be processed and evaluated) and the user's input information, packages them into a task, and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue. The task scheduling module includes a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URLs and prevent the same URL from being crawled repeatedly. A Bloom filter is essentially a very long binary vector plus a series of random mapping functions, and is used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements. Using a Bloom filter greatly improves system efficiency, and its disadvantages have no effect on a crawler system, so it is well suited to crawler use. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which keeps the system stable.
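The rate manager's token bucket can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the class name, the `rate` and `capacity` parameters, and the injectable `clock` (added only to make the behavior easy to check) are hypothetical.

```python
import time


class TokenBucket:
    """Illustrative token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n=1):
        """Return True and consume n tokens if available; otherwise return False."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A scheduler instance would call `try_acquire()` before moving a task to the crawl queue and delay the task when it returns `False`, so outgoing requests settle at roughly `rate` per second while short bursts up to `capacity` are still allowed.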

The web page capture module is used to obtain a URL from the crawl queue, automatically detect the website's encoding, convert the captured content to UTF-8, package the UTF-8 content together with website-related information, and forward it to the processing queue. The web page capture module includes a proxy access module and a browser simulation module. With the development of network technology, more and more websites now use dynamic page techniques and rely on large amounts of Javascript to generate page content; conventional page capture can only obtain the page's source code and cannot execute Javascript, so it cannot obtain the complete page of the target site, which multiplies the difficulty of data extraction. The proxy access module of the present invention accesses the specified URL through a preset IP proxy according to the user configuration information, which prevents the IP of the server hosting the web page capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the Javascript code on a page and generate the complete page of the target website.
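The encoding step can be illustrated with a small sketch. The patent does not say how the site encoding is detected; the approach below (charset from the HTTP `Content-Type` header, then from a `<meta>` tag, then a short list of fallback encodings) is an assumption, and the function name `to_utf8` is hypothetical.

```python
import re

# Matches e.g. <meta charset="gbk"> or <meta http-equiv=... content="text/html; charset=gbk">
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def to_utf8(raw: bytes, content_type: str = "") -> bytes:
    """Guess the page encoding and return the content re-encoded as UTF-8."""
    candidates = []
    header = re.search(r"charset=([A-Za-z0-9_\-]+)", content_type, re.IGNORECASE)
    if header:
        candidates.append(header.group(1))
    meta = META_CHARSET.search(raw[:4096])  # only inspect the head of the document
    if meta:
        candidates.append(meta.group(1).decode("ascii"))
    candidates += ["utf-8", "gb18030"]      # assumed fallbacks, not from the patent
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            continue                        # unknown codec name or wrong guess
    # Last resort: keep what can be decoded and replace the rest.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Normalizing to UTF-8 at this stage means every downstream script can treat page content as a single encoding, regardless of what each target site serves.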

The content processing module is used to obtain the web page content of a website from the processing queue. If the URL of the page matches a predefined URL matching rule (the visual editor intelligently generates a URL matching rule from the target site URL that the user entered in the visual script editor beforehand and from the conditions the user set), it invokes the script that matches this URL to parse the page content and passes the parsed result to the result queue. The result storage module is used to take result data out of the result queue, process and filter it uniformly according to the system's predefined configuration, and then store it in the database.
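The matching step can be sketched as a small registry that maps URL matching rules to scripts. The patent does not define the rule format, so representing a rule as a regular expression is an assumption, and `ScriptRegistry`, `register`, and `process` are hypothetical names. The `fallback` argument stands in for the generic text extraction path used when no script matches.

```python
import re


class ScriptRegistry:
    """Illustrative lookup from URL matching rules to extraction scripts."""

    def __init__(self):
        self._rules = []  # list of (compiled pattern, script callable)

    def register(self, url_pattern, script):
        # Assumption: a URL matching rule is expressed as a regular expression.
        self._rules.append((re.compile(url_pattern), script))

    def process(self, url, html, fallback):
        """Run the first script whose rule matches the URL, else the fallback."""
        for pattern, script in self._rules:
            if pattern.search(url):
                return script(html)
        return fallback(html)
```

Keeping the rules in a lookup structure rather than in code is what would let newly generated scripts take effect without redeploying the content processing instances.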

The system also includes a monitoring module and a text extraction module. The monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, crawl queue, processing queue, and result queue) has failed; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether the script needs to be entered again.

When the domain name of a web page matches no script in the database, the text extraction module is invoked to extract the script corresponding to the web page, and it does so using a text extraction method based on visual web page block analysis.

In the present invention, when the script is invoked for parsing: if the parsing produces new URLs, the new URLs are passed to the scheduling queue and the task scheduling module runs again; if the parsing produces result data, the parsed result data is passed to the result queue.

The execution chain includes several sub-parameters, and each sub-parameter can be one of several kinds: a lower-level URL selection rule, a metadata selection identifier (in a format such as a CSS or XPath selector), or script code that the system can execute.

As shown in Figs. 3, 4, and 5, the visual script editor is implemented by the following flow:

Step 4: select the blocks containing the lower-level URLs; the visual script editor records the positions of these blocks and stores them in an execution chain (see Fig. 5 for the specific operation of recording block positions and storing them in an execution chain), with all position information expressed in CSS or XPath syntax; return to step 3;

Step 6: the user confirms that editing is complete;

Step 8: store the script in the database.

Fig. 2 shows the workflow of the system of the present invention, which is as follows:

(1) The task scheduling module accesses the configuration management module, reads the start URL and the user configuration information, packages them into a task, and passes the task to the scheduling queue.

(2) The task scheduling module obtains a task object from the scheduling queue and queries the URL filtering module. If the URL of this task has not been visited, the task is sent directly to the crawl queue. If it has been visited, the parameters set by the user (such as the revisit time) are checked: if a revisit is allowed, the task is also sent to the crawl queue; otherwise the task is discarded. In this way only deduplicated tasks reach the crawl queue.

(3) The web page capture module obtains a URL from the crawl queue, performs the capture, automatically detects the website's encoding, converts the captured content to general-purpose UTF-8, packages it with website-related information, and forwards it to the processing queue.

(4) The content processing module obtains web page content from the processing queue. If information such as the domain name of the page matches a script predefined by the user (that is, a script in the database), that script is invoked to parse the page. If the processing produces new URLs, these URLs are passed to the scheduling queue and re-enter step (2); if it produces result data, the data is passed to the result queue.

(5) The result storage module takes results out of the result queue, performs the final uniform processing according to the predefined configuration, and then stores them in the database.

(6) Steps (2) to (5) are repeated until a system stop command is received.
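Two decision points in this workflow, the revisit check in step (2) and the routing of script output in step (4), can be sketched as below. The function names, the `revisit_after` parameter (seconds, or `None` when revisits are not allowed), and the use of plain lists in place of message queues are all illustrative assumptions.

```python
import time


def should_crawl(url, last_visit, revisit_after, now=None):
    """Step (2): crawl unseen URLs; re-crawl seen ones only after the revisit interval."""
    now = time.time() if now is None else now
    if url not in last_visit:
        return True                      # never visited: send to the crawl queue
    if revisit_after is None:
        return False                     # revisits not allowed: discard the task
    return now - last_visit[url] >= revisit_after


def route_output(output, scheduling_queue, result_queue):
    """Step (4): new URLs re-enter scheduling; result data goes to the result queue."""
    for url in output.get("urls", []):
        scheduling_queue.append({"url": url})
    if "data" in output:
        result_queue.append(output["data"])
```

Here `last_visit` maps each URL to the timestamp of its last crawl. An exact map is used only for clarity; a Bloom filter could answer the "seen before" question far more compactly, but it would not by itself record when a URL was last visited.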

Fig. 7 is a structural schematic diagram of one embodiment of the system of the present invention. Each module of the invention can be deployed as multiple instances on a single machine, as a single instance on each of multiple machines, or as multiple instances on multiple machines. In other words, the system of the present invention supports distributed deployment.

In addition, every data object that this system sends to a message queue is called a task object. A task object contains: 1. content (a URL, web page content, result data, and so on, depending on the message queue); 2. configuration parameters; 3. a status flag;

In practice, a task object is always taken out of the message queue first, and the relevant information is then read from the task object.
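The three-part task object described above might be modeled as follows. The patent names only the three parts, so the field names, the status values, and the `advance` helper are hypothetical.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Any, Dict


class Status(Enum):
    # Assumed status values; the patent only says a status flag exists.
    SCHEDULED = "scheduled"
    FETCHED = "fetched"
    PARSED = "parsed"
    FAILED = "failed"


@dataclass(frozen=True)
class TaskObject:
    content: Any                                            # URL, page content, or result data
    config: Dict[str, Any] = field(default_factory=dict)    # configuration parameters
    status: Status = Status.SCHEDULED                       # status flag


def advance(task: TaskObject, content: Any, status: Status) -> TaskObject:
    """Build the task object for the next queue, carrying the configuration along."""
    return replace(task, content=content, status=status)
```

Carrying the configuration inside every task object means each module instance can stay stateless: whatever it needs to know about proxies, retries, or browser simulation arrives with the message itself.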

It should be noted that the task scheduling module, web page capture module, content processing module, and result storage module of the present invention can each run multiple instances on multiple servers. Because they are decoupled through the message queue, instances of any type can be stopped or added at any time. This design maximizes the extensibility and scalability of the system.
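The effect of this decoupling can be demonstrated in a single process, with Python's standard `queue.Queue` and threads standing in for a distributed message broker and separate servers. That substitution, along with the function names `run_stage` and `start_instances`, is an assumption made purely for illustration.

```python
import queue
import threading


def run_stage(inbox, outbox, handler, stop):
    """One module instance: consume from inbox, apply handler, publish to outbox."""
    while not stop.is_set():
        try:
            task = inbox.get(timeout=0.05)
        except queue.Empty:
            continue                      # nothing to do yet; check the stop flag again
        outbox.put(handler(task))
        inbox.task_done()


def start_instances(n, inbox, outbox, handler):
    """Start n identical instances of one module; returns the stop flag and threads."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=run_stage, args=(inbox, outbox, handler, stop), daemon=True)
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    return stop, threads
```

Because each instance only reads from one queue and writes to another, changing `n` for one stage requires no change to the other stages, which is the property the paragraph above relies on.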

In short, based on user input through a visual interface, the present invention automatically generates metadata extraction scripts that can identify the structure of a target site and efficiently capture specific data. The task scheduling module creates and dispatches tasks, the web page capture module fetches pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and the result storage module finally processes the results uniformly and stores them. The present invention can greatly improve crawling efficiency for data on specific sites, reduce the user's workload, and save system resources; it also offers good extensibility and scalability and is suitable for all types of Internet sites.

The foregoing are merely preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims

1. A distributed web crawler system based on a visual script editor, characterized in that the system comprises: a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module, and a result storage module;

The visual script editor is used to view the target website and to select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database; the script is the script corresponding to the target website;

The distributed message queue is used to decouple the task scheduling module, the web page capture module, the content processing module, and the result storage module; the distributed message queue includes a scheduling queue, a crawl queue, a processing queue, and a result queue;

The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL of the target website and the user's input information, packages them into a task, and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue;

The web page capture module is used to obtain a URL from the crawl queue, automatically detect the website's encoding, convert the captured content to UTF-8, package the UTF-8 content together with website-related information, and forward it to the processing queue;

The content processing module is used to obtain the web page content of a website from the processing queue; if the URL of the page matches a predefined URL matching rule, it invokes the script corresponding to that URL matching rule to parse the web page content of the website, and passes the parsed result to the result queue;

The result storage module is used to take result data out of the result queue, process and filter it uniformly according to the system's predefined configuration, and then store it in the database;

The system also includes a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, crawl queue, processing queue, and result queue) has failed; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether the script needs to be entered again;

The system also includes a text extraction module; when the domain name of a web page matches no script in the database, the text extraction module is invoked to extract the script corresponding to the web page, and it does so using a text extraction method based on visual web page block analysis.

2. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: if the parsing produces new URLs, the new URLs are passed to the scheduling queue and the task scheduling module runs again; if the parsing produces result data, the parsed result data is passed to the result queue.

3. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the task scheduling module includes a URL filtering module and a rate manager; the URL filtering module uses a Bloom filter to deduplicate URLs and prevent the same URL from being crawled repeatedly, the Bloom filter consisting of a binary vector and a series of random mapping functions and being used to test whether an element is in a set; the rate manager uses a token bucket algorithm to prevent network congestion, limiting the traffic flowing out to the network so that traffic is sent at a uniform rate, which keeps the system stable.

4. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the web page capture module includes a proxy access module and a browser simulation module; the proxy access module accesses the specified URL through a preset IP proxy according to the user configuration information, which prevents the IP of the server hosting the web page capture module from being blocked by the target website because of excessive traffic; the browser simulation module uses the open-source WebKit browser engine to parse the target website, and can execute the Javascript code on a page and generate the complete page of the target website.

5. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the execution chain includes several sub-parameters, and each sub-parameter can be one of several kinds: a lower-level URL selection rule, a metadata selection identifier, or script code that the system can execute.

6. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the visual script editor is implemented by the following flow:

Step 3: if there is no need to enter a lower-level URL of this page, go to step 5; if a lower-level URL needs to be entered, go to step 4;

Step 4: select the blocks containing the lower-level URLs; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;

Step 6: the user confirms that editing is complete;

Step 7: the visual script editor passes the recorded execution chain to the script generator, which produces the capture script for the target website; for advanced users, an additional interface is provided through which the user can write system-compatible code and embed it directly in the capture script;

Step 8: store the script in the database.