CN105243159B - A distributed web crawler system based on a visual script editor - Google Patents

A distributed web crawler system based on a visual script editor

Info

Publication number
CN105243159B
Authority
CN
China
Prior art keywords
queue
script
module
editing machine
url link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510713985.7A
Other languages
Chinese (zh)
Other versions
CN105243159A (en
Inventor
倪时龙
苏江文
王秋琳
陈予言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Yirong Information Technology Co Ltd
Original Assignee
Fujian Yirong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Yirong Information Technology Co Ltd filed Critical Fujian Yirong Information Technology Co Ltd
Priority to CN201510713985.7A priority Critical patent/CN105243159B/en
Publication of CN105243159A publication Critical patent/CN105243159A/en
Application granted granted Critical
Publication of CN105243159B publication Critical patent/CN105243159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a distributed web crawler system based on a visual script editor, comprising a visual script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module, and a result storage module. From the user's input in a visual interface, the system automatically generates metadata extraction scripts that recognize the structure of the target site and capture specific data efficiently. The task scheduling module creates and dispatches tasks, the webpage capture module fetches pages, the content processing module calls the corresponding script to convert each page into a metadata set, and the result storage module applies final unified processing and stores the results. The invention can greatly improve the efficiency of crawling data from specific sites, reduce user labor, and save system resources; it has good scalability and elasticity and is applicable to all types of Internet sites.

Description

A distributed web crawler system based on a visual script editor
Technical field
The present invention relates to the technical field of network communication, and in particular to a distributed web crawler system based on a visual script editor.
Background art
Since the birth of the Internet at the end of the 20th century, the volume of online information has grown explosively, and the Internet has become a vast, widely distributed, highly heterogeneous, semi-structured, and highly dynamic information repository. Web crawlers emerged to collect and extract the data people care about from this information. Crawler technology then developed rapidly and became the foundation on which search engine giants in China and abroad, such as Baidu and Google, were built, opening a window onto this information for the general public.
Today, Internet information is provided mainly through websites and web services. A website consists of many kinds of web pages, and the data it provides is mostly presented as unstructured, static Hypertext Markup Language (HTML). Because information analysis systems cannot use HTML directly, the HTML generally needs secondary processing before useful information can be extracted. A web service, by contrast, is a relatively standardized data interface from which data can be obtained by access with specific parameters; a web service can exist on its own or be combined with a website. How to obtain specific information efficiently and accurately from a large number of specific websites or web services is of growing interest, and it poses a major challenge to the web crawler technology responsible for collecting network information.
Web crawlers have gone through several generations of development, and a variety of system models have taken shape. Mature crawler designs exist in China and abroad and are already in use, but most of them offer only a general-purpose service to the public. They cannot be tailored to specific data on specific sites, and they cannot take each user's varied requirements into account.
In the Internet field, the mainstream crawler designs at present are the following:
1. Traditional crawler systems
A traditional crawler system requires professional programmers to analyze the page organization, data interfaces, and on-page JavaScript logic of the target site and then write corresponding program code or scripts that filter out specific data according to given rules. The advantage of this method is obvious: it can extract the required data from the target site accurately.
However, this method has serious drawbacks and is generally used only when the number of target sites is very limited. The reason is that the HTML used by Internet sites follows no fixed authoring convention, so a corresponding script must be written for every target site; in addition, more and more sites now load content dynamically, which makes the scripts much harder to write. When a monitored site is redesigned, the script must be adjusted promptly and the crawler redeployed. This greatly increases the labor cost of development and maintenance. Moreover, the complexity of this approach gives it poor scalability and elasticity, which makes it unsuitable for large-scale distributed deployment.
2. General-purpose distributed crawler systems
A general-purpose distributed crawler system consists of three basic parts: scheduling (control), fetching, and content processing. Most current Internet search engines work this way. For example, the prior art discloses "a topic-related distributed web crawler system" (Chinese patent, publication number CN102646129A, publication date 2012-08-22). That system includes: a topic link store, for storing hyperlinks the system has not yet finished crawling; a control node, for extracting hyperlinks from the topic link store, removing those already crawled by the system, distributing the uncrawled hyperlinks to crawl nodes, and deciding whether to terminate the system; crawl nodes, for receiving the hyperlinks distributed by the control node, downloading the web pages those hyperlinks identify, and storing the pages in a web page database; the web page database, for storing the pages fetched by the crawl nodes; and a page analyzer, for periodically reading the newest pages downloaded by the crawl nodes from the web page database, analyzing their content, computing the topic relevance of each page and of the hyperlinks it contains, storing the relevant hyperlinks in the topic link store according to that relevance, and storing the topic relevance of each page in the web page database. That invention uses exactly this mode. Such crawler systems concentrate mainly on URL filtering and on the analysis of page topics, and their content processing part essentially relies on a main-text analysis and extraction module.
Main-text analysis modules fall roughly into four categories: (1) text extraction algorithms based on tag usage; (2) text extraction based on tag density; (3) web page text extraction methods based on machine learning; and (4) text extraction based on vision-based page block analysis. Whichever algorithm is used, it can only extract backbone data such as the main text of a page, and it cannot guarantee the accuracy of the extracted data. These methods work reasonably well in distributed crawler systems, but they are limited by the algorithms they rely on and are suited only to broad, horizontal crawling of loosely defined data; they are inherently deficient for crawling specific data. To obtain maximum generality they sacrifice customizability: they can extract text from a page, but they cannot separate particular types of metadata out of that text, for example the product price on an e-commerce page or the drug specification on an online pharmacy page. Furthermore, most main-text analysis algorithms are relatively complex, and when used at scale they consume more system resources than customized scripts, degrading crawler performance.
Summary of the invention
The technical problem to be solved by the present invention is to provide a distributed web crawler system based on a visual script editor that supports efficient, customized crawling of a large number of specific sites while remaining compatible with general-purpose site crawling, thereby overcoming the defects of the prior art, reducing user labor, and saving system resources.
The present invention is implemented as follows: a distributed web crawler system based on a visual script editor, the system comprising a visual script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module, and a result storage module;
The visual script editor is used to view the target website and select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database; the script is the script corresponding to the target website;
The distributed message queue is used to decouple the task scheduling module, the webpage capture module, the content processing module, and the result storage module; it comprises a scheduling queue, a crawl queue, a processing queue, and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system: it reads in the start URL links of the target website and the user's input, packages them into tasks, and passes the tasks to the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue;
The webpage capture module takes URL links from the crawl queue, automatically detects the website's character encoding, converts the fetched content to UTF-8, packages the UTF-8 content together with site-related information, and forwards it to the processing queue;
The content processing module takes web page content from the processing queue and matches the page's URL link against the URL matching rules generated by the visual script editor; if a match is found, it calls the script corresponding to that URL matching rule to parse the page content, and passes the parsed result to the result queue;
The result storage module takes result data from the result queue, applies unified processing and filtering to it according to the system's predefined configuration, and then stores it in the database.
Further, the system also includes a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, crawl queue, processing queue, and result queue) has encountered an error; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and decide whether to redo the script input.
Further, the system also includes a text extraction module; when the domain name of a web page matches no script in the database, the text extraction module is called to extract the script corresponding to the web page, using text extraction based on vision-based page block analysis.
Further, when the script is called for parsing: if parsing produces new URL links, the new URL links are passed to the scheduling queue and the task scheduling module is executed again; if parsing produces result data, the parsed result data is passed to the result queue.
Further, the task scheduling module includes a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URL links and prevent the same URL link from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set. The rate manager uses a token bucket algorithm to prevent network congestion: it limits outbound traffic so that traffic is sent at a uniform rate, ensuring system stability.
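The URL deduplication described above can be illustrated with a minimal Bloom filter sketch. The patent does not specify the vector size, the number of mapping functions, or the hash construction; the salted SHA-256 hashing and the names BloomFilter and filter_new_urls below are assumptions made only for illustration.

```python
import hashlib


class BloomFilter:
    """Set-membership test over a bit vector; false positives are possible."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the binary vector

    def _positions(self, item: str):
        # Assumed construction: one salted SHA-256 digest per mapping function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_new_urls(urls, seen: BloomFilter):
    """Return the URLs not seen before, in order, and mark them as seen."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A false positive here means a URL that was never crawled is skipped, so the bit-vector size would in practice be chosen relative to the expected number of URLs.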
Further, the webpage capture module includes a proxy access module and a browser simulation module. The proxy access module accesses the specified URL link through preset IP proxies according to the user configuration information, preventing the IP of the server hosting the webpage capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website and can execute the JavaScript code on the page to generate the complete page of the target website.
Further, the execution chain comprises several sub-parameters, and each sub-parameter can be one of several kinds: a lower-level URL link selection rule, a metadata selection identifier, or script code executable by the system.
Further, the visual script editor is implemented according to the following flow:
Step 1: enter the URL link address of the target website in the visual script editor interface;
Step 2: the visual script editor displays the web page content of the target website URL link;
Step 3: if there is no need to enter a lower-level URL link of this page, go to step 5; if a lower-level URL link needs to be entered, go to step 4;
Step 4: select the blocks containing the lower-level URL links; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: select the blocks whose content needs to be captured and store them in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to the script generator, which produces the capture script for the target website; an additional interface is provided for advanced users, who can write code compatible with the system and embed it directly in the capture script;
Step 8: the script is stored in the database.
The present invention has the following advantages: the visual script editor lets non-professional users intuitively select the capture regions for the relevant data on a target site, and the user's operations are converted automatically into specific processing scripts. While the crawler system is running, each of its distributed processing units dynamically executes these scripts with priority, which greatly reduces the labor cost of building customized crawlers and improves the crawler system's operating efficiency. The system captures data with high accuracy and is highly scalable and elastic.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the system of the present invention.
Fig. 2 is a workflow diagram of the system of the present invention.
Fig. 3 is a schematic diagram of the execution structure of the visual script editor of the present invention.
Fig. 4 is a schematic workflow diagram of the visual script editor of the present invention.
Fig. 5 is a flowchart of how the execution chain of the present invention operates.
Fig. 6 is a workflow diagram of the content processing module and scripts of the present invention.
Fig. 7 is a schematic structural diagram of one embodiment of the system of the present invention.
Detailed description of the embodiments
Referring to Fig. 1 to Fig. 7, the distributed web crawler system based on a visual script editor of the present invention comprises: a visual script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module, and a result storage module;
The visual script editor is used to view the content of the target website visually and to select the data capture regions of the target website. It converts the user's input (the input generated by all user operations, from entering the target site URL link through to finishing editing) and other optional parameters (for example, whether to use text extraction and whether to simulate a browser) into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database. The visual script editor lets users view the target website as if browsing the web normally, with no programming skills required. The script is the script corresponding to the target website;
The configuration management module provides a web interface in which the user can configure the websites to be crawled and, for one website or a series of websites, configure the scheduling policy (for example, priority, periodic crawling, and re-crawl interval), the crawl policy (for example, retry on error, enabling proxies, and enabling browser simulation), and other configuration parameters, which together form the user configuration information.
The distributed message queue is used to decouple the task scheduling module, the webpage capture module, the content processing module, and the result storage module, enabling highly distributed deployment. The distributed message queue comprises a scheduling queue, a crawl queue, a processing queue, and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system. It reads in the start URL links of the target website (the target website being the website to be processed and evaluated) and the user's input, packages them into tasks, and passes them to the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the rest to the crawl queue. The task scheduling module includes a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URL links and prevent the same URL link from being crawled repeatedly. A Bloom filter is in fact a very long binary vector together with a series of random mapping functions, used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements. Using a Bloom filter can greatly improve system efficiency, and its disadvantages have no effect on the crawler system, so it is well suited to crawlers. The rate manager uses a token bucket algorithm to prevent network congestion: it limits outbound traffic so that traffic is sent at a uniform rate, ensuring system stability.
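The rate manager's token bucket can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name TokenBucket, the rate and capacity parameters, and the injectable clock are all assumptions introduced here.

```python
import time


class TokenBucket:
    """Allow a request only when a token is available; tokens refill over time."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens (burst size)
        self.clock = clock        # injectable time source, useful for testing
        self.tokens = capacity
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens and return True, or return False if too few remain."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A scheduler could call try_acquire() before moving each task to the crawl queue and requeue the task when it returns False. With a capacity of 1 the outgoing requests are spaced almost evenly; a larger capacity permits short bursts.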
The webpage capture module takes URL links from the crawl queue, automatically detects the website's character encoding, converts the fetched content to UTF-8, packages the UTF-8 content together with site-related information, and forwards it to the processing queue. The webpage capture module includes a proxy access module and a browser simulation module. With the development of network technology, more and more websites use dynamic page techniques and rely on large amounts of JavaScript to generate page content; traditional page fetching obtains only the page source and cannot execute JavaScript, so it cannot obtain the complete page of the target site, which makes data extraction far harder. The proxy access module of the present invention accesses the specified URL link through preset IP proxies according to the user configuration information, preventing the IP of the server hosting the webpage capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website and can execute the JavaScript code on the page to generate the complete page of the target website.
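The encoding step of the webpage capture module can be illustrated with a short sketch. The patent does not say how the encoding is detected; the approach below (HTTP Content-Type header first, then an HTML meta tag, then a UTF-8 fallback) and the function names detect_charset and to_utf8 are assumptions for illustration only.

```python
import re

# Matches charset declarations in an HTML <meta> tag (searched in raw bytes).
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)
# Matches the charset parameter of an HTTP Content-Type header value.
_HEADER_CHARSET = re.compile(r"charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def detect_charset(raw: bytes, content_type: str = "") -> str:
    """Guess the page encoding: header first, then <meta>, else UTF-8."""
    match = _HEADER_CHARSET.search(content_type)
    if match:
        return match.group(1)
    match = _META_CHARSET.search(raw[:4096])  # declarations sit near the top
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"


def to_utf8(raw: bytes, content_type: str = "") -> bytes:
    """Re-encode fetched page bytes as UTF-8."""
    charset = detect_charset(raw, content_type)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back rather than drop the page.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

Normalizing to UTF-8 at this stage means every script run by the content processing module can assume a single encoding.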
The content processing module takes web page content from the processing queue. If the page's URL link matches a predefined URL matching rule (the user enters the target site URL link in the visual script editor in advance, and the editor generates a URL matching rule from the conditions the user sets), the module calls the script that matches this URL link to parse the page content, and passes the parsed result to the result queue. The result storage module takes result data from the result queue, applies unified processing and filtering to it according to the system's predefined configuration, and then stores it in the database.
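The rule-to-script dispatch performed by the content processing module might look like the following sketch. The patent does not define the format of a URL matching rule; regular expressions are assumed here, and ScriptRegistry, process_page, and the fallback argument (standing in for the text extraction module) are hypothetical names.

```python
import re
from typing import Callable, List, Optional, Tuple

Script = Callable[[str], dict]  # takes page HTML, returns parsed data


class ScriptRegistry:
    """Maps URL matching rules (assumed to be regexes) to parsing scripts."""

    def __init__(self) -> None:
        self._rules: List[Tuple[re.Pattern, Script]] = []

    def register(self, url_pattern: str, script: Script) -> None:
        self._rules.append((re.compile(url_pattern), script))

    def find(self, url: str) -> Optional[Script]:
        # The first rule whose pattern matches the whole URL wins.
        for pattern, script in self._rules:
            if pattern.fullmatch(url):
                return script
        return None


def process_page(url: str, html: str, registry: ScriptRegistry, fallback: Script) -> dict:
    """Parse with the matching script, or with the fallback extractor."""
    script = registry.find(url)
    return (script or fallback)(html)
```

Keeping the rules in a registry separate from the scripts lets new sites be added at run time without restarting the processing instances.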
The system also includes a monitoring module and a text extraction module. The monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, crawl queue, processing queue, and result queue) has encountered an error; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and decide whether to redo the script input.
When the domain name of a web page matches no script in the database, the text extraction module is called to extract the script corresponding to the web page, using text extraction based on vision-based page block analysis.
In the present invention, when the script is called for parsing: if parsing produces new URL links, the new URL links are passed to the scheduling queue and the task scheduling module is executed again; if parsing produces result data, the parsed result data is passed to the result queue.
The execution chain comprises several sub-parameters, and each sub-parameter can be one of several kinds: a lower-level URL link selection rule, a metadata selection identifier (in a format such as a CSS or XPath selector), or script code executable by the system.
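One possible in-memory shape for an execution chain is sketched below. The three step kinds come from the description above; the class and field names (StepKind, ChainStep, ExecutionChain, selector, name, code) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class StepKind(Enum):
    LINK_RULE = "link_rule"  # lower-level URL link selection rule
    METADATA = "metadata"    # metadata selection identifier
    CODE = "code"            # script code executable by the system


@dataclass
class ChainStep:
    kind: StepKind
    selector: str = ""  # CSS or XPath expression, for LINK_RULE and METADATA
    name: str = ""      # output field name, for METADATA
    code: str = ""      # embedded source text, for CODE


@dataclass
class ExecutionChain:
    start_url: str
    steps: List[ChainStep] = field(default_factory=list)

    def add(self, step: ChainStep) -> "ExecutionChain":
        """Append a step and return the chain so calls can be chained."""
        self.steps.append(step)
        return self

    def link_rules(self) -> List[str]:
        return [s.selector for s in self.steps if s.kind is StepKind.LINK_RULE]

    def metadata_fields(self) -> Dict[str, str]:
        return {s.name: s.selector for s in self.steps if s.kind is StepKind.METADATA}
```

Under this assumption, a chain for a product listing would hold one LINK_RULE step for the item links and one METADATA step per field to capture.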
As shown in Figs. 3, 4, and 5, the visual script editor is implemented according to the following flow:
Step 1: enter the URL link address of the target website in the visual script editor interface;
Step 2: the visual script editor displays the web page content of the target website URL link;
Step 3: if there is no need to enter a lower-level URL link of this page, go to step 5; if a lower-level URL link needs to be entered, go to step 4;
Step 4: select the blocks containing the lower-level URL links; the visual script editor records the positions of these blocks and stores them in an execution chain (see Fig. 5 for the detailed operation), with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: select the blocks whose content needs to be captured and store them in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to the script generator, which produces the capture script for the target website; an additional interface is provided for advanced users, who can write code compatible with the system and embed it directly in the capture script;
Step 8: the script is stored in the database.
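Steps 7 and 8 above can be sketched as a small script generator plus a script store. The patent does not describe the script format or the database; the JSON layout, the dictionary keys "kind", "selector", and "name", the SQLite table, and all function names below are assumptions made for illustration.

```python
import json
import sqlite3
from urllib.parse import urlparse


def generate_script(start_url: str, chain: list, custom_code: str = "") -> str:
    """Turn a recorded execution chain into a declarative script (JSON text)."""
    script = {
        "start_url": start_url,
        # Step 4 selections: blocks holding lower-level URL links.
        "follow": [s["selector"] for s in chain if s["kind"] == "link"],
        # Step 5 selections: blocks whose content is captured.
        "extract": {s["name"]: s["selector"] for s in chain if s["kind"] == "field"},
        # Step 7: optional code supplied by an advanced user.
        "custom_code": custom_code,
    }
    return json.dumps(script, ensure_ascii=False, sort_keys=True)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the script database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS scripts (domain TEXT PRIMARY KEY, body TEXT)")
    return db


def save_script(db: sqlite3.Connection, start_url: str, script: str) -> None:
    """Step 8: store the script, keyed by the target site's domain."""
    domain = urlparse(start_url).netloc
    db.execute("INSERT OR REPLACE INTO scripts VALUES (?, ?)", (domain, script))
    db.commit()


def load_script(db: sqlite3.Connection, domain: str):
    """Return the stored script as a dict, or None if the domain is unknown."""
    row = db.execute("SELECT body FROM scripts WHERE domain = ?", (domain,)).fetchone()
    return json.loads(row[0]) if row else None
```

Keying the store by domain mirrors the description's domain-based script lookup; a finer-grained key would be needed if one site required several scripts.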
Fig. 2 shows the workflow of the system of the present invention, which is as follows:
(1) The task scheduling module accesses the configuration management module, reads in the start URL links and the user configuration information, packages them into tasks, and passes the tasks to the scheduling queue.
(2) The task scheduling module takes a task object from the scheduling queue and queries the URL filtering module. If the task's URL link has not been visited, the task is sent directly to the crawl queue. If it has been visited, the user-set parameters (such as the revisit time) are checked: if a revisit is allowed, the task is also sent to the crawl queue; otherwise the task is discarded. In this way only deduplicated tasks reach the crawl queue.
(3) The webpage capture module takes a URL link from the crawl queue, performs the fetch, automatically detects the website's character encoding, converts the fetched content to UTF-8, packages it with site-related information, and forwards it to the processing queue.
(4) The content processing module takes web page content from the processing queue. If information such as the page's domain name matches a script predefined by the user (that is, a script in the database), that script is called to parse the page. If processing produces new URL links, these links are passed to the scheduling queue and the flow re-enters step (2); if it produces result data, the data is passed to the result queue.
(5) The result storage module takes results from the result queue, applies final unified processing according to the predefined configuration, and stores them in the database.
(6) Steps (2) to (5) are repeated until the system receives a stop command.
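The workflow steps above can be traced in a single-process sketch that uses four in-memory queues. This is a simplification under stated assumptions: a real deployment would use a distributed message queue with each module running separately, a Python set stands in for the Bloom filter, a list stands in for the database, and run_pipeline, fetch, parse, and max_pages are hypothetical names.

```python
import queue
from typing import Callable, Iterable, List, Tuple

ParseResult = Tuple[List[str], List[dict]]  # (new URL links, result records)


def run_pipeline(
    start_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str, str], ParseResult],
    max_pages: int = 100,
) -> List[dict]:
    scheduling, crawl, processing, results = (queue.Queue() for _ in range(4))
    seen = set()   # stands in for the URL filtering module
    stored = []    # stands in for the database
    for url in start_urls:                    # step (1): enqueue start tasks
        scheduling.put(url)
    pages = 0
    while not scheduling.empty() and pages < max_pages:  # step (6): repeat
        url = scheduling.get()                # step (2): deduplicate
        if url in seen:
            continue
        seen.add(url)
        crawl.put(url)
        target = crawl.get()                  # step (3): fetch the page
        processing.put((target, fetch(target)))
        pages += 1
        page_url, html = processing.get()     # step (4): parse the page
        new_urls, records = parse(page_url, html)
        for new_url in new_urls:
            scheduling.put(new_url)
        for record in records:
            results.put(record)
        while not results.empty():            # step (5): store results
            stored.append(results.get())
    return stored
```

The max_pages bound replaces the stop command of step (6) so that the sketch always terminates.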
Fig. 7 is a schematic structural diagram of one embodiment of the system of the present invention. Each module of the invention can be deployed on a single machine with multiple instances, on multiple machines with a single instance each, or on multiple machines with multiple instances each. That is, the system of the present invention supports distributed deployment.
In addition, every data object that this system sends to a message queue is called a task object. A task object contains: 1. content (a URL link, web page content, result data, and so on, depending on the message queue); 2. configuration parameters; 3. a status identifier.
In practice, a module always takes a task object from a message queue first and then extracts the relevant information from the task object.
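A task object with the three parts listed above could be modeled as follows. The patent names the parts but not their types or values, so the TaskStatus members and the advance helper are hypothetical.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Any, Dict


class TaskStatus(Enum):
    # Hypothetical status identifiers; the patent does not enumerate them.
    SCHEDULED = "scheduled"
    FETCHED = "fetched"
    PARSED = "parsed"
    FAILED = "failed"


@dataclass(frozen=True)
class TaskObject:
    content: Any                                           # URL link, page content, or result data
    config: Dict[str, Any] = field(default_factory=dict)   # configuration parameters
    status: TaskStatus = TaskStatus.SCHEDULED              # status identifier

    def advance(self, content: Any, status: TaskStatus) -> "TaskObject":
        """Return a copy for the next queue, carrying the same configuration."""
        return replace(self, content=content, status=status)
```

Carrying the configuration inside each task object lets a module instance process any task it dequeues without consulting shared state.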
It should be noted that the task scheduling module, webpage capture module, content processing module, and result storage module of the present invention can each run as multiple instances on multiple servers. Because they are decoupled through the message queue, instances of any type can be stopped or added at any time. This design maximizes the scalability and elasticity of the system.
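The effect of this decoupling can be illustrated with threads standing in for module instances on separate servers. This is only an analogy under stated assumptions: start_workers, stop_workers, and the None stop sentinel are hypothetical, and a real deployment would use a networked message queue rather than queue.Queue.

```python
import queue
import threading
from typing import Callable, List


def start_workers(n: int, inbox: queue.Queue, outbox: queue.Queue,
                  handle: Callable) -> List[threading.Thread]:
    """Start n identical instances that consume from inbox and publish to outbox."""

    def loop() -> None:
        while True:
            item = inbox.get()
            if item is None:          # sentinel: stop this instance
                inbox.task_done()
                break
            try:
                outbox.put(handle(item))
            finally:
                inbox.task_done()

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(n)]
    for thread in threads:
        thread.start()
    return threads


def stop_workers(threads: List[threading.Thread], inbox: queue.Queue) -> None:
    """Send one stop sentinel per instance and wait for all of them to exit."""
    for _ in threads:
        inbox.put(None)
    for thread in threads:
        thread.join()
```

Because instances share nothing except the queues, calling start_workers again adds capacity without touching the instances that are already running.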
In short, from the user's input in a visual interface, the system automatically generates metadata extraction scripts that recognize the structure of the target site and capture specific data efficiently. The task scheduling module creates and dispatches tasks, the webpage capture module fetches pages, the content processing module calls the corresponding script to convert each page into a metadata set, and the result storage module applies final unified processing and stores the results. The invention can greatly improve the efficiency of crawling data from specific sites, reduce user labor, and save system resources; it has good scalability and elasticity and is applicable to all types of Internet sites.
The foregoing is only a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims (6)

1. A distributed web crawler system based on a visual script editor, characterized in that the system comprises: a visual script editor, a distributed message queue, a task scheduling module, a web page fetching module, a content processing module, and a result storage module;
The visual script editor is used to view a target website and to select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database; the script is the script corresponding to the target website;
The distributed message queue is used to decouple the task scheduling module, the web page fetching module, the content processing module, and the result storage module; the distributed message queue comprises a scheduling queue, a fetch queue, a processing queue, and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system; it reads in the starting URL link of the target website and the user's input information, packages them into a task, and passes the task to the scheduling queue; it obtains task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the fetch queue;
The web page fetching module is used to get a URL link from the fetch queue, automatically detect the website's encoding, and convert the content of the fetched website to UTF-8; it packages the transcoded content together with the related website information and sends the package to the processing queue;
The content processing module is used to get the web page content of a website from the processing queue; if the URL link of the web page matches a predefined URL matching rule, the script corresponding to that URL matching rule is called to parse the web page content; the parsed result is passed to the result queue;
The result storage module is used to take result data out of the result queue, uniformly process and filter the result data according to the system's predefined configuration, and then store it in the database;
The system further comprises a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the fetch queue, the processing queue, and the result queue) has an error; when an anomaly is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether to re-enter the script;
The system further comprises a text extraction module; when the domain name of a web page cannot be matched to any script in the database, the text extraction module is called to extract the script corresponding to the web page; the text extraction module performs the extraction using a text extraction method based on vision-based web page block analysis.
2. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: if the parsing produces a new URL link, the new URL link is passed to the scheduling queue and the task scheduling module is executed again; if the parsing produces result data, the parsed result data is passed to the result queue.
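The queue flow described in claims 1 and 2 can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: `queue.Queue` stands in for the distributed message queue, a plain `set` stands in for the URL filter, and `download` and `parse` are hypothetical callables supplied by the caller. `download` is assumed to return the raw bytes together with the detected charset.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Four in-process queues standing in for the distributed message queue."""
    scheduling: queue.Queue = field(default_factory=queue.Queue)
    fetch: queue.Queue = field(default_factory=queue.Queue)
    processing: queue.Queue = field(default_factory=queue.Queue)
    result: queue.Queue = field(default_factory=queue.Queue)
    seen: set = field(default_factory=set)  # stand-in for the URL filter


def schedule(p, ):
    """Move tasks from the scheduling queue to the fetch queue, dropping repeats."""
    while not p.scheduling.empty():
        url = p.scheduling.get()
        if url not in p.seen:
            p.seen.add(url)
            p.fetch.put(url)


def fetch(p, download):
    """Download each URL and forward its content re-encoded as UTF-8."""
    while not p.fetch.empty():
        url = p.fetch.get()
        raw, charset = download(url)  # hypothetical: (bytes, charset name)
        text = raw.decode(charset, errors="replace")
        p.processing.put({"url": url, "content": text.encode("utf-8")})


def process(p, parse):
    """Parse each page: new links return to scheduling, data goes to results."""
    while not p.processing.empty():
        page = p.processing.get()
        links, records = parse(page["url"], page["content"].decode("utf-8"))
        for link in links:
            p.scheduling.put(link)
        for record in records:
            p.result.put(record)


def crawl(p, start_url, download, parse):
    """Run the three stages until no scheduled work remains; return all results."""
    p.scheduling.put(start_url)
    while not p.scheduling.empty():
        schedule(p)
        fetch(p, download)
        process(p, parse)
    results = []
    while not p.result.empty():
        results.append(p.result.get())
    return results
```

Because each stage only reads from one queue and writes to the next, the stages could in principle run on separate machines behind a real message broker, which is the decoupling the claim describes.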
3. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the task scheduling module comprises a URL filtering module and a rate manager; the URL filtering module uses a Bloom filter to deduplicate URL links, preventing the same URL link from being crawled repeatedly; the Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set; the rate manager uses the token bucket algorithm to prevent network congestion, limiting the traffic flowing out to the network so that it is sent at a uniform rate, thereby ensuring the stability of the system.
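The two mechanisms named in claim 3 can be illustrated with a short sketch. The sizes, the use of salted SHA-256 as the family of mapping functions, and the injectable `clock` parameter are all assumptions made for this example; the patent does not specify them. Note that a token bucket permits bursts up to its capacity, so a capacity of 1 gives the most even pacing.

```python
import hashlib
import time


class BloomFilter:
    """Bit vector plus k hash functions; may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends one."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self, cost=1.0):
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A scheduler would typically check `url in bloom` before enqueueing a task and call `try_acquire()` before releasing each task to the fetch queue, waiting briefly when it returns `False`.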
4. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the web page fetching module comprises a proxy access module and a browser simulation module; the proxy access module, according to user configuration information, accesses the specified URL link through a preset IP proxy, preventing the IP of the server hosting the web page fetching module from being blocked by the target website because of excessive access volume; the browser simulation module uses the open-source WebKit browser engine to parse the target website and is able to execute the JavaScript code on the page to generate the complete page of the target website.
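The proxy access module in claim 4 could look roughly like the sketch below, which rotates through a user-configured list of proxies using only the Python standard library. The `ProxyRotator` class, the round-robin policy, and the example addresses are assumptions for illustration; the claim only states that a preset IP proxy is used. The WebKit-based browser simulation is not covered here.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hands out URL openers that route through the configured proxies in turn."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(list(proxy_urls))

    def next_opener(self):
        """Return (proxy_url, opener) for the next proxy in round-robin order."""
        proxy = next(self._cycle)
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

A fetcher would call `proxy, opener = rotator.next_opener()` and then `opener.open(url, timeout=...)`, so that consecutive requests leave from different addresses.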
5. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the execution chain comprises several sub-parameters, and each sub-parameter has several options, the options including: a selection rule for lower-level URL links, a metadata selection marker, or script code executable by the system.
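One possible in-memory shape for the execution chain of claim 5 is sketched below. The class and field names (`StepKind`, `Step`, `ExecutionChain`) are hypothetical and chosen only to mirror the three kinds of sub-parameter the claim lists.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class StepKind(str, Enum):
    LINK_RULE = "link_rule"  # selection rule for lower-level URL links
    METADATA = "metadata"    # marker selecting a piece of data to capture
    CODE = "code"            # script code the system can execute


@dataclass(frozen=True)
class Step:
    kind: StepKind
    value: str      # a CSS/XPath selector, or source code for CODE steps
    name: str = ""  # optional field name for METADATA steps


@dataclass
class ExecutionChain:
    start_url: str
    steps: List[Step] = field(default_factory=list)

    def add(self, kind, value, name=""):
        """Append one sub-parameter and return self so calls can be chained."""
        self.steps.append(Step(kind, value, name))
        return self

    def to_dict(self):
        """Plain-dict form suitable for storage or for a script generator."""
        data = asdict(self)
        for step in data["steps"]:
            step["kind"] = step["kind"].value
        return data

    def to_json(self):
        return json.dumps(self.to_dict(), ensure_ascii=False)
```

Keeping the chain as plain data makes it straightforward to persist alongside the generated script and to regenerate the script later if the generator changes.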
6. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the visual script editor is implemented by the following flow:
Step 1: the URL link address of the target website is entered in the visual script editor interface;
Step 2: the visual script editor presents the web page content of the target website's URL link;
Step 3: if there is no need to enter a lower-level URL link of this web page, go to step 5; if a lower-level URL link needs to be entered, go to step 4;
Step 4: the blocks containing lower-level URL links are selected; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: several blocks whose content needs to be captured are selected and stored in an execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to a script generator, which produces the corresponding capture script for the target website; in addition, an extra interface is provided for advanced users, through which a user can write system-compatible code that is embedded directly into the capture script;
Step 8: the script is stored in the database.
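Step 7 of the flow above, turning recorded selections into a capture script, can be sketched as follows. Everything here is an assumption for illustration: the generated script is Python source, selectors are limited to the XPath subset that `xml.etree.ElementTree` supports, and the advanced-user hook is modeled as an optional `postprocess(record)` function. Executing user-supplied code with `exec` is only appropriate for trusted input.

```python
import xml.etree.ElementTree as ET


def generate_script(link_xpaths, field_xpaths, user_code=""):
    """Build the source of a capture script from recorded block positions."""
    lines = [
        "def postprocess(record):",
        "    return record",
        "",
    ]
    if user_code:
        # Advanced-user code may redefine postprocess(record).
        lines += [user_code, ""]
    lines += ["def parse(root):", "    links, record = [], {}"]
    for xpath in link_xpaths:
        lines.append(
            f"    links += [a.get('href') for a in root.findall({xpath!r})]"
        )
    for name, xpath in field_xpaths.items():
        lines.append(f"    node = root.find({xpath!r})")
        lines.append(
            f"    record[{name!r}] = node.text if node is not None else None"
        )
    lines.append("    return links, postprocess(record)")
    return "\n".join(lines) + "\n"


def load_script(source):
    """Compile a generated script and return its parse(root) function."""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["parse"]
```

For example, a chain that records one lower-level link block and one content block, plus a user hook that trims whitespace:

```python
source = generate_script(
    [".//a[@class='next']"],
    {"title": ".//h1"},
    "def postprocess(record):\n"
    "    record['title'] = record['title'].strip()\n"
    "    return record",
)
parse = load_script(source)
page = ET.fromstring(
    "<html><body><h1> Hello </h1><a class='next' href='/p2'>next</a></body></html>"
)
links, record = parse(page)
```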
CN201510713985.7A 2015-10-28 2015-10-28 A kind of distributed network crawler system based on visualization script editing machine Active CN105243159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510713985.7A CN105243159B (en) 2015-10-28 2015-10-28 A kind of distributed network crawler system based on visualization script editing machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510713985.7A CN105243159B (en) 2015-10-28 2015-10-28 A kind of distributed network crawler system based on visualization script editing machine

Publications (2)

Publication Number Publication Date
CN105243159A CN105243159A (en) 2016-01-13
CN105243159B true CN105243159B (en) 2019-06-25

Family

ID=55040807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510713985.7A Active CN105243159B (en) 2015-10-28 2015-10-28 A kind of distributed network crawler system based on visualization script editing machine

Country Status (1)

Country Link
CN (1) CN105243159B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10701087B2 (en) * 2015-11-02 2020-06-30 Nippon Telegraph And Telephone Corporation Analysis apparatus, analysis method, and analysis program
CN107180050A (en) * 2016-03-11 2017-09-19 精硕科技(北京)股份有限公司 A kind of data grabber system and method
CN106886547A (en) * 2016-07-13 2017-06-23 阿里巴巴集团控股有限公司 A kind of scenario generation method and device
CN106168985A (en) * 2016-08-26 2016-11-30 南京车易淘网络信息技术有限公司 A kind of can the reptile method of fast distributed deployment
CN108228614B (en) * 2016-12-14 2022-03-18 北京国双科技有限公司 Method and device for detecting webpage broken link
CN106933973A (en) * 2017-02-14 2017-07-07 广州优亿信息科技有限公司 A kind of visual network reptile method
CN106980687B (en) * 2017-03-31 2020-05-22 北京奇艺世纪科技有限公司 Resource downloading system, method and crawler downloading system
CN107103242B (en) * 2017-05-11 2020-07-17 北京安赛创想科技有限公司 Data acquisition method and device
CN107317724B (en) * 2017-06-06 2020-12-11 中证信用增进股份有限公司 Data acquisition system and method based on cloud computing technology
CN110020066B (en) * 2017-07-31 2021-09-07 北京国双科技有限公司 Method and device for annotating tasks to crawler platform
CN107870965A (en) * 2017-08-11 2018-04-03 成都萌想科技有限责任公司 One kind visualization data collecting system
CN107577788B (en) * 2017-09-15 2021-12-31 广东技术师范大学 E-commerce website topic crawler method for automatically structuring data
CN108108440A (en) * 2017-12-21 2018-06-01 北京慧数科技有限公司 The acquisition method of proxy server and internet data
CN108549678B (en) * 2018-04-02 2020-06-19 北京今朝在线科技有限公司 Information acquisition system
CN109285046A (en) * 2018-08-10 2019-01-29 浙江工业大学 A kind of electric business big data acquisition system based on business plug-in unit
CN108875091B (en) * 2018-08-14 2020-06-02 杭州费尔斯通科技有限公司 Distributed web crawler system with unified management
CN109101636A (en) * 2018-08-16 2018-12-28 成都市映潮科技股份有限公司 A kind of method, apparatus and system carrying out data acquisition in cloud by visual configuration
CN109284430A (en) * 2018-09-07 2019-01-29 杭州艾塔科技有限公司 Visualization subject web page content based on distributed structure/architecture crawls system and method
CN109522466B (en) * 2018-10-20 2023-04-07 河南工程学院 Distributed crawler system
CN109783715A (en) * 2019-01-08 2019-05-21 鑫涌算力信息科技(上海)有限公司 Network crawler system and method
CN109614539A (en) * 2019-01-16 2019-04-12 重庆金融资产交易所有限责任公司 Data grab method, device and computer readable storage medium
CN109948026A (en) * 2019-03-28 2019-06-28 深信服科技股份有限公司 A kind of web data crawling method, device, equipment and medium
CN110807137A (en) * 2019-04-11 2020-02-18 上海丛云信息科技有限公司 Distributed big data acquisition implementation method
CN110020062B (en) * 2019-04-12 2021-09-24 北京邮电大学 Customizable web crawler method and system
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN110457556B (en) * 2019-07-04 2023-11-14 重庆金融资产交易所有限责任公司 Distributed crawler system architecture, method for crawling data and computer equipment
CN110413276B (en) * 2019-07-31 2024-04-09 网易(杭州)网络有限公司 Parameter editing method and device, electronic equipment and storage medium
CN110851681B (en) * 2019-10-12 2024-07-09 平安科技(深圳)有限公司 Crawler processing method, crawler processing device, server and computer readable storage medium
CN112783615B (en) * 2019-11-08 2024-03-01 北京沃东天骏信息技术有限公司 Data processing task cleaning method and device
CN111045659A (en) * 2019-11-11 2020-04-21 国家计算机网络与信息安全管理中心 Method and system for collecting project list of Internet financial webpage
CN111178057B (en) * 2020-01-02 2024-01-30 大汉软件股份有限公司 Content analysis and extraction system for government electronic documents
CN111310002B (en) * 2020-04-17 2023-04-07 西安热工研究院有限公司 General crawler system based on distributor and configuration table combination
CN111651656B (en) * 2020-06-02 2023-02-24 重庆邮电大学 Method and system for dynamic webpage crawler based on agent mode
CN112100061A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Visual crawler code compiling and debugging method
CN112256636A (en) * 2020-11-10 2021-01-22 国网湖南省电力有限公司 Data acquisition system for mobile application APP
CN112364226A (en) * 2020-11-12 2021-02-12 江苏易启策网络科技有限公司 Interactive information acquisition method and system based on dynamic content analysis
CN112434205A (en) * 2020-11-30 2021-03-02 北京秒针人工智能科技有限公司 Data integration capturing method and system based on data site and computer equipment
CN112667873A (en) * 2020-12-16 2021-04-16 北京华如慧云数据科技有限公司 Crawler system and method suitable for general data acquisition of most websites
CN112487269B (en) * 2020-12-22 2023-10-24 安徽商信政通信息技术股份有限公司 Method and device for detecting automation script of crawler
CN112328238B (en) * 2021-01-05 2021-03-30 深圳点猫科技有限公司 Building block code execution control method, system and storage medium
CN112818201A (en) * 2021-02-07 2021-05-18 四川封面传媒有限责任公司 Network data acquisition method and device, computer equipment and storage medium
CN113742550B (en) * 2021-08-20 2024-04-19 广州市易工品科技有限公司 Browser-based data acquisition method, device and system
CN113656674B (en) * 2021-08-30 2023-06-27 山谷网安科技股份有限公司 Automatic processing method and device for click type hyperlink in website crawler
CN113934912A (en) * 2021-11-11 2022-01-14 北京搜房科技发展有限公司 Data crawling method and device, storage medium and electronic equipment
CN113918793A (en) * 2021-12-10 2022-01-11 江苏宝和数据股份有限公司 Multi-source scientific and creative resource data acquisition method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001033345A1 (en) * 1999-11-02 2001-05-10 Alta Vista Company System and method for enforcing politeness while scheduling downloads in a web crawler
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104765592A (en) * 2014-01-03 2015-07-08 任子行网络技术股份有限公司 Plugin management method and device facing web page acquisition task

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
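Research and Design of a General-Purpose Crawler System for Search Engines; Gao Long (高龙); China Master's Theses Full-text Database, Information Science and Technology; 2013-09-15 (No. 09); pp. I138-518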

Also Published As

Publication number Publication date
CN105243159A (en) 2016-01-13

Similar Documents

Publication Publication Date Title
CN105243159B (en) A kind of distributed network crawler system based on visualization script editing machine
CN101222349B (en) Method and system for collecting web user action and performance data
CN107317724B (en) Data acquisition system and method based on cloud computing technology
CN101651707B (en) Method for automatically acquiring user behavior log of network
Kung et al. An object-oriented web test model for testing web applications
CN106021257B (en) A kind of crawler capturing data method, apparatus and system for supporting online programming
CN107391775A (en) A kind of general web crawlers model implementation method and system
US20120265824A1 (en) Method and system for configuration-controlled instrumentation of application programs
CN104077402B (en) Data processing method and data handling system
CN107071009A (en) A kind of distributed big data crawler system of load balancing
CN106096056A (en) A kind of based on distributed public sentiment data real-time collecting method and system
CN107885777A (en) A kind of control method and system of the crawl web data based on collaborative reptile
CN105260388A (en) Optimization method of distributed vertical crawler service system
CN107729564A (en) A kind of distributed focused web crawler web page crawl method and system
CN102262635A (en) Page crawler system and page crawler method
CN107766509A (en) A kind of method and apparatus of webpage static backup
CN109359231A (en) A kind of information crawler method, server and the storage medium of distributed network crawler
CN107239563A (en) Public feelings information dynamic monitoring and controlling method
CN109840298A (en) The multi information source acquisition method and system of large scale network data
US20130036108A1 (en) Method and system for assisting users with operating network devices
CN112307292A (en) Information processing method and system based on advanced persistent threat attack
CN112395485A (en) Policy big data mining method and device, computer equipment and storage medium
CN103905434A (en) Method and device for processing network data
Smith Go Web Scraping Quick Start Guide: Implement the power of Go to scrape and crawl data from the web
CN108205548A (en) A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant